diff mbox series

[v4] remoteproc: core: do pm relax when in RPROC_OFFLINE

Message ID 1663312351-28476-1-git-send-email-quic_aiquny@quicinc.com (mailing list archive)
State Not Applicable
Headers show
Series [v4] remoteproc: core: do pm relax when in RPROC_OFFLINE | expand

Commit Message

Aiqun Yu (Maria) Sept. 16, 2022, 7:12 a.m. UTC
RPROC_OFFLINE state indicate there is no recovery process
is in progress and no chance to do the pm_relax.
Because when recovering from crash, rproc->lock is held and
state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
and then unlock rproc->lock.
When the state is in RPROC_OFFLINE it means separate request
of rproc_stop was done and no need to hold the wakeup source
in crash handler to recover any more.

Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
---
 drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Mukesh Ojha Sept. 16, 2022, 8:47 a.m. UTC | #1
Hi,

On 9/16/2022 12:42 PM, Maria Yu wrote:
> RPROC_OFFLINE state indicate there is no recovery process
> is in progress and no chance to do the pm_relax.
> Because when recovering from crash, rproc->lock is held and
> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> and then unlock rproc->lock.
> When the state is in RPROC_OFFLINE it means separate request
> of rproc_stop was done and no need to hold the wakeup source
> in crash handler to recover any more.
> 
> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> ---
>   drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index e5279ed9a8d7..6bc7b8b7d01e 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>   		/* handle only the first crash detected */
>   		mutex_unlock(&rproc->lock);
> +		/*
> +		 * RPROC_OFFLINE state indicate there is no recovery process
> +		 * is in progress and no chance to have pm_relax in place.
> +		 * Because when recovering from crash, rproc->lock is held and
> +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> +		 * and then unlock rproc->lock.
> +		 * RPROC_OFFLINE is only an intermediate state in recovery
> +		 * process.
> +		 */
> +		if (rproc->state == RPROC_OFFLINE)
> +			pm_relax(rproc->dev.parent);


LGTM.
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>

-Mukesh
>   		return;
>   	}
>
Mathieu Poirier Oct. 12, 2022, 8:43 p.m. UTC | #2
Please add what has changed from one version to another, either in a cover
letter or after the "Signed-off-by".  There are many examples on how to do that
on the mailing list.

On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> RPROC_OFFLINE state indicate there is no recovery process
> is in progress and no chance to do the pm_relax.
> Because when recovering from crash, rproc->lock is held and
> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> and then unlock rproc->lock.

You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
when rproc_trigger_recovery() returns.  If that is not the case then something
went wrong.

Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
so we know the remote processor was stopped.  Therefore if rproc->state is set
to RPROC_OFFLINE something went wrong in either request_firmware() or
rproc_start().  Either way the remote processor is offline and the system probably
in an unknown/unstable.  As such I don't see how calling pm_relax() can help
things along.  

I suggest spending time understanding what leads to the failure when recovering
from a crash and address that problem(s).

Thanks,
Mathieu


> When the state is in RPROC_OFFLINE it means separate request
> of rproc_stop was done and no need to hold the wakeup source
> in crash handler to recover any more.
> 
> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> ---
>  drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index e5279ed9a8d7..6bc7b8b7d01e 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
>  	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>  		/* handle only the first crash detected */
>  		mutex_unlock(&rproc->lock);
> +		/*
> +		 * RPROC_OFFLINE state indicate there is no recovery process
> +		 * is in progress and no chance to have pm_relax in place.
> +		 * Because when recovering from crash, rproc->lock is held and
> +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> +		 * and then unlock rproc->lock.
> +		 * RPROC_OFFLINE is only an intermediate state in recovery
> +		 * process.
> +		 */
> +		if (rproc->state == RPROC_OFFLINE)
> +			pm_relax(rproc->dev.parent);
>  		return;
>  	}
>  
> -- 
> 2.7.4
>
Aiqun Yu (Maria) Oct. 13, 2022, 1:40 a.m. UTC | #3
Hi Mathieu,

On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> Please add what has changed from one version to another, either in a cover
> letter or after the "Signed-off-by".  There are many examples on how to do that
> on the mailing list.
> 
Thx for the information, will take a note and benefit for next time.

> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>> RPROC_OFFLINE state indicate there is no recovery process
>> is in progress and no chance to do the pm_relax.
>> Because when recovering from crash, rproc->lock is held and
>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>> and then unlock rproc->lock.
> 
> You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
> when rproc_trigger_recovery() returns.  If that is not the case then something
> went wrong.
> 
> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
> so we know the remote processor was stopped.  Therefore if rproc->state is set
> to RPROC_OFFLINE something went wrong in either request_firmware() or
> rproc_start().  Either way the remote processor is offline and the system probably
> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> things along.
> 
PROC_OFFLINE is possible that rproc_shutdown is triggered and 
successfully finished.
Even if it is multi crash rproc_crash_handler_work contention issue, and 
last rproc_trigger_recovery bailed out with only 
rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
Since the subsystem may still can be recovered with customer's next 
trigger of rproc_start, and we can make each error out path clean with 
pm resources.

> I suggest spending time understanding what leads to the failure when recovering
> from a crash and address that problem(s).
> 
In current case, the customer's information is that the issue happened 
when rproc_shutdown is triggered at similar time. So not an issue from 
error out of rproc_trigger_recovery.
> Thanks,
> Mathieu
> 
> 
>> When the state is in RPROC_OFFLINE it means separate request
>> of rproc_stop was done and no need to hold the wakeup source
>> in crash handler to recover any more.
>>
>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>> ---
>>   drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>   	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>   		/* handle only the first crash detected */
>>   		mutex_unlock(&rproc->lock);
>> +		/*
>> +		 * RPROC_OFFLINE state indicate there is no recovery process
>> +		 * is in progress and no chance to have pm_relax in place.
>> +		 * Because when recovering from crash, rproc->lock is held and
>> +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>> +		 * and then unlock rproc->lock.
>> +		 * RPROC_OFFLINE is only an intermediate state in recovery
>> +		 * process.
>> +		 */
>> +		if (rproc->state == RPROC_OFFLINE)
>> +			pm_relax(rproc->dev.parent);
>>   		return;
>>   	}
>>   
>> -- 
>> 2.7.4
>>
Mathieu Poirier Oct. 13, 2022, 5:34 p.m. UTC | #4
On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> Hi Mathieu,
> 
> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > Please add what has changed from one version to another, either in a cover
> > letter or after the "Signed-off-by".  There are many examples on how to do that
> > on the mailing list.
> > 
> Thx for the information, will take a note and benefit for next time.
> 
> > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > RPROC_OFFLINE state indicate there is no recovery process
> > > is in progress and no chance to do the pm_relax.
> > > Because when recovering from crash, rproc->lock is held and
> > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > and then unlock rproc->lock.
> > 
> > You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
> > when rproc_trigger_recovery() returns.  If that is not the case then something
> > went wrong.
> > 
> > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
> > so we know the remote processor was stopped.  Therefore if rproc->state is set
> > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > rproc_start().  Either way the remote processor is offline and the system probably
> > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > things along.
> > 
> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> finished.
> Even if it is multi crash rproc_crash_handler_work contention issue, and
> last rproc_trigger_recovery bailed out with only
> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> Since the subsystem may still can be recovered with customer's next trigger
> of rproc_start, and we can make each error out path clean with pm resources.
> 
> > I suggest spending time understanding what leads to the failure when recovering
> > from a crash and address that problem(s).
> > 
> In current case, the customer's information is that the issue happened when
> rproc_shutdown is triggered at similar time. So not an issue from error out
> of rproc_trigger_recovery.

That is a very important element to consider and should have been mentioned from
the beginning.  What I see happening is the following:

rproc_report_crash()
        pm_stay_awake()
        queue_work() // current thread is suspended

rproc_shutdown()
        rproc_stop()
                rproc->state = RPROC_OFFLINE;

rproc_crash_handler_work()
        if (rproc->state == RPROC_OFFLINE)
                return // pm_relax() is not called

The right way to fix this is to add a pm_relax() in rproc_shutdown() and
rproc_detach(), along with a very descriptive comment as to why it is needed.


> > Thanks,
> > Mathieu
> > 
> > 
> > > When the state is in RPROC_OFFLINE it means separate request
> > > of rproc_stop was done and no need to hold the wakeup source
> > > in crash handler to recover any more.
> > > 
> > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > ---
> > >   drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > >   1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > >   	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > >   		/* handle only the first crash detected */
> > >   		mutex_unlock(&rproc->lock);
> > > +		/*
> > > +		 * RPROC_OFFLINE state indicate there is no recovery process
> > > +		 * is in progress and no chance to have pm_relax in place.
> > > +		 * Because when recovering from crash, rproc->lock is held and
> > > +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > +		 * and then unlock rproc->lock.
> > > +		 * RPROC_OFFLINE is only an intermediate state in recovery
> > > +		 * process.
> > > +		 */
> > > +		if (rproc->state == RPROC_OFFLINE)
> > > +			pm_relax(rproc->dev.parent);
> > >   		return;
> > >   	}
> > > -- 
> > > 2.7.4
> > > 
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Mathieu Poirier Oct. 13, 2022, 6:03 p.m. UTC | #5
On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > Hi Mathieu,
> > 
> > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > Please add what has changed from one version to another, either in a cover
> > > letter or after the "Signed-off-by".  There are many examples on how to do that
> > > on the mailing list.
> > > 
> > Thx for the information, will take a note and benefit for next time.
> > 
> > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > is in progress and no chance to do the pm_relax.
> > > > Because when recovering from crash, rproc->lock is held and
> > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > and then unlock rproc->lock.
> > > 
> > > You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
> > > when rproc_trigger_recovery() returns.  If that is not the case then something
> > > went wrong.
> > > 
> > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
> > > so we know the remote processor was stopped.  Therefore if rproc->state is set
> > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > rproc_start().  Either way the remote processor is offline and the system probably
> > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > things along.
> > > 
> > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > finished.
> > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > last rproc_trigger_recovery bailed out with only
> > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > Since the subsystem may still can be recovered with customer's next trigger
> > of rproc_start, and we can make each error out path clean with pm resources.
> > 
> > > I suggest spending time understanding what leads to the failure when recovering
> > > from a crash and address that problem(s).
> > > 
> > In current case, the customer's information is that the issue happened when
> > rproc_shutdown is triggered at similar time. So not an issue from error out
> > of rproc_trigger_recovery.
> 
> That is a very important element to consider and should have been mentioned from
> the beginning.  What I see happening is the following:
> 
> rproc_report_crash()
>         pm_stay_awake()
>         queue_work() // current thread is suspended
> 
> rproc_shutdown()
>         rproc_stop()
>                 rproc->state = RPROC_OFFLINE;
> 
> rproc_crash_handler_work()
>         if (rproc->state == RPROC_OFFLINE)
>                 return // pm_relax() is not called
> 
> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> rproc_detach(), along with a very descriptive comment as to why it is needed.

Thinking about this further there are more ramifications to consider.  Please
confirm the above scenario is what you are facing.  I will advise on how to move
forward if that is the case.

> 
> 
> > > Thanks,
> > > Mathieu
> > > 
> > > 
> > > > When the state is in RPROC_OFFLINE it means separate request
> > > > of rproc_stop was done and no need to hold the wakeup source
> > > > in crash handler to recover any more.
> > > > 
> > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > ---
> > > >   drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > >   1 file changed, 11 insertions(+)
> > > > 
> > > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > >   	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > >   		/* handle only the first crash detected */
> > > >   		mutex_unlock(&rproc->lock);
> > > > +		/*
> > > > +		 * RPROC_OFFLINE state indicate there is no recovery process
> > > > +		 * is in progress and no chance to have pm_relax in place.
> > > > +		 * Because when recovering from crash, rproc->lock is held and
> > > > +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > +		 * and then unlock rproc->lock.
> > > > +		 * RPROC_OFFLINE is only an intermediate state in recovery
> > > > +		 * process.
> > > > +		 */
> > > > +		if (rproc->state == RPROC_OFFLINE)
> > > > +			pm_relax(rproc->dev.parent);
> > > >   		return;
> > > >   	}
> > > > -- 
> > > > 2.7.4
> > > > 
> > 
> > 
> > -- 
> > Thx and BRs,
> > Aiqun(Maria) Yu
Aiqun Yu (Maria) Oct. 20, 2022, 5:52 a.m. UTC | #6
On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>> Hi Mathieu,
>>>
>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>> Please add what has changed from one version to another, either in a cover
>>>> letter or after the "Signed-off-by".  There are many examples on how to do that
>>>> on the mailing list.
>>>>
>>> Thx for the information, will take a note and benefit for next time.
>>>
>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>> is in progress and no chance to do the pm_relax.
>>>>> Because when recovering from crash, rproc->lock is held and
>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>> and then unlock rproc->lock.
>>>>
>>>> You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
>>>> when rproc_trigger_recovery() returns.  If that is not the case then something
>>>> went wrong.
>>>>
>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
>>>> so we know the remote processor was stopped.  Therefore if rproc->state is set
>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>> rproc_start().  Either way the remote processor is offline and the system probably
>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>> things along.
>>>>
>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>> finished.
>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>> last rproc_trigger_recovery bailed out with only
>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>> Since the subsystem may still can be recovered with customer's next trigger
>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>
>>>> I suggest spending time understanding what leads to the failure when recovering
>>>> from a crash and address that problem(s).
>>>>
>>> In current case, the customer's information is that the issue happened when
>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>> of rproc_trigger_recovery.
>>
>> That is a very important element to consider and should have been mentioned from
>> the beginning.  What I see happening is the following:
>>
>> rproc_report_crash()
>>          pm_stay_awake()
>>          queue_work() // current thread is suspended
>>
>> rproc_shutdown()
>>          rproc_stop()
>>                  rproc->state = RPROC_OFFLINE;
>>
>> rproc_crash_handler_work()
>>          if (rproc->state == RPROC_OFFLINE)
>>                  return // pm_relax() is not called
>>
>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>> rproc_detach(), along with a very descriptive comment as to why it is needed.
> 
> Thinking about this further there are more ramifications to consider.  Please
> confirm the above scenario is what you are facing.  I will advise on how to move
> forward if that is the case.
> 
Not sure if the situation is clear or not. So resend the email again.

The above senario is what customer is facing. crash hanppened while at 
the same time shutdown is triggered.
And the device cannto goes to suspend state after that.
the subsystem can still be start normally after this.

>>
>>
>>>> Thanks,
>>>> Mathieu
>>>>
>>>>
>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>> in crash handler to recover any more.
>>>>>
>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>> ---
>>>>>    drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>    1 file changed, 11 insertions(+)
>>>>>
>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>    	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>    		/* handle only the first crash detected */
>>>>>    		mutex_unlock(&rproc->lock);
>>>>> +		/*
>>>>> +		 * RPROC_OFFLINE state indicate there is no recovery process
>>>>> +		 * is in progress and no chance to have pm_relax in place.
>>>>> +		 * Because when recovering from crash, rproc->lock is held and
>>>>> +		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>> +		 * and then unlock rproc->lock.
>>>>> +		 * RPROC_OFFLINE is only an intermediate state in recovery
>>>>> +		 * process.
>>>>> +		 */
>>>>> +		if (rproc->state == RPROC_OFFLINE)
>>>>> +			pm_relax(rproc->dev.parent);
>>>>>    		return;
>>>>>    	}
>>>>> -- 
>>>>> 2.7.4
>>>>>
>>>
>>>
>>> -- 
>>> Thx and BRs,
>>> Aiqun(Maria) Yu
Mathieu Poirier Oct. 21, 2022, 7:34 p.m. UTC | #7
On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>
> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> >> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> >>> Hi Mathieu,
> >>>
> >>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> >>>> Please add what has changed from one version to another, either in a cover
> >>>> letter or after the "Signed-off-by".  There are many examples on how to do that
> >>>> on the mailing list.
> >>>>
> >>> Thx for the information, will take a note and benefit for next time.
> >>>
> >>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> >>>>> RPROC_OFFLINE state indicate there is no recovery process
> >>>>> is in progress and no chance to do the pm_relax.
> >>>>> Because when recovering from crash, rproc->lock is held and
> >>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>> and then unlock rproc->lock.
> >>>>
> >>>> You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
> >>>> when rproc_trigger_recovery() returns.  If that is not the case then something
> >>>> went wrong.
> >>>>
> >>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
> >>>> so we know the remote processor was stopped.  Therefore if rproc->state is set
> >>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
> >>>> rproc_start().  Either way the remote processor is offline and the system probably
> >>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> >>>> things along.
> >>>>
> >>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> >>> finished.
> >>> Even if it is multi crash rproc_crash_handler_work contention issue, and
> >>> last rproc_trigger_recovery bailed out with only
> >>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> >>> Since the subsystem may still can be recovered with customer's next trigger
> >>> of rproc_start, and we can make each error out path clean with pm resources.
> >>>
> >>>> I suggest spending time understanding what leads to the failure when recovering
> >>>> from a crash and address that problem(s).
> >>>>
> >>> In current case, the customer's information is that the issue happened when
> >>> rproc_shutdown is triggered at similar time. So not an issue from error out
> >>> of rproc_trigger_recovery.
> >>
> >> That is a very important element to consider and should have been mentioned from
> >> the beginning.  What I see happening is the following:
> >>
> >> rproc_report_crash()
> >>          pm_stay_awake()
> >>          queue_work() // current thread is suspended
> >>
> >> rproc_shutdown()
> >>          rproc_stop()
> >>                  rproc->state = RPROC_OFFLINE;
> >>
> >> rproc_crash_handler_work()
> >>          if (rproc->state == RPROC_OFFLINE)
> >>                  return // pm_relax() is not called
> >>
> >> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> >> rproc_detach(), along with a very descriptive comment as to why it is needed.
> >
> > Thinking about this further there are more ramifications to consider.  Please
> > confirm the above scenario is what you are facing.  I will advise on how to move
> > forward if that is the case.
> >
> Not sure if the situation is clear or not. So resend the email again.
>
> The above senario is what customer is facing. crash hanppened while at
> the same time shutdown is triggered.

Unfortunately this is not enough details to address a problem as
complex as this one.

> And the device cannto goes to suspend state after that.
> the subsystem can still be start normally after this.

If the code flow I pasted above reflects the problem at hand, the
current patch will not be sufficient to address the issue.  If Arnaud
confirms my suspicions we will have to think about a better solution.

>
> >>
> >>
> >>>> Thanks,
> >>>> Mathieu
> >>>>
> >>>>
> >>>>> When the state is in RPROC_OFFLINE it means separate request
> >>>>> of rproc_stop was done and no need to hold the wakeup source
> >>>>> in crash handler to recover any more.
> >>>>>
> >>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> >>>>> ---
> >>>>>    drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> >>>>>    1 file changed, 11 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> >>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
> >>>>> --- a/drivers/remoteproc/remoteproc_core.c
> >>>>> +++ b/drivers/remoteproc/remoteproc_core.c
> >>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
> >>>>>           if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> >>>>>                   /* handle only the first crash detected */
> >>>>>                   mutex_unlock(&rproc->lock);
> >>>>> +         /*
> >>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
> >>>>> +          * is in progress and no chance to have pm_relax in place.
> >>>>> +          * Because when recovering from crash, rproc->lock is held and
> >>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>> +          * and then unlock rproc->lock.
> >>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
> >>>>> +          * process.
> >>>>> +          */
> >>>>> +         if (rproc->state == RPROC_OFFLINE)
> >>>>> +                 pm_relax(rproc->dev.parent);
> >>>>>                   return;
> >>>>>           }
> >>>>> --
> >>>>> 2.7.4
> >>>>>
> >>>
> >>>
> >>> --
> >>> Thx and BRs,
> >>> Aiqun(Maria) Yu
>
>
> --
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Oct. 24, 2022, 3:17 a.m. UTC | #8
On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>
>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>> Hi Mathieu,
>>>>>
>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>> letter or after the "Signed-off-by".  There are many examples on how to do that
>>>>>> on the mailing list.
>>>>>>
>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>
>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>> and then unlock rproc->lock.
>>>>>>
>>>>>> You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then something
>>>>>> went wrong.
>>>>>>
>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state is set
>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>> rproc_start().  Either way the remote processor is offline and the system probably
>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>> things along.
>>>>>>
>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>> finished.
>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>> last rproc_trigger_recovery bailed out with only
>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>
>>>>>> I suggest spending time understanding what leads to the failure when recovering
>>>>>> from a crash and address that problem(s).
>>>>>>
>>>>> In current case, the customer's information is that the issue happened when
>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>> of rproc_trigger_recovery.
>>>>
>>>> That is a very important element to consider and should have been mentioned from
>>>> the beginning.  What I see happening is the following:
>>>>
>>>> rproc_report_crash()
>>>>           pm_stay_awake()
>>>>           queue_work() // current thread is suspended
>>>>
>>>> rproc_shutdown()
>>>>           rproc_stop()
>>>>                   rproc->state = RPROC_OFFLINE;
>>>>
>>>> rproc_crash_handler_work()
>>>>           if (rproc->state == RPROC_OFFLINE)
>>>>                   return // pm_relax() is not called
>>>>
>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>
>>> Thinking about this further there are more ramifications to consider.  Please
>>> confirm the above scenario is what you are facing.  I will advise on how to move
>>> forward if that is the case.
>>>
>> Not sure if the situation is clear or not. So resend the email again.
>>
>> The above senario is what customer is facing. crash hanppened while at
>> the same time shutdown is triggered.
> 
> Unfortunately this is not enough details to address a problem as
> complex as this one.
> 
>> And the device cannto goes to suspend state after that.
>> the subsystem can still be start normally after this.
> 
> If the code flow I pasted above reflects the problem at hand, the
> current patch will not be sufficient to address the issue.  If Arnaud
> confirms my suspicions we will have to think about a better solution.
>

Hi Mathiew,

Could you pls have more details of any side effects other then power 
issue of the current senario?
Why the current patch is not sufficient pls?


Have the current senario in details with rproc->lock information in details:

| subsystem crashed interrupt issued      | user trigger shutdown
| rproc_report_crash()                    |
|          pm_stay_awake()                |
|          queue_work()                   |
|                                         |rproc_shutdown
|                                         |mutex_lock(&rproc->lock);
|                                         |rproc_stop()
|rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
|                                         |mutex_unlock(&rproc->lock);
|mutex_lock(&rproc->lock);                |
|if (rproc->state == RPROC_OFFLINE)       |
|return // pm_relax() is not called       |rproc_boot
|mutex_unlock(&rproc->lock);              |
|                                         |mutex_lock(&rproc->lock);
|                                         |rproc_start()
|                                         |mutex_unlock(&rproc->lock);


>>
>>>>
>>>>
>>>>>> Thanks,
>>>>>> Mathieu
>>>>>>
>>>>>>
>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>> in crash handler to recover any more.
>>>>>>>
>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>> ---
>>>>>>>     drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>     1 file changed, 11 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>            if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>                    /* handle only the first crash detected */
>>>>>>>                    mutex_unlock(&rproc->lock);
>>>>>>> +         /*
>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>> +          * and then unlock rproc->lock.
>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>> +          * process.
>>>>>>> +          */
>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>                    return;
>>>>>>>            }
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thx and BRs,
>>>>> Aiqun(Maria) Yu
>>
>>
>> --
>> Thx and BRs,
>> Aiqun(Maria) Yu
Arnaud POULIQUEN Oct. 28, 2022, 3:31 p.m. UTC | #9
Hi,

On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>
>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>> Hi Mathieu,
>>>>>>
>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>> do that
>>>>>>> on the mailing list.
>>>>>>>
>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>
>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>> and then unlock rproc->lock.
>>>>>>>
>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>> RPROC_RUNNING
>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>> something
>>>>>>> went wrong.
>>>>>>>
>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>> returning,
>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>> is set
>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>> probably
>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>> things along.
>>>>>>>
>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>> finished.
>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>
>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>> recovering
>>>>>>> from a crash and address that problem(s).
>>>>>>>
>>>>>> In current case, the customer's information is that the issue happened when
>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>> of rproc_trigger_recovery.
>>>>>
>>>>> That is a very important element to consider and should have been mentioned
>>>>> from
>>>>> the beginning.  What I see happening is the following:
>>>>>
>>>>> rproc_report_crash()
>>>>>           pm_stay_awake()
>>>>>           queue_work() // current thread is suspended
>>>>>
>>>>> rproc_shutdown()
>>>>>           rproc_stop()
>>>>>                   rproc->state = RPROC_OFFLINE;
>>>>>
>>>>> rproc_crash_handler_work()
>>>>>           if (rproc->state == RPROC_OFFLINE)
>>>>>                   return // pm_relax() is not called
>>>>>
>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>
>>>> Thinking about this further there are more ramifications to consider.  Please
>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>> move
>>>> forward if that is the case.
>>>>
>>> Not sure if the situation is clear or not. So resend the email again.
>>>
>>> The above senario is what customer is facing. crash hanppened while at
>>> the same time shutdown is triggered.
>>
>> Unfortunately this is not enough details to address a problem as
>> complex as this one.
>>
>>> And the device cannto goes to suspend state after that.
>>> the subsystem can still be start normally after this.
>>
>> If the code flow I pasted above reflects the problem at hand, the
>> current patch will not be sufficient to address the issue.  If Arnaud
>> confirms my suspicions we will have to think about a better solution.
>>
> 
> Hi Mathiew,
> 
> Could you pls have more details of any side effects other then power issue of
> the current senario?
> Why the current patch is not sufficient pls?
> 
> 
> Have the current senario in details with rproc->lock information in details:
> 
> | subsystem crashed interrupt issued      | user trigger shutdown
> | rproc_report_crash()                    |
> |          pm_stay_awake()                |
> |          queue_work()                   |
> |                                         |rproc_shutdown
> |                                         |mutex_lock(&rproc->lock);
> |                                         |rproc_stop()
> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> |                                         |mutex_unlock(&rproc->lock);
> |mutex_lock(&rproc->lock);                |
> |if (rproc->state == RPROC_OFFLINE)       |
> |return // pm_relax() is not called       |rproc_boot
> |mutex_unlock(&rproc->lock);              |
> |                                         |mutex_lock(&rproc->lock);
> |                                         |rproc_start()
> |                                         |mutex_unlock(&rproc->lock);
> 
> 

Agree with Mathieu, this is not so simple.

Here is my view  hoping I haven't missed a point in your discussion or
an other corner cases.

I tried to analyze the issues (in what follows, the term "condition" means
the "if" condition in which Aiqun proposes to add the fix) :

I can see 4 use cases with race condition

1) crash report while already one is treated (rproc_boot_recovery called) 
     => not a real use case as if the remote processor is crashed we
	      should not have a second crash report
 	
2) rproc_stop executed between the queuing of the crash work and the call of
  rproc_crash_handler_work
   => rproc->state = RPROC_OFFLINE
   => we enter in the "condition" and the pm_relax has to be called
   => This commit fix should solve this use case

3) rproc_detach executed between the queue of the crash work and the call of
  rproc_crash_handler_work
   => rproc->state = RPROC_DETACHED;
   => we don't go in "the condition" and issue because the recovery reattach
      to the remote processor
   => but pm_relax is called
   => probably need an extra fix to avoid to re-attach

4) crash report while already one is treated (rproc_attach_recovery called)
   this one corresponds to an auto reboot of the remote processor, with a
   new crash
   => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
   4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
	=> should call pm_relax if rproc->recovery_disabled = true
	=> commit does not work for this use case

   4)b) rproc->state = RPROC_DETACHED if recovery fails
       => error case with an unstable state
       => how to differentiate it from the use case 3) ?
       => introduce a RPROC_RECOVERY_FAIL state?		


Then pm_stay_awake is called when the crash work is queued.
It seems to me coherent to call the pm_relax in the work handler.



Here is a quick and dirty patch (not tested) that should take into account the 
main use cases ( except 1) and 4)b) )

@@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
 
 	mutex_lock(&rproc->lock);
 
-	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
+	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
+	    rproc->state == RPROC_DETACHED) {
 		/* handle only the first crash detected */
+
+		/*
+		 * call pm-relax in following use cases:
+		 * - the remote processor has been stopped by the user
+		 * - the remote processor is detached
+		 + - the remote proc has an autonomous reset but recovery_disabled is true. 
+		 */
+		if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
+			pm_relax(rproc->dev.parent);
 		mutex_unlock(&rproc->lock);
 		return;
 	}

Regards,
Arnaud

>>>
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Mathieu
>>>>>>>
>>>>>>>
>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>> in crash handler to recover any more.
>>>>>>>>
>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>> ---
>>>>>>>>     drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>     1 file changed, 11 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>            if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>                    /* handle only the first crash detected */
>>>>>>>>                    mutex_unlock(&rproc->lock);
>>>>>>>> +         /*
>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>> +          * process.
>>>>>>>> +          */
>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>                    return;
>>>>>>>>            }
>>>>>>>> -- 
>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Thx and BRs,
>>>>>> Aiqun(Maria) Yu
>>>
>>>
>>> -- 
>>> Thx and BRs,
>>> Aiqun(Maria) Yu
> 
>
Aiqun Yu (Maria) Oct. 31, 2022, 1:08 a.m. UTC | #10
Hi,

Thx for the detailed comments, here is my updated comments in below as well:

For the contention of export function of rproc_detach and crash 
interrupt handler is never take condition and well handlered in current 
driver.

If this condition wanted to be take into consideration, what about we 
have an extra change to support rproc_detach and crash contention senarios?

other use cases should be already well handler per my understanding.
Let me know if there is any mis-understanding.


On 10/28/2022 11:31 PM, Arnaud POULIQUEN wrote:
> Hi,
> 
> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>
>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>> Hi Mathieu,
>>>>>>>
>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>> do that
>>>>>>>> on the mailing list.
>>>>>>>>
>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>
>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>
>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>> RPROC_RUNNING
>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>> something
>>>>>>>> went wrong.
>>>>>>>>
>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>> returning,
>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>> is set
>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>> probably
>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>> things along.
>>>>>>>>
>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>> finished.
>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>
>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>> recovering
>>>>>>>> from a crash and address that problem(s).
>>>>>>>>
>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>> of rproc_trigger_recovery.
>>>>>>
>>>>>> That is a very important element to consider and should have been mentioned
>>>>>> from
>>>>>> the beginning.  What I see happening is the following:
>>>>>>
>>>>>> rproc_report_crash()
>>>>>>            pm_stay_awake()
>>>>>>            queue_work() // current thread is suspended
>>>>>>
>>>>>> rproc_shutdown()
>>>>>>            rproc_stop()
>>>>>>                    rproc->state = RPROC_OFFLINE;
>>>>>>
>>>>>> rproc_crash_handler_work()
>>>>>>            if (rproc->state == RPROC_OFFLINE)
>>>>>>                    return // pm_relax() is not called
>>>>>>
>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>
>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>> move
>>>>> forward if that is the case.
>>>>>
>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>
>>>> The above senario is what customer is facing. crash hanppened while at
>>>> the same time shutdown is triggered.
>>>
>>> Unfortunately this is not enough details to address a problem as
>>> complex as this one.
>>>
>>>> And the device cannto goes to suspend state after that.
>>>> the subsystem can still be start normally after this.
>>>
>>> If the code flow I pasted above reflects the problem at hand, the
>>> current patch will not be sufficient to address the issue.  If Arnaud
>>> confirms my suspicions we will have to think about a better solution.
>>>
>>
>> Hi Mathiew,
>>
>> Could you pls have more details of any side effects other then power issue of
>> the current senario?
>> Why the current patch is not sufficient pls?
>>
>>
>> Have the current senario in details with rproc->lock information in details:
>>
>> | subsystem crashed interrupt issued      | user trigger shutdown
>> | rproc_report_crash()                    |
>> |          pm_stay_awake()                |
>> |          queue_work()                   |
>> |                                         |rproc_shutdown
>> |                                         |mutex_lock(&rproc->lock);
>> |                                         |rproc_stop()
>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>> |                                         |mutex_unlock(&rproc->lock);
>> |mutex_lock(&rproc->lock);                |
>> |if (rproc->state == RPROC_OFFLINE)       |
>> |return // pm_relax() is not called       |rproc_boot
>> |mutex_unlock(&rproc->lock);              |
>> |                                         |mutex_lock(&rproc->lock);
>> |                                         |rproc_start()
>> |                                         |mutex_unlock(&rproc->lock);
>>
>>
> 
> Agree with Mathieu, this is not so simple.
> 
> Here is my view  hoping I haven't missed a point in your discussion or
> an other corner cases.
> 
> I tried to analyze the issues (in what follows, the term "condition" means
> the "if" condition in which Aiqun proposes to add the fix) :
> 
> I can see 4 use cases with race condition
> 
> 1) crash report while already one is treated (rproc_boot_recovery called)
>       => not a real use case as if the remote processor is crashed we
> 	      should not have a second crash report
>   	
The second crash report can be avoid via the current rproc->state check 
within the protection of rproc->lock.
> 2) rproc_stop executed between the queuing of the crash work and the call of
>    rproc_crash_handler_work
>     => rproc->state = RPROC_OFFLINE
>     => we enter in the "condition" and the pm_relax has to be called
>     => This commit fix should solve this use case
> 
> 3) rproc_detach executed between the queue of the crash work and the call of
>    rproc_crash_handler_work
>     => rproc->state = RPROC_DETACHED;
>     => we don't go in "the condition" and issue because the recovery reattach
>        to the remote processor
>     => but pm_relax is called
>     => probably need an extra fix to avoid to re-attach
> 
The RPROC_DETACHED state is never considered with crash contention in 
current driver.
if the subsystem is in RPROC_DETACHED state while it still received a 
crash intterupt, should it ignore the current crash interrupt or 
continue to do a rproc_boot and continue to attach?

> 4) crash report while already one is treated (rproc_attach_recovery called)
>     this one corresponds to an auto reboot of the remote processor, with a
>     new crash
>     => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>     4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> 	=> should call pm_relax if rproc->recovery_disabled = true
> 	=> commit does not work for this use case
> 
In the recovery_disabled==true case rproc_crash_handler_work will call 
pm_relax directly.
The senario was well handled.
>     4)b) rproc->state = RPROC_DETACHED if recovery fails
>         => error case with an unstable state
>         => how to differentiate it from the use case 3) ?
>         => introduce a RPROC_RECOVERY_FAIL state?		
> 
For every recvery fails senario, function rproc_crash_handler_work will 
call pm_relax always.
The senario was well handled.
> 
> Then pm_stay_awake is called when the crash work is queued.
> It seems to me coherent to call the pm_relax in the work handler.
> 
> 
> 
> Here is a quick and dirty patch (not tested) that should take into account the
> main use cases ( except 1) and 4)b) )
> 
> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>   
>   	mutex_lock(&rproc->lock);
>   
> -	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> +	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> +	    rproc->state == RPROC_DETACHED) {
>   		/* handle only the first crash detected */
> +
> +		/*
> +		 * call pm-relax in following use cases:
> +		 * - the remote processor has been stopped by the user
> +		 * - the remote processor is detached
> +		 + - the remote proc has an autonomous reset but recovery_disabled is true.
> +		 */
> +		if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> +			pm_relax(rproc->dev.parent);
>   		mutex_unlock(&rproc->lock);
>   		return;
>   	}
> 
> Regards,
> Arnaud
> 
>>>>
>>>>>>
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mathieu
>>>>>>>>
>>>>>>>>
>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>> ---
>>>>>>>>>      drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>      1 file changed, 11 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>>             if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>                     /* handle only the first crash detected */
>>>>>>>>>                     mutex_unlock(&rproc->lock);
>>>>>>>>> +         /*
>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>> +          * process.
>>>>>>>>> +          */
>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>                     return;
>>>>>>>>>             }
>>>>>>>>> -- 
>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Thx and BRs,
>>>>>>> Aiqun(Maria) Yu
>>>>
>>>>
>>>> -- 
>>>> Thx and BRs,
>>>> Aiqun(Maria) Yu
>>
>>
Mathieu Poirier Nov. 1, 2022, 8:11 p.m. UTC | #11
On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
<arnaud.pouliquen@foss.st.com> wrote:
>
> Hi,
>
> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> >> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> >>>
> >>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> >>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> >>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> >>>>>> Hi Mathieu,
> >>>>>>
> >>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> >>>>>>> Please add what has changed from one version to another, either in a cover
> >>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
> >>>>>>> do that
> >>>>>>> on the mailing list.
> >>>>>>>
> >>>>>> Thx for the information, will take a note and benefit for next time.
> >>>>>>
> >>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> >>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
> >>>>>>>> is in progress and no chance to do the pm_relax.
> >>>>>>>> Because when recovering from crash, rproc->lock is held and
> >>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>>>>> and then unlock rproc->lock.
> >>>>>>>
> >>>>>>> You are correct - because the lock is held rproc->state should be set to
> >>>>>>> RPROC_RUNNING
> >>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
> >>>>>>> something
> >>>>>>> went wrong.
> >>>>>>>
> >>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> >>>>>>> returning,
> >>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
> >>>>>>> is set
> >>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
> >>>>>>> rproc_start().  Either way the remote processor is offline and the system
> >>>>>>> probably
> >>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> >>>>>>> things along.
> >>>>>>>
> >>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> >>>>>> finished.
> >>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
> >>>>>> last rproc_trigger_recovery bailed out with only
> >>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> >>>>>> Since the subsystem may still can be recovered with customer's next trigger
> >>>>>> of rproc_start, and we can make each error out path clean with pm resources.
> >>>>>>
> >>>>>>> I suggest spending time understanding what leads to the failure when
> >>>>>>> recovering
> >>>>>>> from a crash and address that problem(s).
> >>>>>>>
> >>>>>> In current case, the customer's information is that the issue happened when
> >>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
> >>>>>> of rproc_trigger_recovery.
> >>>>>
> >>>>> That is a very important element to consider and should have been mentioned
> >>>>> from
> >>>>> the beginning.  What I see happening is the following:
> >>>>>
> >>>>> rproc_report_crash()
> >>>>>           pm_stay_awake()
> >>>>>           queue_work() // current thread is suspended
> >>>>>
> >>>>> rproc_shutdown()
> >>>>>           rproc_stop()
> >>>>>                   rproc->state = RPROC_OFFLINE;
> >>>>>
> >>>>> rproc_crash_handler_work()
> >>>>>           if (rproc->state == RPROC_OFFLINE)
> >>>>>                   return // pm_relax() is not called
> >>>>>
> >>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> >>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
> >>>>
> >>>> Thinking about this further there are more ramifications to consider.  Please
> >>>> confirm the above scenario is what you are facing.  I will advise on how to
> >>>> move
> >>>> forward if that is the case.
> >>>>
> >>> Not sure if the situation is clear or not. So resend the email again.
> >>>
> >>> The above senario is what customer is facing. crash hanppened while at
> >>> the same time shutdown is triggered.
> >>
> >> Unfortunately this is not enough details to address a problem as
> >> complex as this one.
> >>
> >>> And the device cannto goes to suspend state after that.
> >>> the subsystem can still be start normally after this.
> >>
> >> If the code flow I pasted above reflects the problem at hand, the
> >> current patch will not be sufficient to address the issue.  If Arnaud
> >> confirms my suspicions we will have to think about a better solution.
> >>
> >
> > Hi Mathiew,
> >
> > Could you pls have more details of any side effects other then power issue of
> > the current senario?
> > Why the current patch is not sufficient pls?
> >
> >
> > Have the current senario in details with rproc->lock information in details:
> >
> > | subsystem crashed interrupt issued      | user trigger shutdown
> > | rproc_report_crash()                    |
> > |          pm_stay_awake()                |
> > |          queue_work()                   |
> > |                                         |rproc_shutdown
> > |                                         |mutex_lock(&rproc->lock);
> > |                                         |rproc_stop()
> > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > |                                         |mutex_unlock(&rproc->lock);
> > |mutex_lock(&rproc->lock);                |
> > |if (rproc->state == RPROC_OFFLINE)       |
> > |return // pm_relax() is not called       |rproc_boot
> > |mutex_unlock(&rproc->lock);              |
> > |                                         |mutex_lock(&rproc->lock);
> > |                                         |rproc_start()
> > |                                         |mutex_unlock(&rproc->lock);
> >
> >
>
> Agree with Mathieu, this is not so simple.
>

Thanks for looking into this.

> Here is my view  hoping I haven't missed a point in your discussion or
> an other corner cases.
>
> I tried to analyze the issues (in what follows, the term "condition" means
> the "if" condition in which Aiqun proposes to add the fix) :
>
> I can see 4 use cases with race condition
>
> 1) crash report while already one is treated (rproc_boot_recovery called)
>      => not a real use case as if the remote processor is crashed we
>               should not have a second crash report
>

That part is of great concern to me.  *Theoretically* we should not
get a new crash report while one has already been dispatched but the
current code accounts for this scenario and as such the possibility
can't be dismissed.  Therefore we need to expect rproc_report_crash()
to be called multiple times before a single instance of
rproc_boot_recovery() is scheduled.

> 2) rproc_stop executed between the queuing of the crash work and the call of
>   rproc_crash_handler_work
>    => rproc->state = RPROC_OFFLINE
>    => we enter in the "condition" and the pm_relax has to be called
>    => This commit fix should solve this use case
>
> 3) rproc_detach executed between the queue of the crash work and the call of
>   rproc_crash_handler_work
>    => rproc->state = RPROC_DETACHED;
>    => we don't go in "the condition" and issue because the recovery reattach
>       to the remote processor
>    => but pm_relax is called
>    => probably need an extra fix to avoid to re-attach
>
> 4) crash report while already one is treated (rproc_attach_recovery called)
>    this one corresponds to an auto reboot of the remote processor, with a
>    new crash
>    => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>    4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>         => should call pm_relax if rproc->recovery_disabled = true
>         => commit does not work for this use case
>
>    4)b) rproc->state = RPROC_DETACHED if recovery fails
>        => error case with an unstable state
>        => how to differentiate it from the use case 3) ?
>        => introduce a RPROC_RECOVERY_FAIL state?
>

The case where a recovery fails needs to be considered and is the
reason the original patch doesn't work.  Right now in
rproc_crash_handler_work(), it is not possible to differentiate
between a legitimate shutdown request (scenario #2 above) and a
recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
would greatly simplify things.

My initial evaluation had not considered the attach/detach scenarios -
thanks for adding that in the mix.

Aiqun, please send a new patchset that adds a new remote processor
state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
that set that takes attach/detach scenarios into account.  The code
between the v6.0 and v6.1 cycle has changed a lot in that area so make
sure to properly rebase.

>
> Then pm_stay_awake is called when the crash work is queued.
> It seems to me coherent to call the pm_relax in the work handler.
>
>
>
> Here is a quick and dirty patch (not tested) that should take into account the
> main use cases ( except 1) and 4)b) )
>
> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>
>         mutex_lock(&rproc->lock);
>
> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> +           rproc->state == RPROC_DETACHED) {
>                 /* handle only the first crash detected */
> +
> +               /*
> +                * call pm-relax in following use cases:
> +                * - the remote processor has been stopped by the user
> +                * - the remote processor is detached
> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> +                */
> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> +                       pm_relax(rproc->dev.parent);
>                 mutex_unlock(&rproc->lock);
>                 return;
>         }
>
> Regards,
> Arnaud
>
> >>>
> >>>>>
> >>>>>
> >>>>>>> Thanks,
> >>>>>>> Mathieu
> >>>>>>>
> >>>>>>>
> >>>>>>>> When the state is in RPROC_OFFLINE it means separate request
> >>>>>>>> of rproc_stop was done and no need to hold the wakeup source
> >>>>>>>> in crash handler to recover any more.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> >>>>>>>> ---
> >>>>>>>>     drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> >>>>>>>>     1 file changed, 11 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
> >>>>>>>> b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
> >>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
> >>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> >>>>>>>> work_struct *work)
> >>>>>>>>            if (rproc->state == RPROC_CRASHED || rproc->state ==
> >>>>>>>> RPROC_OFFLINE) {
> >>>>>>>>                    /* handle only the first crash detected */
> >>>>>>>>                    mutex_unlock(&rproc->lock);
> >>>>>>>> +         /*
> >>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
> >>>>>>>> +          * is in progress and no chance to have pm_relax in place.
> >>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
> >>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>>>>> +          * and then unlock rproc->lock.
> >>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
> >>>>>>>> +          * process.
> >>>>>>>> +          */
> >>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
> >>>>>>>> +                 pm_relax(rproc->dev.parent);
> >>>>>>>>                    return;
> >>>>>>>>            }
> >>>>>>>> --
> >>>>>>>> 2.7.4
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Thx and BRs,
> >>>>>> Aiqun(Maria) Yu
> >>>
> >>>
> >>> --
> >>> Thx and BRs,
> >>> Aiqun(Maria) Yu
> >
> >
Aiqun Yu (Maria) Nov. 2, 2022, 10:53 a.m. UTC | #12
Hi,

Let me think about this carefully.

When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process 
again or just leave the pm_relax?

recovery fail case 1:
|                                      |firstcrash interrupt issued
| second crashed interrupt issued      | rproc_report_crash()
| rproc_report_crash()                 |          pm_stay_awake()
|          pm_stay_awake()             |          queue_work()
|          queue_work()                |rproc_crash_handler_work()
|                                      |mutex_lock(&rproc->lock);
|                                      |rproc_stop()
|rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
|                                      |RPROC_RECOVERY_FAIL //new
|                                      |mutex_unlock(&rproc->lock);
|mutex_lock(&rproc->lock);             |pm_relax()
|if (rproc->state == RPROC_OFFLINE)    |
|return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
|mutex_unlock(&rproc->lock);           |
|                                      |
|                                      |
|                                      |

recovery fail case 2:
|                                      |firstcrash interrupt issued
|                                      | rproc_report_crash()
|                                      |          pm_stay_awake()
|                                      |          queue_work()
|                                      |rproc_crash_handler_work()
|                                      |mutex_lock(&rproc->lock);
|                                      |rproc_stop()
|                                      |rproc->state = RPROC_OFFLINE;
|                                      |RPROC_RECOVERY_FAIL //new
|                                      |mutex_unlock(&rproc->lock);
|                                      |pm_relax()
|
| second crashed interrupt issued      |
| rproc_report_crash()                 |
|          pm_stay_awake()             |
|          queue_work()                |
|pm_stay_awake()
|mutex_lock(&rproc->lock);
|if (rproc->state == RPROC_OFFLINE)    |
|return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
|mutex_unlock(&rproc->lock);           |
|                                      |
|                                      |
|                                      |

Maybe I can have:
1. the pm_stay_awake and pm_relax with count based and call with paired 
for fix current concurency issue.
2. RPROC_RECOVERY_FAIL can be another patch for continue try to do 
recovery work.
3. handle RPROC_DETACHED case.

On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> <arnaud.pouliquen@foss.st.com> wrote:
>>
>> Hi,
>>
>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>
>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>> Hi Mathieu,
>>>>>>>>
>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>> do that
>>>>>>>>> on the mailing list.
>>>>>>>>>
>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>
>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>> RPROC_RUNNING
>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>> something
>>>>>>>>> went wrong.
>>>>>>>>>
>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>> returning,
>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>> is set
>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>> probably
>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>> things along.
>>>>>>>>>
>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>> finished.
>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>
>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>> recovering
>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>
>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>> of rproc_trigger_recovery.
>>>>>>>
>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>> from
>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>
>>>>>>> rproc_report_crash()
>>>>>>>            pm_stay_awake()
>>>>>>>            queue_work() // current thread is suspended
>>>>>>>
>>>>>>> rproc_shutdown()
>>>>>>>            rproc_stop()
>>>>>>>                    rproc->state = RPROC_OFFLINE;
>>>>>>>
>>>>>>> rproc_crash_handler_work()
>>>>>>>            if (rproc->state == RPROC_OFFLINE)
>>>>>>>                    return // pm_relax() is not called
>>>>>>>
>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>
>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>> move
>>>>>> forward if that is the case.
>>>>>>
>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>
>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>> the same time shutdown is triggered.
>>>>
>>>> Unfortunately this is not enough details to address a problem as
>>>> complex as this one.
>>>>
>>>>> And the device cannto goes to suspend state after that.
>>>>> the subsystem can still be start normally after this.
>>>>
>>>> If the code flow I pasted above reflects the problem at hand, the
>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>> confirms my suspicions we will have to think about a better solution.
>>>>
>>>
>>> Hi Mathiew,
>>>
>>> Could you pls have more details of any side effects other then power issue of
>>> the current senario?
>>> Why the current patch is not sufficient pls?
>>>
>>>
>>> Have the current senario in details with rproc->lock information in details:
>>>
>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>> | rproc_report_crash()                    |
>>> |          pm_stay_awake()                |
>>> |          queue_work()                   |
>>> |                                         |rproc_shutdown
>>> |                                         |mutex_lock(&rproc->lock);
>>> |                                         |rproc_stop()
>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>> |                                         |mutex_unlock(&rproc->lock);
>>> |mutex_lock(&rproc->lock);                |
>>> |if (rproc->state == RPROC_OFFLINE)       |
>>> |return // pm_relax() is not called       |rproc_boot
>>> |mutex_unlock(&rproc->lock);              |
>>> |                                         |mutex_lock(&rproc->lock);
>>> |                                         |rproc_start()
>>> |                                         |mutex_unlock(&rproc->lock);
>>>
>>>
>>
>> Agree with Mathieu, this is not so simple.
>>
> 
> Thanks for looking into this.
> 
>> Here is my view  hoping I haven't missed a point in your discussion or
>> an other corner cases.
>>
>> I tried to analyze the issues (in what follows, the term "condition" means
>> the "if" condition in which Aiqun proposes to add the fix) :
>>
>> I can see 4 use cases with race condition
>>
>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>       => not a real use case as if the remote processor is crashed we
>>                should not have a second crash report
>>
> 
> That part is of great concern to me.  *Theoretically* we should not
> get a new crash report while one has already been dispatched but the
> current code accounts for this scenario and as such the possibility
> can't be dismissed.  Therefore we need to expect rproc_report_crash()
> to be called multiple times before a single instance of
> rproc_boot_recovery() is scheduled.


> 
>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>    rproc_crash_handler_work
>>     => rproc->state = RPROC_OFFLINE
>>     => we enter in the "condition" and the pm_relax has to be called
>>     => This commit fix should solve this use case
>>
>> 3) rproc_detach executed between the queue of the crash work and the call of
>>    rproc_crash_handler_work
>>     => rproc->state = RPROC_DETACHED;
>>     => we don't go in "the condition" and issue because the recovery reattach
>>        to the remote processor
>>     => but pm_relax is called
>>     => probably need an extra fix to avoid to re-attach
>>
>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>     this one corresponds to an auto reboot of the remote processor, with a
>>     new crash
>>     => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>     4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>          => should call pm_relax if rproc->recovery_disabled = true
>>          => commit does not work for this use case
>>
>>     4)b) rproc->state = RPROC_DETACHED if recovery fails
>>         => error case with an unstable state
>>         => how to differentiate it from the use case 3) ?
>>         => introduce a RPROC_RECOVERY_FAIL state?
>>
> 
> The case where a recovery fails needs to be considered and is the
> reason the original patch doesn't work.  Right now in
> rproc_crash_handler_work(), it is not possible to differentiate
> between a legitimate shutdown request (scenario #2 above) and a
> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> would greatly simplify things.
> 
> My initial evaluation had not considered the attach/detach scenarios -
> thanks for adding that in the mix.
> 
> Aiqun, please send a new patchset that adds a new remote processor
> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> that set that takes attach/detach scenarios into account.  The code
> between the v6.0 and v6.1 cycle has changed a lot in that area so make
> sure to properly rebase.
> 
I will try.

>>
>> Then pm_stay_awake is called when the crash work is queued.
>> It seems to me coherent to call the pm_relax in the work handler.
>>
>>
>>
>> Here is a quick and dirty patch (not tested) that should take into account the
>> main use cases ( except 1) and 4)b) )
>>
>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>
>>          mutex_lock(&rproc->lock);
>>
>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>> +           rproc->state == RPROC_DETACHED) {
>>                  /* handle only the first crash detected */
>> +
>> +               /*
>> +                * call pm-relax in following use cases:
>> +                * - the remote processor has been stopped by the user
>> +                * - the remote processor is detached
>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>> +                */
>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>> +                       pm_relax(rproc->dev.parent);
>>                  mutex_unlock(&rproc->lock);
>>                  return;
>>          }
>>
>> Regards,
>> Arnaud
>>
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mathieu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>> ---
>>>>>>>>>>      drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>      1 file changed, 11 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>> work_struct *work)
>>>>>>>>>>             if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>                     /* handle only the first crash detected */
>>>>>>>>>>                     mutex_unlock(&rproc->lock);
>>>>>>>>>> +         /*
>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>> +          * process.
>>>>>>>>>> +          */
>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>                     return;
>>>>>>>>>>             }
>>>>>>>>>> --
>>>>>>>>>> 2.7.4
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thx and BRs,
>>>>>>>> Aiqun(Maria) Yu
>>>>>
>>>>>
>>>>> --
>>>>> Thx and BRs,
>>>>> Aiqun(Maria) Yu
>>>
>>>
Mathieu Poirier Nov. 2, 2022, 6:03 p.m. UTC | #13
On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> Hi,
> 
> Let me think about this carefully.
> 
> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> or just leave the pm_relax?

Neither. 

When a recovery fail we don't want to call pm_relax().  The code in
rproc_crash_handler_work() becomes:

        
	if (rproc->state == RPROC_OFFLINE) {
                /* We have raced with rproc_shutdown() */
                pm_relax()
		mutex_unlock(&rproc->lock);
		return;
        }

	if (rproc->state == RPROC_CRASHED ||
            rproc->state == RPROC_RECOVERY_FAILED) {
		/* handle only the first crash detected */
		mutex_unlock(&rproc->lock);
		return;
	}


RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
the condition becomes:

        /* State could have changed before we got the mutex */
	if (rproc->state != RPROC_CRASHED &&
            rproc->state != RPROC_RECOVERY_FAILED)
		goto unlock_mutex;

Start with that and we can look at corner cases (if some exists) with a fresh
patchset.  Note that I have not addressed the attach/detach() scenario in the
above.

Thanks,
Mathieu

> 
> recovery fail case 1:
> |                                      |firstcrash interrupt issued
> | second crashed interrupt issued      | rproc_report_crash()
> | rproc_report_crash()                 |          pm_stay_awake()
> |          pm_stay_awake()             |          queue_work()
> |          queue_work()                |rproc_crash_handler_work()
> |                                      |mutex_lock(&rproc->lock);
> |                                      |rproc_stop()
> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> |                                      |RPROC_RECOVERY_FAIL //new
> |                                      |mutex_unlock(&rproc->lock);
> |mutex_lock(&rproc->lock);             |pm_relax()
> |if (rproc->state == RPROC_OFFLINE)    |
> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> |mutex_unlock(&rproc->lock);           |
> |                                      |
> |                                      |
> |                                      |
> 
> recovery fail case 2:
> |                                      |firstcrash interrupt issued
> |                                      | rproc_report_crash()
> |                                      |          pm_stay_awake()
> |                                      |          queue_work()
> |                                      |rproc_crash_handler_work()
> |                                      |mutex_lock(&rproc->lock);
> |                                      |rproc_stop()
> |                                      |rproc->state = RPROC_OFFLINE;
> |                                      |RPROC_RECOVERY_FAIL //new
> |                                      |mutex_unlock(&rproc->lock);
> |                                      |pm_relax()
> |
> | second crashed interrupt issued      |
> | rproc_report_crash()                 |
> |          pm_stay_awake()             |
> |          queue_work()                |
> |pm_stay_awake()
> |mutex_lock(&rproc->lock);
> |if (rproc->state == RPROC_OFFLINE)    |
> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> |mutex_unlock(&rproc->lock);           |
> |                                      |
> |                                      |
> |                                      |
> 
> Maybe I can have:
> 1. the pm_stay_awake and pm_relax with count based and call with paired for
> fix current concurency issue.
> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> work.
> 3. handle RPROC_DETACHED case.
> 
> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > <arnaud.pouliquen@foss.st.com> wrote:
> > > 
> > > Hi,
> > > 
> > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > 
> > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > Hi Mathieu,
> > > > > > > > > 
> > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > do that
> > > > > > > > > > on the mailing list.
> > > > > > > > > > 
> > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > 
> > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > 
> > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > something
> > > > > > > > > > went wrong.
> > > > > > > > > > 
> > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > returning,
> > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > is set
> > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > probably
> > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > things along.
> > > > > > > > > > 
> > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > finished.
> > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > 
> > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > recovering
> > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > 
> > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > 
> > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > from
> > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > 
> > > > > > > > rproc_report_crash()
> > > > > > > >            pm_stay_awake()
> > > > > > > >            queue_work() // current thread is suspended
> > > > > > > > 
> > > > > > > > rproc_shutdown()
> > > > > > > >            rproc_stop()
> > > > > > > >                    rproc->state = RPROC_OFFLINE;
> > > > > > > > 
> > > > > > > > rproc_crash_handler_work()
> > > > > > > >            if (rproc->state == RPROC_OFFLINE)
> > > > > > > >                    return // pm_relax() is not called
> > > > > > > > 
> > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > 
> > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > move
> > > > > > > forward if that is the case.
> > > > > > > 
> > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > 
> > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > the same time shutdown is triggered.
> > > > > 
> > > > > Unfortunately this is not enough details to address a problem as
> > > > > complex as this one.
> > > > > 
> > > > > > And the device cannto goes to suspend state after that.
> > > > > > the subsystem can still be start normally after this.
> > > > > 
> > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > confirms my suspicions we will have to think about a better solution.
> > > > > 
> > > > 
> > > > Hi Mathiew,
> > > > 
> > > > Could you pls have more details of any side effects other then power issue of
> > > > the current senario?
> > > > Why the current patch is not sufficient pls?
> > > > 
> > > > 
> > > > Have the current senario in details with rproc->lock information in details:
> > > > 
> > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > | rproc_report_crash()                    |
> > > > |          pm_stay_awake()                |
> > > > |          queue_work()                   |
> > > > |                                         |rproc_shutdown
> > > > |                                         |mutex_lock(&rproc->lock);
> > > > |                                         |rproc_stop()
> > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > |                                         |mutex_unlock(&rproc->lock);
> > > > |mutex_lock(&rproc->lock);                |
> > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > |return // pm_relax() is not called       |rproc_boot
> > > > |mutex_unlock(&rproc->lock);              |
> > > > |                                         |mutex_lock(&rproc->lock);
> > > > |                                         |rproc_start()
> > > > |                                         |mutex_unlock(&rproc->lock);
> > > > 
> > > > 
> > > 
> > > Agree with Mathieu, this is not so simple.
> > > 
> > 
> > Thanks for looking into this.
> > 
> > > Here is my view  hoping I haven't missed a point in your discussion or
> > > an other corner cases.
> > > 
> > > I tried to analyze the issues (in what follows, the term "condition" means
> > > the "if" condition in which Aiqun proposes to add the fix) :
> > > 
> > > I can see 4 use cases with race condition
> > > 
> > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > >       => not a real use case as if the remote processor is crashed we
> > >                should not have a second crash report
> > > 
> > 
> > That part is of great concern to me.  *Theoretically* we should not
> > get a new crash report while one has already been dispatched but the
> > current code accounts for this scenario and as such the possibility
> > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > to be called multiple times before a single instance of
> > rproc_boot_recovery() is scheduled.
> 
> 
> > 
> > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > >    rproc_crash_handler_work
> > >     => rproc->state = RPROC_OFFLINE
> > >     => we enter in the "condition" and the pm_relax has to be called
> > >     => This commit fix should solve this use case
> > > 
> > > 3) rproc_detach executed between the queue of the crash work and the call of
> > >    rproc_crash_handler_work
> > >     => rproc->state = RPROC_DETACHED;
> > >     => we don't go in "the condition" and issue because the recovery reattach
> > >        to the remote processor
> > >     => but pm_relax is called
> > >     => probably need an extra fix to avoid to re-attach
> > > 
> > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > >     this one corresponds to an auto reboot of the remote processor, with a
> > >     new crash
> > >     => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > >     4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > >          => should call pm_relax if rproc->recovery_disabled = true
> > >          => commit does not work for this use case
> > > 
> > >     4)b) rproc->state = RPROC_DETACHED if recovery fails
> > >         => error case with an unstable state
> > >         => how to differentiate it from the use case 3) ?
> > >         => introduce a RPROC_RECOVERY_FAIL state?
> > > 
> > 
> > The case where a recovery fails needs to be considered and is the
> > reason the original patch doesn't work.  Right now in
> > rproc_crash_handler_work(), it is not possible to differentiate
> > between a legitimate shutdown request (scenario #2 above) and a
> > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > would greatly simplify things.
> > 
> > My initial evaluation had not considered the attach/detach scenarios -
> > thanks for adding that in the mix.
> > 
> > Aiqun, please send a new patchset that adds a new remote processor
> > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > that set that takes attach/detach scenarios into account.  The code
> > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > sure to properly rebase.
> > 
> I will try.
> 
> > > 
> > > Then pm_stay_awake is called when the crash work is queued.
> > > It seems to me coherent to call the pm_relax in the work handler.
> > > 
> > > 
> > > 
> > > Here is a quick and dirty patch (not tested) that should take into account the
> > > main use cases ( except 1) and 4)b) )
> > > 
> > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > 
> > >          mutex_lock(&rproc->lock);
> > > 
> > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > +           rproc->state == RPROC_DETACHED) {
> > >                  /* handle only the first crash detected */
> > > +
> > > +               /*
> > > +                * call pm-relax in following use cases:
> > > +                * - the remote processor has been stopped by the user
> > > +                * - the remote processor is detached
> > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > +                */
> > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > +                       pm_relax(rproc->dev.parent);
> > >                  mutex_unlock(&rproc->lock);
> > >                  return;
> > >          }
> > > 
> > > Regards,
> > > Arnaud
> > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Mathieu
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > ---
> > > > > > > > > > >      drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > >      1 file changed, 11 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > work_struct *work)
> > > > > > > > > > >             if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > >                     /* handle only the first crash detected */
> > > > > > > > > > >                     mutex_unlock(&rproc->lock);
> > > > > > > > > > > +         /*
> > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > +          * process.
> > > > > > > > > > > +          */
> > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > >                     return;
> > > > > > > > > > >             }
> > > > > > > > > > > --
> > > > > > > > > > > 2.7.4
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > Thx and BRs,
> > > > > > > > > Aiqun(Maria) Yu
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Thx and BRs,
> > > > > > Aiqun(Maria) Yu
> > > > 
> > > > 
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 3, 2022, 2:03 a.m. UTC | #14
On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>> Hi,
>>
>> Let me think about this carefully.
>>
>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>> or just leave the pm_relax?
> 
> Neither.
> 
> When a recovery fail we don't want to call pm_relax().  The code in
> rproc_crash_handler_work() becomes:
> 
>          
> 	if (rproc->state == RPROC_OFFLINE) {
>                  /* We have raced with rproc_shutdown() */
>                  pm_relax()
> 		mutex_unlock(&rproc->lock);
> 		return;
>          }
> 
> 	if (rproc->state == RPROC_CRASHED ||
>              rproc->state == RPROC_RECOVERY_FAILED) {
> 		/* handle only the first crash detected */
> 		mutex_unlock(&rproc->lock);
> 		return;
> 	}
> 
> 
> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> the condition becomes:
> 
>          /* State could have changed before we got the mutex */
> 	if (rproc->state != RPROC_CRASHED &&
>              rproc->state != RPROC_RECOVERY_FAILED)
> 		goto unlock_mutex;
> 
> Start with that and we can look at corner cases (if some exists) with a fresh
> patchset.  Note that I have not addressed the attach/detach() scenario in the
> above.

If we didn't deal with the recovery failed case with correct pm_relax 
call, it may left the device in a state that cannot enter to suspend state.
Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax 
called before the second crash handler call pm_stay_awake or not.

So, What about the atomic count along with pm_relax and pm_stay_awake ?

struct rproc{
...
atomic_t wake_count;
...
}

rproc_pm_stay_awake()
{
	atomic_inc(&wake_count);
	pm_stay_awake();
}

rproc_pm_relax()
{
	if (atomic_dec_return(&wake_count) == 0)
		pm_stay_awake();
}

can refer code like:

rproc_report_crash()
{
	...
	rproc_pm_stay_awake();
	queue_work();
	...
}

rproc_crash_handler_work()
{
	...
	if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
                   /* We have raced with rproc_shutdown() */
                   rproc_pm_relax();
  		mutex_unlock(&rproc->lock);
  		return;
           }
	...
}

> 
> Thanks,
> Mathieu
> 
>>
>> recovery fail case 1:
>> |                                      |firstcrash interrupt issued
>> | second crashed interrupt issued      | rproc_report_crash()
>> | rproc_report_crash()                 |          pm_stay_awake()
>> |          pm_stay_awake()             |          queue_work()
>> |          queue_work()                |rproc_crash_handler_work()
>> |                                      |mutex_lock(&rproc->lock);
>> |                                      |rproc_stop()
>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>> |                                      |RPROC_RECOVERY_FAIL //new
>> |                                      |mutex_unlock(&rproc->lock);
>> |mutex_lock(&rproc->lock);             |pm_relax()
>> |if (rproc->state == RPROC_OFFLINE)    |
>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>> |mutex_unlock(&rproc->lock);           |
>> |                                      |
>> |                                      |
>> |                                      |
>>
>> recovery fail case 2:
>> |                                      |firstcrash interrupt issued
>> |                                      | rproc_report_crash()
>> |                                      |          pm_stay_awake()
>> |                                      |          queue_work()
>> |                                      |rproc_crash_handler_work()
>> |                                      |mutex_lock(&rproc->lock);
>> |                                      |rproc_stop()
>> |                                      |rproc->state = RPROC_OFFLINE;
>> |                                      |RPROC_RECOVERY_FAIL //new
>> |                                      |mutex_unlock(&rproc->lock);
>> |                                      |pm_relax()
>> |
>> | second crashed interrupt issued      |
>> | rproc_report_crash()                 |
>> |          pm_stay_awake()             |
>> |          queue_work()                |
>> |pm_stay_awake()
>> |mutex_lock(&rproc->lock);
>> |if (rproc->state == RPROC_OFFLINE)    |
>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>> |mutex_unlock(&rproc->lock);           |
>> |                                      |
>> |                                      |
>> |                                      |
>>
>> Maybe I can have:
>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>> fix current concurency issue.
>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>> work.
>> 3. handle RPROC_DETACHED case.
>>
>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>
>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>
>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>> do that
>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>
>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>
>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>> something
>>>>>>>>>>> went wrong.
>>>>>>>>>>>
>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>> returning,
>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>> is set
>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>> probably
>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>> things along.
>>>>>>>>>>>
>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>> finished.
>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>
>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>> recovering
>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>
>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>
>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>> from
>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>
>>>>>>>>> rproc_report_crash()
>>>>>>>>>             pm_stay_awake()
>>>>>>>>>             queue_work() // current thread is suspended
>>>>>>>>>
>>>>>>>>> rproc_shutdown()
>>>>>>>>>             rproc_stop()
>>>>>>>>>                     rproc->state = RPROC_OFFLINE;
>>>>>>>>>
>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>             if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>                     return // pm_relax() is not called
>>>>>>>>>
>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>
>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>> move
>>>>>>>> forward if that is the case.
>>>>>>>>
>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>
>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>> the same time shutdown is triggered.
>>>>>>
>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>> complex as this one.
>>>>>>
>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>> the subsystem can still be start normally after this.
>>>>>>
>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>
>>>>>
>>>>> Hi Mathiew,
>>>>>
>>>>> Could you pls have more details of any side effects other then power issue of
>>>>> the current senario?
>>>>> Why the current patch is not sufficient pls?
>>>>>
>>>>>
>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>
>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>> | rproc_report_crash()                    |
>>>>> |          pm_stay_awake()                |
>>>>> |          queue_work()                   |
>>>>> |                                         |rproc_shutdown
>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>> |                                         |rproc_stop()
>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>> |mutex_lock(&rproc->lock);                |
>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>> |mutex_unlock(&rproc->lock);              |
>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>> |                                         |rproc_start()
>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>
>>>>>
>>>>
>>>> Agree with Mathieu, this is not so simple.
>>>>
>>>
>>> Thanks for looking into this.
>>>
>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>> an other corner cases.
>>>>
>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>
>>>> I can see 4 use cases with race condition
>>>>
>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>        => not a real use case as if the remote processor is crashed we
>>>>                 should not have a second crash report
>>>>
>>>
>>> That part is of great concern to me.  *Theoretically* we should not
>>> get a new crash report while one has already been dispatched but the
>>> current code accounts for this scenario and as such the possibility
>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>> to be called multiple times before a single instance of
>>> rproc_boot_recovery() is scheduled.
>>
>>
>>>
>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>     rproc_crash_handler_work
>>>>      => rproc->state = RPROC_OFFLINE
>>>>      => we enter in the "condition" and the pm_relax has to be called
>>>>      => This commit fix should solve this use case
>>>>
>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>     rproc_crash_handler_work
>>>>      => rproc->state = RPROC_DETACHED;
>>>>      => we don't go in "the condition" and issue because the recovery reattach
>>>>         to the remote processor
>>>>      => but pm_relax is called
>>>>      => probably need an extra fix to avoid to re-attach
>>>>
>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>      this one corresponds to an auto reboot of the remote processor, with a
>>>>      new crash
>>>>      => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>      4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>           => should call pm_relax if rproc->recovery_disabled = true
>>>>           => commit does not work for this use case
>>>>
>>>>      4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>          => error case with an unstable state
>>>>          => how to differentiate it from the use case 3) ?
>>>>          => introduce a RPROC_RECOVERY_FAIL state?
>>>>
>>>
>>> The case where a recovery fails needs to be considered and is the
>>> reason the original patch doesn't work.  Right now in
>>> rproc_crash_handler_work(), it is not possible to differentiate
>>> between a legitimate shutdown request (scenario #2 above) and a
>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>> would greatly simplify things.
>>>
>>> My initial evaluation had not considered the attach/detach scenarios -
>>> thanks for adding that in the mix.
>>>
>>> Aiqun, please send a new patchset that adds a new remote processor
>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>> that set that takes attach/detach scenarios into account.  The code
>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>> sure to properly rebase.
>>>
>> I will try.
>>
>>>>
>>>> Then pm_stay_awake is called when the crash work is queued.
>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>
>>>>
>>>>
>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>> main use cases ( except 1) and 4)b) )
>>>>
>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>
>>>>           mutex_lock(&rproc->lock);
>>>>
>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>> +           rproc->state == RPROC_DETACHED) {
>>>>                   /* handle only the first crash detected */
>>>> +
>>>> +               /*
>>>> +                * call pm-relax in following use cases:
>>>> +                * - the remote processor has been stopped by the user
>>>> +                * - the remote processor is detached
>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>> +                */
>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>> +                       pm_relax(rproc->dev.parent);
>>>>                   mutex_unlock(&rproc->lock);
>>>>                   return;
>>>>           }
>>>>
>>>> Regards,
>>>> Arnaud
>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mathieu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>       1 file changed, 11 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>              if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>                      /* handle only the first crash detected */
>>>>>>>>>>>>                      mutex_unlock(&rproc->lock);
>>>>>>>>>>>> +         /*
>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>> +          * process.
>>>>>>>>>>>> +          */
>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>                      return;
>>>>>>>>>>>>              }
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thx and BRs,
>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thx and BRs,
>>>>>>> Aiqun(Maria) Yu
>>>>>
>>>>>
>>
>>
>> -- 
>> Thx and BRs,
>> Aiqun(Maria) Yu
Mathieu Poirier Nov. 4, 2022, 3:59 p.m. UTC | #15
On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> > On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> > > Hi,
> > > 
> > > Let me think about this carefully.
> > > 
> > > When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> > > or just leave the pm_relax?
> > 
> > Neither.
> > 
> > When a recovery fail we don't want to call pm_relax().  The code in
> > rproc_crash_handler_work() becomes:
> > 
> > 	if (rproc->state == RPROC_OFFLINE) {
> >                  /* We have raced with rproc_shutdown() */
> >                  pm_relax()
> > 		mutex_unlock(&rproc->lock);
> > 		return;
> >          }
> > 
> > 	if (rproc->state == RPROC_CRASHED ||
> >              rproc->state == RPROC_RECOVERY_FAILED) {
> > 		/* handle only the first crash detected */
> > 		mutex_unlock(&rproc->lock);
> > 		return;
> > 	}
> > 
> > 
> > RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> > rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> > recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> > the condition becomes:
> > 
> >          /* State could have changed before we got the mutex */
> > 	if (rproc->state != RPROC_CRASHED &&
> >              rproc->state != RPROC_RECOVERY_FAILED)
> > 		goto unlock_mutex;
> > 
> > Start with that and we can look at corner cases (if some exists) with a fresh
> > patchset.  Note that I have not addressed the attach/detach() scenario in the
> > above.
> 
> If we didn't deal with the recovery failed case with correct pm_relax call,
> it may left the device in a state that cannot enter to suspend state.

That is what I am looking for.  We don't want to give the impression that
everything is fine by allowing the device to suspend.  If the remote processor
can't be recovered than it needs to be dealth with.

> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> before the second crash handler call pm_stay_awake or not.
> 

I've been thinking about that part.  I don't think adding a wake_count to
control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
similar count happening in the PM runtime subsystem and that is what we should
be using.  I have asked a power management expert at Linaro for guidance with
this matter.  I should be able to get back to you with a way forward by the end
of next week.

> So, What about the atomic count along with pm_relax and pm_stay_awake ?
> 
> struct rproc{
> ...
> atomic_t wake_count;
> ...
> }
> 
> rproc_pm_stay_awake()
> {
> 	atomic_inc(&wake_count);
> 	pm_stay_awake();
> }
> 
> rproc_pm_relax()
> {
> 	if (atomic_dec_return(&wake_count) == 0)
> 		pm_stay_awake();
> }
> 
> can refer code like:
> 
> rproc_report_crash()
> {
> 	...
> 	rproc_pm_stay_awake();
> 	queue_work();
> 	...
> }
> 
> rproc_crash_handler_work()
> {
> 	...
> 	if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>                   /* We have raced with rproc_shutdown() */
>                   rproc_pm_relax();
>  		mutex_unlock(&rproc->lock);
>  		return;
>           }
> 	...
> }
> 
> > 
> > Thanks,
> > Mathieu
> > 
> > > 
> > > recovery fail case 1:
> > > |                                      |firstcrash interrupt issued
> > > | second crashed interrupt issued      | rproc_report_crash()
> > > | rproc_report_crash()                 |          pm_stay_awake()
> > > |          pm_stay_awake()             |          queue_work()
> > > |          queue_work()                |rproc_crash_handler_work()
> > > |                                      |mutex_lock(&rproc->lock);
> > > |                                      |rproc_stop()
> > > |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> > > |                                      |RPROC_RECOVERY_FAIL //new
> > > |                                      |mutex_unlock(&rproc->lock);
> > > |mutex_lock(&rproc->lock);             |pm_relax()
> > > |if (rproc->state == RPROC_OFFLINE)    |
> > > |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > |mutex_unlock(&rproc->lock);           |
> > > |                                      |
> > > |                                      |
> > > |                                      |
> > > 
> > > recovery fail case 2:
> > > |                                      |firstcrash interrupt issued
> > > |                                      | rproc_report_crash()
> > > |                                      |          pm_stay_awake()
> > > |                                      |          queue_work()
> > > |                                      |rproc_crash_handler_work()
> > > |                                      |mutex_lock(&rproc->lock);
> > > |                                      |rproc_stop()
> > > |                                      |rproc->state = RPROC_OFFLINE;
> > > |                                      |RPROC_RECOVERY_FAIL //new
> > > |                                      |mutex_unlock(&rproc->lock);
> > > |                                      |pm_relax()
> > > |
> > > | second crashed interrupt issued      |
> > > | rproc_report_crash()                 |
> > > |          pm_stay_awake()             |
> > > |          queue_work()                |
> > > |pm_stay_awake()
> > > |mutex_lock(&rproc->lock);
> > > |if (rproc->state == RPROC_OFFLINE)    |
> > > |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > |mutex_unlock(&rproc->lock);           |
> > > |                                      |
> > > |                                      |
> > > |                                      |
> > > 
> > > Maybe I can have:
> > > 1. the pm_stay_awake and pm_relax with count based and call with paired for
> > > fix current concurency issue.
> > > 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> > > work.
> > > 3. handle RPROC_DETACHED case.
> > > 
> > > On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > > > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > > > <arnaud.pouliquen@foss.st.com> wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > 
> > > > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > Hi Mathieu,
> > > > > > > > > > > 
> > > > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > > > do that
> > > > > > > > > > > > on the mailing list.
> > > > > > > > > > > > 
> > > > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > > > 
> > > > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > > > 
> > > > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > > > something
> > > > > > > > > > > > went wrong.
> > > > > > > > > > > > 
> > > > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > > > returning,
> > > > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > > > is set
> > > > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > > > probably
> > > > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > > > things along.
> > > > > > > > > > > > 
> > > > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > > > finished.
> > > > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > > > 
> > > > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > > > recovering
> > > > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > > > 
> > > > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > > > 
> > > > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > > > from
> > > > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > > > 
> > > > > > > > > > rproc_report_crash()
> > > > > > > > > >             pm_stay_awake()
> > > > > > > > > >             queue_work() // current thread is suspended
> > > > > > > > > > 
> > > > > > > > > > rproc_shutdown()
> > > > > > > > > >             rproc_stop()
> > > > > > > > > >                     rproc->state = RPROC_OFFLINE;
> > > > > > > > > > 
> > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > >             if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > >                     return // pm_relax() is not called
> > > > > > > > > > 
> > > > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > > > 
> > > > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > > > move
> > > > > > > > > forward if that is the case.
> > > > > > > > > 
> > > > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > > > 
> > > > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > > > the same time shutdown is triggered.
> > > > > > > 
> > > > > > > Unfortunately this is not enough details to address a problem as
> > > > > > > complex as this one.
> > > > > > > 
> > > > > > > > And the device cannto goes to suspend state after that.
> > > > > > > > the subsystem can still be start normally after this.
> > > > > > > 
> > > > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > > > confirms my suspicions we will have to think about a better solution.
> > > > > > > 
> > > > > > 
> > > > > > Hi Mathiew,
> > > > > > 
> > > > > > Could you pls have more details of any side effects other then power issue of
> > > > > > the current senario?
> > > > > > Why the current patch is not sufficient pls?
> > > > > > 
> > > > > > 
> > > > > > Have the current senario in details with rproc->lock information in details:
> > > > > > 
> > > > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > > > | rproc_report_crash()                    |
> > > > > > |          pm_stay_awake()                |
> > > > > > |          queue_work()                   |
> > > > > > |                                         |rproc_shutdown
> > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > |                                         |rproc_stop()
> > > > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > |mutex_lock(&rproc->lock);                |
> > > > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > > > |return // pm_relax() is not called       |rproc_boot
> > > > > > |mutex_unlock(&rproc->lock);              |
> > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > |                                         |rproc_start()
> > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Agree with Mathieu, this is not so simple.
> > > > > 
> > > > 
> > > > Thanks for looking into this.
> > > > 
> > > > > Here is my view  hoping I haven't missed a point in your discussion or
> > > > > an other corner cases.
> > > > > 
> > > > > I tried to analyze the issues (in what follows, the term "condition" means
> > > > > the "if" condition in which Aiqun proposes to add the fix) :
> > > > > 
> > > > > I can see 4 use cases with race condition
> > > > > 
> > > > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > > > >        => not a real use case as if the remote processor is crashed we
> > > > >                 should not have a second crash report
> > > > > 
> > > > 
> > > > That part is of great concern to me.  *Theoretically* we should not
> > > > get a new crash report while one has already been dispatched but the
> > > > current code accounts for this scenario and as such the possibility
> > > > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > > > to be called multiple times before a single instance of
> > > > rproc_boot_recovery() is scheduled.
> > > 
> > > 
> > > > 
> > > > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > > > >     rproc_crash_handler_work
> > > > >      => rproc->state = RPROC_OFFLINE
> > > > >      => we enter in the "condition" and the pm_relax has to be called
> > > > >      => This commit fix should solve this use case
> > > > > 
> > > > > 3) rproc_detach executed between the queue of the crash work and the call of
> > > > >     rproc_crash_handler_work
> > > > >      => rproc->state = RPROC_DETACHED;
> > > > >      => we don't go in "the condition" and issue because the recovery reattach
> > > > >         to the remote processor
> > > > >      => but pm_relax is called
> > > > >      => probably need an extra fix to avoid to re-attach
> > > > > 
> > > > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > > > >      this one corresponds to an auto reboot of the remote processor, with a
> > > > >      new crash
> > > > >      => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > > > >      4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > > > >           => should call pm_relax if rproc->recovery_disabled = true
> > > > >           => commit does not work for this use case
> > > > > 
> > > > >      4)b) rproc->state = RPROC_DETACHED if recovery fails
> > > > >          => error case with an unstable state
> > > > >          => how to differentiate it from the use case 3) ?
> > > > >          => introduce a RPROC_RECOVERY_FAIL state?
> > > > > 
> > > > 
> > > > The case where a recovery fails needs to be considered and is the
> > > > reason the original patch doesn't work.  Right now in
> > > > rproc_crash_handler_work(), it is not possible to differentiate
> > > > between a legitimate shutdown request (scenario #2 above) and a
> > > > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > > > would greatly simplify things.
> > > > 
> > > > My initial evaluation had not considered the attach/detach scenarios -
> > > > thanks for adding that in the mix.
> > > > 
> > > > Aiqun, please send a new patchset that adds a new remote processor
> > > > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > > > that set that takes attach/detach scenarios into account.  The code
> > > > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > > > sure to properly rebase.
> > > > 
> > > I will try.
> > > 
> > > > > 
> > > > > Then pm_stay_awake is called when the crash work is queued.
> > > > > It seems to me coherent to call the pm_relax in the work handler.
> > > > > 
> > > > > 
> > > > > 
> > > > > Here is a quick and dirty patch (not tested) that should take into account the
> > > > > main use cases ( except 1) and 4)b) )
> > > > > 
> > > > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > > > 
> > > > >           mutex_lock(&rproc->lock);
> > > > > 
> > > > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > > > +           rproc->state == RPROC_DETACHED) {
> > > > >                   /* handle only the first crash detected */
> > > > > +
> > > > > +               /*
> > > > > +                * call pm-relax in following use cases:
> > > > > +                * - the remote processor has been stopped by the user
> > > > > +                * - the remote processor is detached
> > > > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > > > +                */
> > > > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > > > +                       pm_relax(rproc->dev.parent);
> > > > >                   mutex_unlock(&rproc->lock);
> > > > >                   return;
> > > > >           }
> > > > > 
> > > > > Regards,
> > > > > Arnaud
> > > > > 
> > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Mathieu
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >       drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > > > >       1 file changed, 11 insertions(+)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > > > work_struct *work)
> > > > > > > > > > > > >              if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > > > >                      /* handle only the first crash detected */
> > > > > > > > > > > > >                      mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > +         /*
> > > > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > > > +          * process.
> > > > > > > > > > > > > +          */
> > > > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > > > >                      return;
> > > > > > > > > > > > >              }
> > > > > > > > > > > > > --
> > > > > > > > > > > > > 2.7.4
> > > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > --
> > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Thx and BRs,
> > > > > > > > Aiqun(Maria) Yu
> > > > > > 
> > > > > > 
> > > 
> > > 
> > > -- 
> > > Thx and BRs,
> > > Aiqun(Maria) Yu
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 7, 2022, 1:14 a.m. UTC | #16
Hi,
On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>> Hi,
>>>>
>>>> Let me think about this carefully.
>>>>
>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>> or just leave the pm_relax?
>>>
>>> Neither.
>>>
>>> When a recovery fail we don't want to call pm_relax().  The code in
>>> rproc_crash_handler_work() becomes:
>>>
>>> 	if (rproc->state == RPROC_OFFLINE) {
>>>                   /* We have raced with rproc_shutdown() */
>>>                   pm_relax()
>>> 		mutex_unlock(&rproc->lock);
>>> 		return;
>>>           }
>>>
>>> 	if (rproc->state == RPROC_CRASHED ||
>>>               rproc->state == RPROC_RECOVERY_FAILED) {
>>> 		/* handle only the first crash detected */
>>> 		mutex_unlock(&rproc->lock);
>>> 		return;
>>> 	}
>>>
>>>
>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>> the condition becomes:
>>>
>>>           /* State could have changed before we got the mutex */
>>> 	if (rproc->state != RPROC_CRASHED &&
>>>               rproc->state != RPROC_RECOVERY_FAILED)
>>> 		goto unlock_mutex;
>>>
>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>> above.
>>
>> If we didn't deal with the recovery failed case with correct pm_relax call,
>> it may left the device in a state that cannot enter to suspend state.
> 
> That is what I am looking for.  We don't want to give the impression that
> everything is fine by allowing the device to suspend.  If the remote processor
> can't be recovered than it needs to be dealth with.
For the normal recovery failed case, it still need to do pm_relax to not 
prevent the device goes to suspend. It is what in normal recovery failed 
case we do in rproc_crash_handler_work as well.
rproc_crash_handler_work will not check the result of the 
rproc_trigger_recovery return value, and will always do pm_relax.

For current conconrency cornor case as well, it is better to consistant 
with the current design of recovery fail senarios in normal cases.

I personally agree that we shouldn't do nothing when it is a 
RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work 
check, because it maybe crash happened when it is trying to do the recovery.
So I suggested to do a continue try of trigger recovery again instead of 
doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.

> 
>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>> before the second crash handler call pm_stay_awake or not.
>>
> 
> I've been thinking about that part.  I don't think adding a wake_count to
> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> similar count happening in the PM runtime subsystem and that is what we should
> be using.  I have asked a power management expert at Linaro for guidance with
> this matter.  I should be able to get back to you with a way forward by the end
> of next week.
> 
Thx for the specific date provided as well. I will wait until your reply 
for next patchset then.

>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>
>> struct rproc{
>> ...
>> atomic_t wake_count;
>> ...
>> }
>>
>> rproc_pm_stay_awake()
>> {
>> 	atomic_inc(&wake_count);
>> 	pm_stay_awake();
>> }
>>
>> rproc_pm_relax()
>> {
>> 	if (atomic_dec_return(&wake_count) == 0)
>> 		pm_stay_awake();
>> }
>>
>> can refer code like:
>>
>> rproc_report_crash()
>> {
>> 	...
>> 	rproc_pm_stay_awake();
>> 	queue_work();
>> 	...
>> }
>>
>> rproc_crash_handler_work()
>> {
>> 	...
>> 	if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>                    /* We have raced with rproc_shutdown() */
>>                    rproc_pm_relax();
>>   		mutex_unlock(&rproc->lock);
>>   		return;
>>            }
>> 	...
>> }
>>
>>>
>>> Thanks,
>>> Mathieu
>>>
>>>>
>>>> recovery fail case 1:
>>>> |                                      |firstcrash interrupt issued
>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>> |          pm_stay_awake()             |          queue_work()
>>>> |          queue_work()                |rproc_crash_handler_work()
>>>> |                                      |mutex_lock(&rproc->lock);
>>>> |                                      |rproc_stop()
>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>> |                                      |mutex_unlock(&rproc->lock);
>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>> |mutex_unlock(&rproc->lock);           |
>>>> |                                      |
>>>> |                                      |
>>>> |                                      |
>>>>
>>>> recovery fail case 2:
>>>> |                                      |firstcrash interrupt issued
>>>> |                                      | rproc_report_crash()
>>>> |                                      |          pm_stay_awake()
>>>> |                                      |          queue_work()
>>>> |                                      |rproc_crash_handler_work()
>>>> |                                      |mutex_lock(&rproc->lock);
>>>> |                                      |rproc_stop()
>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>> |                                      |mutex_unlock(&rproc->lock);
>>>> |                                      |pm_relax()
>>>> |
>>>> | second crashed interrupt issued      |
>>>> | rproc_report_crash()                 |
>>>> |          pm_stay_awake()             |
>>>> |          queue_work()                |
>>>> |pm_stay_awake()
>>>> |mutex_lock(&rproc->lock);
>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>> |mutex_unlock(&rproc->lock);           |
>>>> |                                      |
>>>> |                                      |
>>>> |                                      |
>>>>
>>>> Maybe I can have:
>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>> fix current concurency issue.
>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>> work.
>>>> 3. handle RPROC_DETACHED case.
>>>>
>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>
>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>> do that
>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>> something
>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>> returning,
>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>> is set
>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>> probably
>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>
>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>> finished.
>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>
>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>> recovering
>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>
>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>
>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>> from
>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>
>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>              pm_stay_awake()
>>>>>>>>>>>              queue_work() // current thread is suspended
>>>>>>>>>>>
>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>              rproc_stop()
>>>>>>>>>>>                      rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>
>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>              if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>                      return // pm_relax() is not called
>>>>>>>>>>>
>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>
>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>> move
>>>>>>>>>> forward if that is the case.
>>>>>>>>>>
>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>
>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>
>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>> complex as this one.
>>>>>>>>
>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>
>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Mathiew,
>>>>>>>
>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>> the current senario?
>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>
>>>>>>>
>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>
>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>> | rproc_report_crash()                    |
>>>>>>> |          pm_stay_awake()                |
>>>>>>> |          queue_work()                   |
>>>>>>> |                                         |rproc_shutdown
>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>> |                                         |rproc_stop()
>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>> |                                         |rproc_start()
>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>
>>>>>
>>>>> Thanks for looking into this.
>>>>>
>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>> an other corner cases.
>>>>>>
>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>
>>>>>> I can see 4 use cases with race condition
>>>>>>
>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>         => not a real use case as if the remote processor is crashed we
>>>>>>                  should not have a second crash report
>>>>>>
>>>>>
>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>> get a new crash report while one has already been dispatched but the
>>>>> current code accounts for this scenario and as such the possibility
>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>> to be called multiple times before a single instance of
>>>>> rproc_boot_recovery() is scheduled.
>>>>
>>>>
>>>>>
>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>      rproc_crash_handler_work
>>>>>>       => rproc->state = RPROC_OFFLINE
>>>>>>       => we enter in the "condition" and the pm_relax has to be called
>>>>>>       => This commit fix should solve this use case
>>>>>>
>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>      rproc_crash_handler_work
>>>>>>       => rproc->state = RPROC_DETACHED;
>>>>>>       => we don't go in "the condition" and issue because the recovery reattach
>>>>>>          to the remote processor
>>>>>>       => but pm_relax is called
>>>>>>       => probably need an extra fix to avoid to re-attach
>>>>>>
>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>       this one corresponds to an auto reboot of the remote processor, with a
>>>>>>       new crash
>>>>>>       => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>       4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>            => should call pm_relax if rproc->recovery_disabled = true
>>>>>>            => commit does not work for this use case
>>>>>>
>>>>>>       4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>           => error case with an unstable state
>>>>>>           => how to differentiate it from the use case 3) ?
>>>>>>           => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>
>>>>>
>>>>> The case where a recovery fails needs to be considered and is the
>>>>> reason the original patch doesn't work.  Right now in
>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>> would greatly simplify things.
>>>>>
>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>> thanks for adding that in the mix.
>>>>>
>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>> sure to properly rebase.
>>>>>
>>>> I will try.
>>>>
>>>>>>
>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>
>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>
>>>>>>            mutex_lock(&rproc->lock);
>>>>>>
>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>                    /* handle only the first crash detected */
>>>>>> +
>>>>>> +               /*
>>>>>> +                * call pm-relax in following use cases:
>>>>>> +                * - the remote processor has been stopped by the user
>>>>>> +                * - the remote processor is detached
>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>> +                */
>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>                    mutex_unlock(&rproc->lock);
>>>>>>                    return;
>>>>>>            }
>>>>>>
>>>>>> Regards,
>>>>>> Arnaud
>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>        drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>        1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>               if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>                       /* handle only the first crash detected */
>>>>>>>>>>>>>>                       mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>                       return;
>>>>>>>>>>>>>>               }
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thx and BRs,
>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> -- 
>>>> Thx and BRs,
>>>> Aiqun(Maria) Yu
>>
>>
>> -- 
>> Thx and BRs,
>> Aiqun(Maria) Yu
Mathieu Poirier Nov. 10, 2022, 8:50 p.m. UTC | #17
I had a couple of good discussions with our power management expert
and even then, the way forward isn't as clear as I would have liked.
I am currently travelling and as such don't have the required time to
go into greater details, something I will be doing next week.

On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>
> Hi,
> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> >> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> >>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> >>>> Hi,
> >>>>
> >>>> Let me think about this carefully.
> >>>>
> >>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> >>>> or just leave the pm_relax?
> >>>
> >>> Neither.
> >>>
> >>> When a recovery fail we don't want to call pm_relax().  The code in
> >>> rproc_crash_handler_work() becomes:
> >>>
> >>>     if (rproc->state == RPROC_OFFLINE) {
> >>>                   /* We have raced with rproc_shutdown() */
> >>>                   pm_relax()
> >>>             mutex_unlock(&rproc->lock);
> >>>             return;
> >>>           }
> >>>
> >>>     if (rproc->state == RPROC_CRASHED ||
> >>>               rproc->state == RPROC_RECOVERY_FAILED) {
> >>>             /* handle only the first crash detected */
> >>>             mutex_unlock(&rproc->lock);
> >>>             return;
> >>>     }
> >>>
> >>>
> >>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> >>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> >>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> >>> the condition becomes:
> >>>
> >>>           /* State could have changed before we got the mutex */
> >>>     if (rproc->state != RPROC_CRASHED &&
> >>>               rproc->state != RPROC_RECOVERY_FAILED)
> >>>             goto unlock_mutex;
> >>>
> >>> Start with that and we can look at corner cases (if some exists) with a fresh
> >>> patchset.  Note that I have not addressed the attach/detach() scenario in the
> >>> above.
> >>
> >> If we didn't deal with the recovery failed case with correct pm_relax call,
> >> it may left the device in a state that cannot enter to suspend state.
> >
> > That is what I am looking for.  We don't want to give the impression that
> > everything is fine by allowing the device to suspend.  If the remote processor
> > can't be recovered than it needs to be dealth with.
> For the normal recovery failed case, it still need to do pm_relax to not
> prevent the device goes to suspend. It is what in normal recovery failed
> case we do in rproc_crash_handler_work as well.
> rproc_crash_handler_work will not check the result of the
> rproc_trigger_recovery return value, and will always do pm_relax.
>
> For current conconrency cornor case as well, it is better to consistant
> with the current design of recovery fail senarios in normal cases.
>
> I personally agree that we shouldn't do nothing when it is a
> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
> check, because it maybe crash happened when it is trying to do the recovery.
> So I suggested to do a continue try of trigger recovery again instead of
> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>
> >
> >> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> >> before the second crash handler call pm_stay_awake or not.
> >>
> >
> > I've been thinking about that part.  I don't think adding a wake_count to
> > control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> > similar count happening in the PM runtime subsystem and that is what we should
> > be using.  I have asked a power management expert at Linaro for guidance with
> > this matter.  I should be able to get back to you with a way forward by the end
> > of next week.
> >
> Thx for the specific date provided as well. I will wait until your reply
> for next patchset then.
>
> >> So, What about the atomic count along with pm_relax and pm_stay_awake ?
> >>
> >> struct rproc{
> >> ...
> >> atomic_t wake_count;
> >> ...
> >> }
> >>
> >> rproc_pm_stay_awake()
> >> {
> >>      atomic_inc(&wake_count);
> >>      pm_stay_awake();
> >> }
> >>
> >> rproc_pm_relax()
> >> {
> >>      if (atomic_dec_return(&wake_count) == 0)
> >>              pm_stay_awake();
> >> }
> >>
> >> can refer code like:
> >>
> >> rproc_report_crash()
> >> {
> >>      ...
> >>      rproc_pm_stay_awake();
> >>      queue_work();
> >>      ...
> >> }
> >>
> >> rproc_crash_handler_work()
> >> {
> >>      ...
> >>      if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> >>                    /* We have raced with rproc_shutdown() */
> >>                    rproc_pm_relax();
> >>              mutex_unlock(&rproc->lock);
> >>              return;
> >>            }
> >>      ...
> >> }
> >>
> >>>
> >>> Thanks,
> >>> Mathieu
> >>>
> >>>>
> >>>> recovery fail case 1:
> >>>> |                                      |firstcrash interrupt issued
> >>>> | second crashed interrupt issued      | rproc_report_crash()
> >>>> | rproc_report_crash()                 |          pm_stay_awake()
> >>>> |          pm_stay_awake()             |          queue_work()
> >>>> |          queue_work()                |rproc_crash_handler_work()
> >>>> |                                      |mutex_lock(&rproc->lock);
> >>>> |                                      |rproc_stop()
> >>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> >>>> |                                      |RPROC_RECOVERY_FAIL //new
> >>>> |                                      |mutex_unlock(&rproc->lock);
> >>>> |mutex_lock(&rproc->lock);             |pm_relax()
> >>>> |if (rproc->state == RPROC_OFFLINE)    |
> >>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> >>>> |mutex_unlock(&rproc->lock);           |
> >>>> |                                      |
> >>>> |                                      |
> >>>> |                                      |
> >>>>
> >>>> recovery fail case 2:
> >>>> |                                      |firstcrash interrupt issued
> >>>> |                                      | rproc_report_crash()
> >>>> |                                      |          pm_stay_awake()
> >>>> |                                      |          queue_work()
> >>>> |                                      |rproc_crash_handler_work()
> >>>> |                                      |mutex_lock(&rproc->lock);
> >>>> |                                      |rproc_stop()
> >>>> |                                      |rproc->state = RPROC_OFFLINE;
> >>>> |                                      |RPROC_RECOVERY_FAIL //new
> >>>> |                                      |mutex_unlock(&rproc->lock);
> >>>> |                                      |pm_relax()
> >>>> |
> >>>> | second crashed interrupt issued      |
> >>>> | rproc_report_crash()                 |
> >>>> |          pm_stay_awake()             |
> >>>> |          queue_work()                |
> >>>> |pm_stay_awake()
> >>>> |mutex_lock(&rproc->lock);
> >>>> |if (rproc->state == RPROC_OFFLINE)    |
> >>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> >>>> |mutex_unlock(&rproc->lock);           |
> >>>> |                                      |
> >>>> |                                      |
> >>>> |                                      |
> >>>>
> >>>> Maybe I can have:
> >>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
> >>>> fix current concurency issue.
> >>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> >>>> work.
> >>>> 3. handle RPROC_DETACHED case.
> >>>>
> >>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> >>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> >>>>> <arnaud.pouliquen@foss.st.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> >>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> >>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> >>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> >>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> >>>>>>>>>>>> Hi Mathieu,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> >>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
> >>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
> >>>>>>>>>>>>> do that
> >>>>>>>>>>>>> on the mailing list.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> >>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
> >>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
> >>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
> >>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>>>>>>>>>>> and then unlock rproc->lock.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
> >>>>>>>>>>>>> RPROC_RUNNING
> >>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
> >>>>>>>>>>>>> something
> >>>>>>>>>>>>> went wrong.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> >>>>>>>>>>>>> returning,
> >>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
> >>>>>>>>>>>>> is set
> >>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
> >>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
> >>>>>>>>>>>>> probably
> >>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> >>>>>>>>>>>>> things along.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> >>>>>>>>>>>> finished.
> >>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
> >>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
> >>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> >>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
> >>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
> >>>>>>>>>>>>> recovering
> >>>>>>>>>>>>> from a crash and address that problem(s).
> >>>>>>>>>>>>>
> >>>>>>>>>>>> In current case, the customer's information is that the issue happened when
> >>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
> >>>>>>>>>>>> of rproc_trigger_recovery.
> >>>>>>>>>>>
> >>>>>>>>>>> That is a very important element to consider and should have been mentioned
> >>>>>>>>>>> from
> >>>>>>>>>>> the beginning.  What I see happening is the following:
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_report_crash()
> >>>>>>>>>>>              pm_stay_awake()
> >>>>>>>>>>>              queue_work() // current thread is suspended
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_shutdown()
> >>>>>>>>>>>              rproc_stop()
> >>>>>>>>>>>                      rproc->state = RPROC_OFFLINE;
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_crash_handler_work()
> >>>>>>>>>>>              if (rproc->state == RPROC_OFFLINE)
> >>>>>>>>>>>                      return // pm_relax() is not called
> >>>>>>>>>>>
> >>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> >>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
> >>>>>>>>>>
> >>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
> >>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
> >>>>>>>>>> move
> >>>>>>>>>> forward if that is the case.
> >>>>>>>>>>
> >>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
> >>>>>>>>>
> >>>>>>>>> The above senario is what customer is facing. crash hanppened while at
> >>>>>>>>> the same time shutdown is triggered.
> >>>>>>>>
> >>>>>>>> Unfortunately this is not enough details to address a problem as
> >>>>>>>> complex as this one.
> >>>>>>>>
> >>>>>>>>> And the device cannto goes to suspend state after that.
> >>>>>>>>> the subsystem can still be start normally after this.
> >>>>>>>>
> >>>>>>>> If the code flow I pasted above reflects the problem at hand, the
> >>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
> >>>>>>>> confirms my suspicions we will have to think about a better solution.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Mathiew,
> >>>>>>>
> >>>>>>> Could you pls have more details of any side effects other then power issue of
> >>>>>>> the current senario?
> >>>>>>> Why the current patch is not sufficient pls?
> >>>>>>>
> >>>>>>>
> >>>>>>> Have the current senario in details with rproc->lock information in details:
> >>>>>>>
> >>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
> >>>>>>> | rproc_report_crash()                    |
> >>>>>>> |          pm_stay_awake()                |
> >>>>>>> |          queue_work()                   |
> >>>>>>> |                                         |rproc_shutdown
> >>>>>>> |                                         |mutex_lock(&rproc->lock);
> >>>>>>> |                                         |rproc_stop()
> >>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> >>>>>>> |                                         |mutex_unlock(&rproc->lock);
> >>>>>>> |mutex_lock(&rproc->lock);                |
> >>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
> >>>>>>> |return // pm_relax() is not called       |rproc_boot
> >>>>>>> |mutex_unlock(&rproc->lock);              |
> >>>>>>> |                                         |mutex_lock(&rproc->lock);
> >>>>>>> |                                         |rproc_start()
> >>>>>>> |                                         |mutex_unlock(&rproc->lock);
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Agree with Mathieu, this is not so simple.
> >>>>>>
> >>>>>
> >>>>> Thanks for looking into this.
> >>>>>
> >>>>>> Here is my view  hoping I haven't missed a point in your discussion or
> >>>>>> an other corner cases.
> >>>>>>
> >>>>>> I tried to analyze the issues (in what follows, the term "condition" means
> >>>>>> the "if" condition in which Aiqun proposes to add the fix) :
> >>>>>>
> >>>>>> I can see 4 use cases with race condition
> >>>>>>
> >>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
> >>>>>>         => not a real use case as if the remote processor is crashed we
> >>>>>>                  should not have a second crash report
> >>>>>>
> >>>>>
> >>>>> That part is of great concern to me.  *Theoretically* we should not
> >>>>> get a new crash report while one has already been dispatched but the
> >>>>> current code accounts for this scenario and as such the possibility
> >>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
> >>>>> to be called multiple times before a single instance of
> >>>>> rproc_boot_recovery() is scheduled.
> >>>>
> >>>>
> >>>>>
> >>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
> >>>>>>      rproc_crash_handler_work
> >>>>>>       => rproc->state = RPROC_OFFLINE
> >>>>>>       => we enter in the "condition" and the pm_relax has to be called
> >>>>>>       => This commit fix should solve this use case
> >>>>>>
> >>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
> >>>>>>      rproc_crash_handler_work
> >>>>>>       => rproc->state = RPROC_DETACHED;
> >>>>>>       => we don't go in "the condition" and issue because the recovery reattach
> >>>>>>          to the remote processor
> >>>>>>       => but pm_relax is called
> >>>>>>       => probably need an extra fix to avoid to re-attach
> >>>>>>
> >>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
> >>>>>>       this one corresponds to an auto reboot of the remote processor, with a
> >>>>>>       new crash
> >>>>>>       => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> >>>>>>       4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> >>>>>>            => should call pm_relax if rproc->recovery_disabled = true
> >>>>>>            => commit does not work for this use case
> >>>>>>
> >>>>>>       4)b) rproc->state = RPROC_DETACHED if recovery fails
> >>>>>>           => error case with an unstable state
> >>>>>>           => how to differentiate it from the use case 3) ?
> >>>>>>           => introduce a RPROC_RECOVERY_FAIL state?
> >>>>>>
> >>>>>
> >>>>> The case where a recovery fails needs to be considered and is the
> >>>>> reason the original patch doesn't work.  Right now in
> >>>>> rproc_crash_handler_work(), it is not possible to differentiate
> >>>>> between a legitimate shutdown request (scenario #2 above) and a
> >>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> >>>>> would greatly simplify things.
> >>>>>
> >>>>> My initial evaluation had not considered the attach/detach scenarios -
> >>>>> thanks for adding that in the mix.
> >>>>>
> >>>>> Aiqun, please send a new patchset that adds a new remote processor
> >>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> >>>>> that set that takes attach/detach scenarios into account.  The code
> >>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
> >>>>> sure to properly rebase.
> >>>>>
> >>>> I will try.
> >>>>
> >>>>>>
> >>>>>> Then pm_stay_awake is called when the crash work is queued.
> >>>>>> It seems to me coherent to call the pm_relax in the work handler.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Here is a quick and dirty patch (not tested) that should take into account the
> >>>>>> main use cases ( except 1) and 4)b) )
> >>>>>>
> >>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> >>>>>>
> >>>>>>            mutex_lock(&rproc->lock);
> >>>>>>
> >>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> >>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> >>>>>> +           rproc->state == RPROC_DETACHED) {
> >>>>>>                    /* handle only the first crash detected */
> >>>>>> +
> >>>>>> +               /*
> >>>>>> +                * call pm-relax in following use cases:
> >>>>>> +                * - the remote processor has been stopped by the user
> >>>>>> +                * - the remote processor is detached
> >>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> >>>>>> +                */
> >>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> >>>>>> +                       pm_relax(rproc->dev.parent);
> >>>>>>                    mutex_unlock(&rproc->lock);
> >>>>>>                    return;
> >>>>>>            }
> >>>>>>
> >>>>>> Regards,
> >>>>>> Arnaud
> >>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Mathieu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
> >>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
> >>>>>>>>>>>>>> in crash handler to recover any more.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>        drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> >>>>>>>>>>>>>>        1 file changed, 11 insertions(+)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
> >>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> >>>>>>>>>>>>>> work_struct *work)
> >>>>>>>>>>>>>>               if (rproc->state == RPROC_CRASHED || rproc->state ==
> >>>>>>>>>>>>>> RPROC_OFFLINE) {
> >>>>>>>>>>>>>>                       /* handle only the first crash detected */
> >>>>>>>>>>>>>>                       mutex_unlock(&rproc->lock);
> >>>>>>>>>>>>>> +         /*
> >>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
> >>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
> >>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
> >>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>>>>>>>>>>> +          * and then unlock rproc->lock.
> >>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
> >>>>>>>>>>>>>> +          * process.
> >>>>>>>>>>>>>> +          */
> >>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
> >>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
> >>>>>>>>>>>>>>                       return;
> >>>>>>>>>>>>>>               }
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> 2.7.4
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Thx and BRs,
> >>>>>>>>>>>> Aiqun(Maria) Yu
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Thx and BRs,
> >>>>>>>>> Aiqun(Maria) Yu
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thx and BRs,
> >>>> Aiqun(Maria) Yu
> >>
> >>
> >> --
> >> Thx and BRs,
> >> Aiqun(Maria) Yu
>
>
> --
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 11, 2022, 12:52 a.m. UTC | #18
On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
> I had a couple of good discussions with our power management expert
> and even then, the way forward isn't as clear as I would have liked.
> I am currently travelling and as such don't have the required time to
> go into greater details, something I will be doing next week.
> 
Thx Mathieu for the info updated.
I'll wait for your update next week then.
let me know any initial questions that you have, perhaps I can also 
discuss that with our power team at the same time.
> On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>
>> Hi,
>> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
>>> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>>>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Let me think about this carefully.
>>>>>>
>>>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>>>> or just leave the pm_relax?
>>>>>
>>>>> Neither.
>>>>>
>>>>> When a recovery fail we don't want to call pm_relax().  The code in
>>>>> rproc_crash_handler_work() becomes:
>>>>>
>>>>>      if (rproc->state == RPROC_OFFLINE) {
>>>>>                    /* We have raced with rproc_shutdown() */
>>>>>                    pm_relax()
>>>>>              mutex_unlock(&rproc->lock);
>>>>>              return;
>>>>>            }
>>>>>
>>>>>      if (rproc->state == RPROC_CRASHED ||
>>>>>                rproc->state == RPROC_RECOVERY_FAILED) {
>>>>>              /* handle only the first crash detected */
>>>>>              mutex_unlock(&rproc->lock);
>>>>>              return;
>>>>>      }
>>>>>
>>>>>
>>>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>>>> the condition becomes:
>>>>>
>>>>>            /* State could have changed before we got the mutex */
>>>>>      if (rproc->state != RPROC_CRASHED &&
>>>>>                rproc->state != RPROC_RECOVERY_FAILED)
>>>>>              goto unlock_mutex;
>>>>>
>>>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>>>> above.
>>>>
>>>> If we didn't deal with the recovery failed case with correct pm_relax call,
>>>> it may left the device in a state that cannot enter to suspend state.
>>>
>>> That is what I am looking for.  We don't want to give the impression that
>>> everything is fine by allowing the device to suspend.  If the remote processor
>>> can't be recovered than it needs to be dealth with.
>> For the normal recovery failed case, it still need to do pm_relax to not
>> prevent the device goes to suspend. It is what in normal recovery failed
>> case we do in rproc_crash_handler_work as well.
>> rproc_crash_handler_work will not check the result of the
>> rproc_trigger_recovery return value, and will always do pm_relax.
>>
>> For current conconrency cornor case as well, it is better to consistant
>> with the current design of recovery fail senarios in normal cases.
>>
>> I personally agree that we shouldn't do nothing when it is a
>> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
>> check, because it maybe crash happened when it is trying to do the recovery.
>> So I suggested to do a continue try of trigger recovery again instead of
>> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>>
>>>
>>>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>>>> before the second crash handler call pm_stay_awake or not.
>>>>
>>>
>>> I've been thinking about that part.  I don't think adding a wake_count to
>>> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
>>> similar count happening in the PM runtime subsystem and that is what we should
>>> be using.  I have asked a power management expert at Linaro for guidance with
>>> this matter.  I should be able to get back to you with a way forward by the end
>>> of next week.
>>>
>> Thx for the specific date provided as well. I will wait until your reply
>> for next patchset then.
>>
>>>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>>>
>>>> struct rproc{
>>>> ...
>>>> atomic_t wake_count;
>>>> ...
>>>> }
>>>>
>>>> rproc_pm_stay_awake()
>>>> {
>>>>       atomic_inc(&wake_count);
>>>>       pm_stay_awake();
>>>> }
>>>>
>>>> rproc_pm_relax()
>>>> {
>>>>       if (atomic_dec_return(&wake_count) == 0)
>>>>               pm_stay_awake();
>>>> }
>>>>
>>>> can refer code like:
>>>>
>>>> rproc_report_crash()
>>>> {
>>>>       ...
>>>>       rproc_pm_stay_awake();
>>>>       queue_work();
>>>>       ...
>>>> }
>>>>
>>>> rproc_crash_handler_work()
>>>> {
>>>>       ...
>>>>       if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>>>                     /* We have raced with rproc_shutdown() */
>>>>                     rproc_pm_relax();
>>>>               mutex_unlock(&rproc->lock);
>>>>               return;
>>>>             }
>>>>       ...
>>>> }
>>>>
>>>>>
>>>>> Thanks,
>>>>> Mathieu
>>>>>
>>>>>>
>>>>>> recovery fail case 1:
>>>>>> |                                      |firstcrash interrupt issued
>>>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>>>> |          pm_stay_awake()             |          queue_work()
>>>>>> |          queue_work()                |rproc_crash_handler_work()
>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>> |                                      |rproc_stop()
>>>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>> |                                      |
>>>>>> |                                      |
>>>>>> |                                      |
>>>>>>
>>>>>> recovery fail case 2:
>>>>>> |                                      |firstcrash interrupt issued
>>>>>> |                                      | rproc_report_crash()
>>>>>> |                                      |          pm_stay_awake()
>>>>>> |                                      |          queue_work()
>>>>>> |                                      |rproc_crash_handler_work()
>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>> |                                      |rproc_stop()
>>>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>> |                                      |pm_relax()
>>>>>> |
>>>>>> | second crashed interrupt issued      |
>>>>>> | rproc_report_crash()                 |
>>>>>> |          pm_stay_awake()             |
>>>>>> |          queue_work()                |
>>>>>> |pm_stay_awake()
>>>>>> |mutex_lock(&rproc->lock);
>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>> |                                      |
>>>>>> |                                      |
>>>>>> |                                      |
>>>>>>
>>>>>> Maybe I can have:
>>>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>>>> fix current concurency issue.
>>>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>>>> work.
>>>>>> 3. handle RPROC_DETACHED case.
>>>>>>
>>>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>>>> do that
>>>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>>>> returning,
>>>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>>>> is set
>>>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>>>> finished.
>>>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>>>> recovering
>>>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>>>> from
>>>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>>>               pm_stay_awake()
>>>>>>>>>>>>>               queue_work() // current thread is suspended
>>>>>>>>>>>>>
>>>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>>>               rproc_stop()
>>>>>>>>>>>>>                       rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>
>>>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>>>               if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>                       return // pm_relax() is not called
>>>>>>>>>>>>>
>>>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>>>> move
>>>>>>>>>>>> forward if that is the case.
>>>>>>>>>>>>
>>>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>>>
>>>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>>>
>>>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>>>> complex as this one.
>>>>>>>>>>
>>>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>>>
>>>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Mathiew,
>>>>>>>>>
>>>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>>>> the current senario?
>>>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>>>
>>>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>>>> | rproc_report_crash()                    |
>>>>>>>>> |          pm_stay_awake()                |
>>>>>>>>> |          queue_work()                   |
>>>>>>>>> |                                         |rproc_shutdown
>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>> |                                         |rproc_stop()
>>>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>> |                                         |rproc_start()
>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for looking into this.
>>>>>>>
>>>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>>>> an other corner cases.
>>>>>>>>
>>>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>>>
>>>>>>>> I can see 4 use cases with race condition
>>>>>>>>
>>>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>>>          => not a real use case as if the remote processor is crashed we
>>>>>>>>                   should not have a second crash report
>>>>>>>>
>>>>>>>
>>>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>>>> get a new crash report while one has already been dispatched but the
>>>>>>> current code accounts for this scenario and as such the possibility
>>>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>>>> to be called multiple times before a single instance of
>>>>>>> rproc_boot_recovery() is scheduled.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>>>       rproc_crash_handler_work
>>>>>>>>        => rproc->state = RPROC_OFFLINE
>>>>>>>>        => we enter in the "condition" and the pm_relax has to be called
>>>>>>>>        => This commit fix should solve this use case
>>>>>>>>
>>>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>>>       rproc_crash_handler_work
>>>>>>>>        => rproc->state = RPROC_DETACHED;
>>>>>>>>        => we don't go in "the condition" and issue because the recovery reattach
>>>>>>>>           to the remote processor
>>>>>>>>        => but pm_relax is called
>>>>>>>>        => probably need an extra fix to avoid to re-attach
>>>>>>>>
>>>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>>>        this one corresponds to an auto reboot of the remote processor, with a
>>>>>>>>        new crash
>>>>>>>>        => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>>>        4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>>>             => should call pm_relax if rproc->recovery_disabled = true
>>>>>>>>             => commit does not work for this use case
>>>>>>>>
>>>>>>>>        4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>>>            => error case with an unstable state
>>>>>>>>            => how to differentiate it from the use case 3) ?
>>>>>>>>            => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>>>
>>>>>>>
>>>>>>> The case where a recovery fails needs to be considered and is the
>>>>>>> reason the original patch doesn't work.  Right now in
>>>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>>>> would greatly simplify things.
>>>>>>>
>>>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>>>> thanks for adding that in the mix.
>>>>>>>
>>>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>>>> sure to properly rebase.
>>>>>>>
>>>>>> I will try.
>>>>>>
>>>>>>>>
>>>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>>>
>>>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>>
>>>>>>>>             mutex_lock(&rproc->lock);
>>>>>>>>
>>>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>>>                     /* handle only the first crash detected */
>>>>>>>> +
>>>>>>>> +               /*
>>>>>>>> +                * call pm-relax in following use cases:
>>>>>>>> +                * - the remote processor has been stopped by the user
>>>>>>>> +                * - the remote processor is detached
>>>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>>>> +                */
>>>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>>>                     mutex_unlock(&rproc->lock);
>>>>>>>>                     return;
>>>>>>>>             }
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Arnaud
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>         drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>>>         1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>                if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>>>                        /* handle only the first crash detected */
>>>>>>>>>>>>>>>>                        mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>>                        return;
>>>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thx and BRs,
>>>>>> Aiqun(Maria) Yu
>>>>
>>>>
>>>> --
>>>> Thx and BRs,
>>>> Aiqun(Maria) Yu
>>
>>
>> --
>> Thx and BRs,
>> Aiqun(Maria) Yu
Mathieu Poirier Nov. 14, 2022, 9:18 p.m. UTC | #19
On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
> On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
> > I had a couple of good discussions with our power management expert
> > and even then, the way forward isn't as clear as I would have liked.
> > I am currently travelling and as such don't have the required time to
> > go into greater details, something I will be doing next week.
> > 
> Thx Mathieu for the info updated.
> I'll wait for your update next week then.
> let me know any initial questions that you have, perhaps I can also discuss
> that with our power team at the same time.

The problem is to determine exactly what the WQ_FREEZABLE flag does to the
rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
drained before the system is suspended.  What I understand from this is that if
two work items are queued and one is executing at the time a system suspend is
initiated, all three items will be executed before the system is allowed to be
suspended.  _If_ that is the case, there would not be a need to call
pm_stay_awake() and pm_relax() at all.

On the other hand, the PM resource I spoke to thought that in reality things
don't happen that way.  Taking the same above scenario where 2 work items are
queued and one is executing at the time of the suspend, only the work item that
is executing will be allowed to execute to completion before the system is
suspended.  The remaining two items that were queued will not execute.

If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
find another strategy to fix this situation.

Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
to move forward with this patchset.  Unfortunately, I currently do not have the
time to look into this.

If you want to take on this investigation, keep in mind that any conclusion will
need to be backed by a proof.  That can be debug messages on a console output or
a code reference in the workqueue core.  

[1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184


> > On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > 
> > > Hi,
> > > On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > > > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> > > > > On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Let me think about this carefully.
> > > > > > > 
> > > > > > > When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> > > > > > > or just leave the pm_relax?
> > > > > > 
> > > > > > Neither.
> > > > > > 
> > > > > > When a recovery fail we don't want to call pm_relax().  The code in
> > > > > > rproc_crash_handler_work() becomes:
> > > > > > 
> > > > > >      if (rproc->state == RPROC_OFFLINE) {
> > > > > >                    /* We have raced with rproc_shutdown() */
> > > > > >                    pm_relax()
> > > > > >              mutex_unlock(&rproc->lock);
> > > > > >              return;
> > > > > >            }
> > > > > > 
> > > > > >      if (rproc->state == RPROC_CRASHED ||
> > > > > >                rproc->state == RPROC_RECOVERY_FAILED) {
> > > > > >              /* handle only the first crash detected */
> > > > > >              mutex_unlock(&rproc->lock);
> > > > > >              return;
> > > > > >      }
> > > > > > 
> > > > > > 
> > > > > > RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> > > > > > rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> > > > > > recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> > > > > > the condition becomes:
> > > > > > 
> > > > > >            /* State could have changed before we got the mutex */
> > > > > >      if (rproc->state != RPROC_CRASHED &&
> > > > > >                rproc->state != RPROC_RECOVERY_FAILED)
> > > > > >              goto unlock_mutex;
> > > > > > 
> > > > > > Start with that and we can look at corner cases (if some exists) with a fresh
> > > > > > patchset.  Note that I have not addressed the attach/detach() scenario in the
> > > > > > above.
> > > > > 
> > > > > If we didn't deal with the recovery failed case with correct pm_relax call,
> > > > > it may left the device in a state that cannot enter to suspend state.
> > > > 
> > > > That is what I am looking for.  We don't want to give the impression that
> > > > everything is fine by allowing the device to suspend.  If the remote processor
> > > > can't be recovered than it needs to be dealth with.
> > > For the normal recovery failed case, it still need to do pm_relax to not
> > > prevent the device goes to suspend. It is what in normal recovery failed
> > > case we do in rproc_crash_handler_work as well.
> > > rproc_crash_handler_work will not check the result of the
> > > rproc_trigger_recovery return value, and will always do pm_relax.
> > > 
> > > For current conconrency cornor case as well, it is better to consistant
> > > with the current design of recovery fail senarios in normal cases.
> > > 
> > > I personally agree that we shouldn't do nothing when it is a
> > > RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
> > > check, because it maybe crash happened when it is trying to do the recovery.
> > > So I suggested to do a continue try of trigger recovery again instead of
> > > doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
> > > 
> > > > 
> > > > > Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> > > > > before the second crash handler call pm_stay_awake or not.
> > > > > 
> > > > 
> > > > I've been thinking about that part.  I don't think adding a wake_count to
> > > > control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> > > > similar count happening in the PM runtime subsystem and that is what we should
> > > > be using.  I have asked a power management expert at Linaro for guidance with
> > > > this matter.  I should be able to get back to you with a way forward by the end
> > > > of next week.
> > > > 
> > > Thx for the specific date provided as well. I will wait until your reply
> > > for next patchset then.
> > > 
> > > > > So, What about the atomic count along with pm_relax and pm_stay_awake ?
> > > > > 
> > > > > struct rproc{
> > > > > ...
> > > > > atomic_t wake_count;
> > > > > ...
> > > > > }
> > > > > 
> > > > > rproc_pm_stay_awake()
> > > > > {
> > > > >       atomic_inc(&wake_count);
> > > > >       pm_stay_awake();
> > > > > }
> > > > > 
> > > > > rproc_pm_relax()
> > > > > {
> > > > >       if (atomic_dec_return(&wake_count) == 0)
> > > > >               pm_stay_awake();
> > > > > }
> > > > > 
> > > > > can refer code like:
> > > > > 
> > > > > rproc_report_crash()
> > > > > {
> > > > >       ...
> > > > >       rproc_pm_stay_awake();
> > > > >       queue_work();
> > > > >       ...
> > > > > }
> > > > > 
> > > > > rproc_crash_handler_work()
> > > > > {
> > > > >       ...
> > > > >       if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> > > > >                     /* We have raced with rproc_shutdown() */
> > > > >                     rproc_pm_relax();
> > > > >               mutex_unlock(&rproc->lock);
> > > > >               return;
> > > > >             }
> > > > >       ...
> > > > > }
> > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > Mathieu
> > > > > > 
> > > > > > > 
> > > > > > > recovery fail case 1:
> > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > | second crashed interrupt issued      | rproc_report_crash()
> > > > > > > | rproc_report_crash()                 |          pm_stay_awake()
> > > > > > > |          pm_stay_awake()             |          queue_work()
> > > > > > > |          queue_work()                |rproc_crash_handler_work()
> > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > |                                      |rproc_stop()
> > > > > > > |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > |mutex_lock(&rproc->lock);             |pm_relax()
> > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > |                                      |
> > > > > > > |                                      |
> > > > > > > |                                      |
> > > > > > > 
> > > > > > > recovery fail case 2:
> > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > |                                      | rproc_report_crash()
> > > > > > > |                                      |          pm_stay_awake()
> > > > > > > |                                      |          queue_work()
> > > > > > > |                                      |rproc_crash_handler_work()
> > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > |                                      |rproc_stop()
> > > > > > > |                                      |rproc->state = RPROC_OFFLINE;
> > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > |                                      |pm_relax()
> > > > > > > |
> > > > > > > | second crashed interrupt issued      |
> > > > > > > | rproc_report_crash()                 |
> > > > > > > |          pm_stay_awake()             |
> > > > > > > |          queue_work()                |
> > > > > > > |pm_stay_awake()
> > > > > > > |mutex_lock(&rproc->lock);
> > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > |                                      |
> > > > > > > |                                      |
> > > > > > > |                                      |
> > > > > > > 
> > > > > > > Maybe I can have:
> > > > > > > 1. the pm_stay_awake and pm_relax with count based and call with paired for
> > > > > > > fix current concurency issue.
> > > > > > > 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> > > > > > > work.
> > > > > > > 3. handle RPROC_DETACHED case.
> > > > > > > 
> > > > > > > On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > > > > > > > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > > > > > > > <arnaud.pouliquen@foss.st.com> wrote:
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > > > > > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > > > > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > > > Hi Mathieu,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > > > > > > > do that
> > > > > > > > > > > > > > > > on the mailing list.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > went wrong.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > > > > > > > returning,
> > > > > > > > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > > > > > > > is set
> > > > > > > > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > > > > > > > things along.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > > > > > > > finished.
> > > > > > > > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > > > > > > > recovering
> > > > > > > > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > rproc_report_crash()
> > > > > > > > > > > > > >               pm_stay_awake()
> > > > > > > > > > > > > >               queue_work() // current thread is suspended
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > rproc_shutdown()
> > > > > > > > > > > > > >               rproc_stop()
> > > > > > > > > > > > > >                       rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > > > > > >               if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > >                       return // pm_relax() is not called
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > > > > > > > move
> > > > > > > > > > > > > forward if that is the case.
> > > > > > > > > > > > > 
> > > > > > > > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > > > > > > > 
> > > > > > > > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > > > > > > > the same time shutdown is triggered.
> > > > > > > > > > > 
> > > > > > > > > > > Unfortunately this is not enough details to address a problem as
> > > > > > > > > > > complex as this one.
> > > > > > > > > > > 
> > > > > > > > > > > > And the device cannto goes to suspend state after that.
> > > > > > > > > > > > the subsystem can still be start normally after this.
> > > > > > > > > > > 
> > > > > > > > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > > > > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > > > > > > > confirms my suspicions we will have to think about a better solution.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Hi Mathiew,
> > > > > > > > > > 
> > > > > > > > > > Could you pls have more details of any side effects other then power issue of
> > > > > > > > > > the current senario?
> > > > > > > > > > Why the current patch is not sufficient pls?
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Have the current senario in details with rproc->lock information in details:
> > > > > > > > > > 
> > > > > > > > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > > > > > > > | rproc_report_crash()                    |
> > > > > > > > > > |          pm_stay_awake()                |
> > > > > > > > > > |          queue_work()                   |
> > > > > > > > > > |                                         |rproc_shutdown
> > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > |                                         |rproc_stop()
> > > > > > > > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > |mutex_lock(&rproc->lock);                |
> > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > > > > > > > |return // pm_relax() is not called       |rproc_boot
> > > > > > > > > > |mutex_unlock(&rproc->lock);              |
> > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > |                                         |rproc_start()
> > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Agree with Mathieu, this is not so simple.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks for looking into this.
> > > > > > > > 
> > > > > > > > > Here is my view  hoping I haven't missed a point in your discussion or
> > > > > > > > > an other corner cases.
> > > > > > > > > 
> > > > > > > > > I tried to analyze the issues (in what follows, the term "condition" means
> > > > > > > > > the "if" condition in which Aiqun proposes to add the fix) :
> > > > > > > > > 
> > > > > > > > > I can see 4 use cases with race condition
> > > > > > > > > 
> > > > > > > > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > > > > > > > >          => not a real use case as if the remote processor is crashed we
> > > > > > > > >                   should not have a second crash report
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That part is of great concern to me.  *Theoretically* we should not
> > > > > > > > get a new crash report while one has already been dispatched but the
> > > > > > > > current code accounts for this scenario and as such the possibility
> > > > > > > > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > > > > > > > to be called multiple times before a single instance of
> > > > > > > > rproc_boot_recovery() is scheduled.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > > > > > > > >       rproc_crash_handler_work
> > > > > > > > >        => rproc->state = RPROC_OFFLINE
> > > > > > > > >        => we enter in the "condition" and the pm_relax has to be called
> > > > > > > > >        => This commit fix should solve this use case
> > > > > > > > > 
> > > > > > > > > 3) rproc_detach executed between the queue of the crash work and the call of
> > > > > > > > >       rproc_crash_handler_work
> > > > > > > > >        => rproc->state = RPROC_DETACHED;
> > > > > > > > >        => we don't go in "the condition" and issue because the recovery reattach
> > > > > > > > >           to the remote processor
> > > > > > > > >        => but pm_relax is called
> > > > > > > > >        => probably need an extra fix to avoid to re-attach
> > > > > > > > > 
> > > > > > > > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > > > > > > > >        this one corresponds to an auto reboot of the remote processor, with a
> > > > > > > > >        new crash
> > > > > > > > >        => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > > > > > > > >        4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > > > > > > > >             => should call pm_relax if rproc->recovery_disabled = true
> > > > > > > > >             => commit does not work for this use case
> > > > > > > > > 
> > > > > > > > >        4)b) rproc->state = RPROC_DETACHED if recovery fails
> > > > > > > > >            => error case with an unstable state
> > > > > > > > >            => how to differentiate it from the use case 3) ?
> > > > > > > > >            => introduce a RPROC_RECOVERY_FAIL state?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > The case where a recovery fails needs to be considered and is the
> > > > > > > > reason the original patch doesn't work.  Right now in
> > > > > > > > rproc_crash_handler_work(), it is not possible to differentiate
> > > > > > > > between a legitimate shutdown request (scenario #2 above) and a
> > > > > > > > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > > > > > > > would greatly simplify things.
> > > > > > > > 
> > > > > > > > My initial evaluation had not considered the attach/detach scenarios -
> > > > > > > > thanks for adding that in the mix.
> > > > > > > > 
> > > > > > > > Aiqun, please send a new patchset that adds a new remote processor
> > > > > > > > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > > > > > > > that set that takes attach/detach scenarios into account.  The code
> > > > > > > > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > > > > > > > sure to properly rebase.
> > > > > > > > 
> > > > > > > I will try.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > Then pm_stay_awake is called when the crash work is queued.
> > > > > > > > > It seems to me coherent to call the pm_relax in the work handler.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Here is a quick and dirty patch (not tested) that should take into account the
> > > > > > > > > main use cases ( except 1) and 4)b) )
> > > > > > > > > 
> > > > > > > > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > > > > > > > 
> > > > > > > > >             mutex_lock(&rproc->lock);
> > > > > > > > > 
> > > > > > > > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > > > > > > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > > > > > > > +           rproc->state == RPROC_DETACHED) {
> > > > > > > > >                     /* handle only the first crash detected */
> > > > > > > > > +
> > > > > > > > > +               /*
> > > > > > > > > +                * call pm-relax in following use cases:
> > > > > > > > > +                * - the remote processor has been stopped by the user
> > > > > > > > > +                * - the remote processor is detached
> > > > > > > > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > > > > > > > +                */
> > > > > > > > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > > > > > > > +                       pm_relax(rproc->dev.parent);
> > > > > > > > >                     mutex_unlock(&rproc->lock);
> > > > > > > > >                     return;
> > > > > > > > >             }
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Arnaud
> > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Mathieu
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >         drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > > > > > > > >         1 file changed, 11 insertions(+)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > > > > > > > work_struct *work)
> > > > > > > > > > > > > > > > >                if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > > > > > > > >                        /* handle only the first crash detected */
> > > > > > > > > > > > > > > > >                        mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > > > +         /*
> > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > > > > > > > +          * process.
> > > > > > > > > > > > > > > > > +          */
> > > > > > > > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > > > > > > > >                        return;
> > > > > > > > > > > > > > > > >                }
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > 2.7.4
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Thx and BRs,
> > > > > > > Aiqun(Maria) Yu
> > > > > 
> > > > > 
> > > > > --
> > > > > Thx and BRs,
> > > > > Aiqun(Maria) Yu
> > > 
> > > 
> > > --
> > > Thx and BRs,
> > > Aiqun(Maria) Yu
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 15, 2022, 1:30 a.m. UTC | #20
Hi,

On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
> On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
>> On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
>>> I had a couple of good discussions with our power management expert
>>> and even then, the way forward isn't as clear as I would have liked.
>>> I am currently travelling and as such don't have the required time to
>>> go into greater details, something I will be doing next week.
>>>
>> Thx Mathieu for the info updated.
>> I'll wait for your update next week then.
>> let me know any initial questions that you have, perhaps I can also discuss
>> that with our power team at the same time.
> 
> The problem is to determine exactly what the WQ_FREEZABLE flag does to the
> rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
> drained before the system is suspended.  What I understand from this is that if
> two work items are queued and one is executing at the time a system suspend is
> initiated, all three items will be executed before the system is allowed to be
> suspended.  _If_ that is the case, there would not be a need to call
> pm_stay_awake() and pm_relax() at all.
> 
> On the other hand, the PM resource I spoke to thought that in reality things
> don't happen that way.  Taking the same above scenario where 2 work items are
> queued and one is executing at the time of the suspend, only the work item that
> is executing will be allowed to execute to completion before the system is
> suspended.  The remaining two items that were queued will not execute.
> 
> If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
> find another strategy to fix this situation.
> 
> Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
> to move forward with this patchset.  Unfortunately, I currently do not have the
> time to look into this.
I had a check on the WQ_FREEZABLE flag, here is my understanding:

when the interrupt happened, it still need pm_stay_awake to make sure 
queue_work action can active the work instead of susepend the device.

1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight 
work items is set to 0.

[1]. 
https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748

2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is 
still freeze_workqueues_busy.
[2]. 
https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270

3. When in queue_work, if max_active is 0, when do queue_work it will 
not actually active the work.
[3]. 
https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418

for the current issue, the work is already complete and forget to set 
pm_relax in some condition that make the system cannot be suspended.

> 
> If you want to take on this investigation, keep in mind that any conclusion will
> need to be backed by a proof.  That can be debug messages on a console output or
> a code reference in the workqueue core.
> 
> [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
> 
> 
>>> On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>
>>>> Hi,
>>>> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
>>>>> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>>>>>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Let me think about this carefully.
>>>>>>>>
>>>>>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>>>>>> or just leave the pm_relax?
>>>>>>>
>>>>>>> Neither.
>>>>>>>
>>>>>>> When a recovery fail we don't want to call pm_relax().  The code in
>>>>>>> rproc_crash_handler_work() becomes:
>>>>>>>
>>>>>>>       if (rproc->state == RPROC_OFFLINE) {
>>>>>>>                     /* We have raced with rproc_shutdown() */
>>>>>>>                     pm_relax()
>>>>>>>               mutex_unlock(&rproc->lock);
>>>>>>>               return;
>>>>>>>             }
>>>>>>>
>>>>>>>       if (rproc->state == RPROC_CRASHED ||
>>>>>>>                 rproc->state == RPROC_RECOVERY_FAILED) {
>>>>>>>               /* handle only the first crash detected */
>>>>>>>               mutex_unlock(&rproc->lock);
>>>>>>>               return;
>>>>>>>       }
>>>>>>>
>>>>>>>
>>>>>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>>>>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>>>>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>>>>>> the condition becomes:
>>>>>>>
>>>>>>>             /* State could have changed before we got the mutex */
>>>>>>>       if (rproc->state != RPROC_CRASHED &&
>>>>>>>                 rproc->state != RPROC_RECOVERY_FAILED)
>>>>>>>               goto unlock_mutex;
>>>>>>>
>>>>>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>>>>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>>>>>> above.
>>>>>>
>>>>>> If we didn't deal with the recovery failed case with correct pm_relax call,
>>>>>> it may left the device in a state that cannot enter to suspend state.
>>>>>
>>>>> That is what I am looking for.  We don't want to give the impression that
>>>>> everything is fine by allowing the device to suspend.  If the remote processor
>>>>> can't be recovered than it needs to be dealth with.
>>>> For the normal recovery failed case, it still need to do pm_relax to not
>>>> prevent the device goes to suspend. It is what in normal recovery failed
>>>> case we do in rproc_crash_handler_work as well.
>>>> rproc_crash_handler_work will not check the result of the
>>>> rproc_trigger_recovery return value, and will always do pm_relax.
>>>>
>>>> For current conconrency cornor case as well, it is better to consistant
>>>> with the current design of recovery fail senarios in normal cases.
>>>>
>>>> I personally agree that we shouldn't do nothing when it is a
>>>> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
>>>> check, because it maybe crash happened when it is trying to do the recovery.
>>>> So I suggested to do a continue try of trigger recovery again instead of
>>>> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>>>>
>>>>>
>>>>>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>>>>>> before the second crash handler call pm_stay_awake or not.
>>>>>>
>>>>>
>>>>> I've been thinking about that part.  I don't think adding a wake_count to
>>>>> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
>>>>> similar count happening in the PM runtime subsystem and that is what we should
>>>>> be using.  I have asked a power management expert at Linaro for guidance with
>>>>> this matter.  I should be able to get back to you with a way forward by the end
>>>>> of next week.
>>>>>
>>>> Thx for the specific date provided as well. I will wait until your reply
>>>> for next patchset then.
>>>>
>>>>>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>>>>>
>>>>>> struct rproc{
>>>>>> ...
>>>>>> atomic_t wake_count;
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> rproc_pm_stay_awake()
>>>>>> {
>>>>>>        atomic_inc(&wake_count);
>>>>>>        pm_stay_awake();
>>>>>> }
>>>>>>
>>>>>> rproc_pm_relax()
>>>>>> {
>>>>>>        if (atomic_dec_return(&wake_count) == 0)
>>>>>>                pm_stay_awake();
>>>>>> }
>>>>>>
>>>>>> can refer code like:
>>>>>>
>>>>>> rproc_report_crash()
>>>>>> {
>>>>>>        ...
>>>>>>        rproc_pm_stay_awake();
>>>>>>        queue_work();
>>>>>>        ...
>>>>>> }
>>>>>>
>>>>>> rproc_crash_handler_work()
>>>>>> {
>>>>>>        ...
>>>>>>        if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>>>>>                      /* We have raced with rproc_shutdown() */
>>>>>>                      rproc_pm_relax();
>>>>>>                mutex_unlock(&rproc->lock);
>>>>>>                return;
>>>>>>              }
>>>>>>        ...
>>>>>> }
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mathieu
>>>>>>>
>>>>>>>>
>>>>>>>> recovery fail case 1:
>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>>>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>>>>>> |          pm_stay_awake()             |          queue_work()
>>>>>>>> |          queue_work()                |rproc_crash_handler_work()
>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>> |                                      |rproc_stop()
>>>>>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>> |                                      |
>>>>>>>> |                                      |
>>>>>>>> |                                      |
>>>>>>>>
>>>>>>>> recovery fail case 2:
>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>> |                                      | rproc_report_crash()
>>>>>>>> |                                      |          pm_stay_awake()
>>>>>>>> |                                      |          queue_work()
>>>>>>>> |                                      |rproc_crash_handler_work()
>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>> |                                      |rproc_stop()
>>>>>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>> |                                      |pm_relax()
>>>>>>>> |
>>>>>>>> | second crashed interrupt issued      |
>>>>>>>> | rproc_report_crash()                 |
>>>>>>>> |          pm_stay_awake()             |
>>>>>>>> |          queue_work()                |
>>>>>>>> |pm_stay_awake()
>>>>>>>> |mutex_lock(&rproc->lock);
>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>> |                                      |
>>>>>>>> |                                      |
>>>>>>>> |                                      |
>>>>>>>>
>>>>>>>> Maybe I can have:
>>>>>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>>>>>> fix current concurency issue.
>>>>>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>>>>>> work.
>>>>>>>> 3. handle RPROC_DETACHED case.
>>>>>>>>
>>>>>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>>>>>> do that
>>>>>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>>>>>> returning,
>>>>>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>>>>>> is set
>>>>>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>>>>>> finished.
>>>>>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>>>>>> recovering
>>>>>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>>>>>                pm_stay_awake()
>>>>>>>>>>>>>>>                queue_work() // current thread is suspended
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>>>>>                rproc_stop()
>>>>>>>>>>>>>>>                        rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>>>>>                if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>                        return // pm_relax() is not called
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>>>>>> move
>>>>>>>>>>>>>> forward if that is the case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>>>>>> complex as this one.
>>>>>>>>>>>>
>>>>>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>>>>>
>>>>>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Mathiew,
>>>>>>>>>>>
>>>>>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>>>>>> the current senario?
>>>>>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>>>>>
>>>>>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>>>>>> | rproc_report_crash()                    |
>>>>>>>>>>> |          pm_stay_awake()                |
>>>>>>>>>>> |          queue_work()                   |
>>>>>>>>>>> |                                         |rproc_shutdown
>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>> |                                         |rproc_stop()
>>>>>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>> |                                         |rproc_start()
>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for looking into this.
>>>>>>>>>
>>>>>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>>>>>> an other corner cases.
>>>>>>>>>>
>>>>>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>>>>>
>>>>>>>>>> I can see 4 use cases with race condition
>>>>>>>>>>
>>>>>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>>>>>           => not a real use case as if the remote processor is crashed we
>>>>>>>>>>                    should not have a second crash report
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>>>>>> get a new crash report while one has already been dispatched but the
>>>>>>>>> current code accounts for this scenario and as such the possibility
>>>>>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>>>>>> to be called multiple times before a single instance of
>>>>>>>>> rproc_boot_recovery() is scheduled.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>>>>>        rproc_crash_handler_work
>>>>>>>>>>         => rproc->state = RPROC_OFFLINE
>>>>>>>>>>         => we enter in the "condition" and the pm_relax has to be called
>>>>>>>>>>         => This commit fix should solve this use case
>>>>>>>>>>
>>>>>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>>>>>        rproc_crash_handler_work
>>>>>>>>>>         => rproc->state = RPROC_DETACHED;
>>>>>>>>>>         => we don't go in "the condition" and issue because the recovery reattach
>>>>>>>>>>            to the remote processor
>>>>>>>>>>         => but pm_relax is called
>>>>>>>>>>         => probably need an extra fix to avoid to re-attach
>>>>>>>>>>
>>>>>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>>>>>         this one corresponds to an auto reboot of the remote processor, with a
>>>>>>>>>>         new crash
>>>>>>>>>>         => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>>>>>         4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>>>>>              => should call pm_relax if rproc->recovery_disabled = true
>>>>>>>>>>              => commit does not work for this use case
>>>>>>>>>>
>>>>>>>>>>         4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>>>>>             => error case with an unstable state
>>>>>>>>>>             => how to differentiate it from the use case 3) ?
>>>>>>>>>>             => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The case where a recovery fails needs to be considered and is the
>>>>>>>>> reason the original patch doesn't work.  Right now in
>>>>>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>>>>>> would greatly simplify things.
>>>>>>>>>
>>>>>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>>>>>> thanks for adding that in the mix.
>>>>>>>>>
>>>>>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>>>>>> sure to properly rebase.
>>>>>>>>>
>>>>>>>> I will try.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>>>>>
>>>>>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>>>>
>>>>>>>>>>              mutex_lock(&rproc->lock);
>>>>>>>>>>
>>>>>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>>>>>                      /* handle only the first crash detected */
>>>>>>>>>> +
>>>>>>>>>> +               /*
>>>>>>>>>> +                * call pm-relax in following use cases:
>>>>>>>>>> +                * - the remote processor has been stopped by the user
>>>>>>>>>> +                * - the remote processor is detached
>>>>>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>>>>>> +                */
>>>>>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>>>>>                      mutex_unlock(&rproc->lock);
>>>>>>>>>>                      return;
>>>>>>>>>>              }
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Arnaud
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>          drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>>>>>          1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>                 if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>>>>>                         /* handle only the first crash detected */
>>>>>>>>>>>>>>>>>>                         mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>>>>                         return;
>>>>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thx and BRs,
>>>>>>>> Aiqun(Maria) Yu
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thx and BRs,
>>>>>> Aiqun(Maria) Yu
>>>>
>>>>
>>>> --
>>>> Thx and BRs,
>>>> Aiqun(Maria) Yu
>>
>>
>> -- 
>> Thx and BRs,
>> Aiqun(Maria) Yu
Mathieu Poirier Nov. 18, 2022, 6:52 p.m. UTC | #21
On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
> Hi,
> 
> On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
> > On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
> > > On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
> > > > I had a couple of good discussions with our power management expert
> > > > and even then, the way forward isn't as clear as I would have liked.
> > > > I am currently travelling and as such don't have the required time to
> > > > go into greater details, something I will be doing next week.
> > > > 
> > > Thx Mathieu for the info updated.
> > > I'll wait for your update next week then.
> > > let me know any initial questions that you have, perhaps I can also discuss
> > > that with our power team at the same time.
> > 
> > The problem is to determine exactly what the WQ_FREEZABLE flag does to the
> > rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
> > drained before the system is suspended.  What I understand from this is that if
> > two work items are queued and one is executing at the time a system suspend is
> > initiated, all three items will be executed before the system is allowed to be
> > suspended.  _If_ that is the case, there would not be a need to call
> > pm_stay_awake() and pm_relax() at all.
> > 
> > On the other hand, the PM resource I spoke to thought that in reality things
> > don't happen that way.  Taking the same above scenario where 2 work items are
> > queued and one is executing at the time of the suspend, only the work item that
> > is executing will be allowed to execute to completion before the system is
> > suspended.  The remaining two items that were queued will not execute.
> > 
> > If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
> > find another strategy to fix this situation.
> > 
> > Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
> > to move forward with this patchset.  Unfortunately, I currently do not have the
> > time to look into this.
> I had a check on the WQ_FREEZABLE flag, here is my understanding:
> 
> when the interrupt happened, it still need pm_stay_awake to make sure
> queue_work action can active the work instead of susepend the device.
> 
> 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
> work items is set to 0.
> 
> [1].
> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
> 
> 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
> freeze_workqueues_busy.
> [2].
> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
> 
> 3. When in queue_work, if max_active is 0, when do queue_work it will not
> actually active the work.
> [3].
> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
> 
> for the current issue, the work is already complete and forget to set
> pm_relax in some condition that make the system cannot be suspended.

What you have above does not describe what happens to work already queue to a
WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
important thing.  

I ended up doing that investigation myself and found the documentation on
WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
power management team and will get back to you when we reach a conclusion.

> 
> > 
> > If you want to take on this investigation, keep in mind that any conclusion will
> > need to be backed by a proof.  That can be debug messages on a console output or
> > a code reference in the workqueue core.
> > 
> > [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
> > 
> > 
> > > > On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > 
> > > > > Hi,
> > > > > On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > > > > > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Let me think about this carefully.
> > > > > > > > > 
> > > > > > > > > When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> > > > > > > > > or just leave the pm_relax?
> > > > > > > > 
> > > > > > > > Neither.
> > > > > > > > 
> > > > > > > > When a recovery fail we don't want to call pm_relax().  The code in
> > > > > > > > rproc_crash_handler_work() becomes:
> > > > > > > > 
> > > > > > > >       if (rproc->state == RPROC_OFFLINE) {
> > > > > > > >                     /* We have raced with rproc_shutdown() */
> > > > > > > >                     pm_relax()
> > > > > > > >               mutex_unlock(&rproc->lock);
> > > > > > > >               return;
> > > > > > > >             }
> > > > > > > > 
> > > > > > > >       if (rproc->state == RPROC_CRASHED ||
> > > > > > > >                 rproc->state == RPROC_RECOVERY_FAILED) {
> > > > > > > >               /* handle only the first crash detected */
> > > > > > > >               mutex_unlock(&rproc->lock);
> > > > > > > >               return;
> > > > > > > >       }
> > > > > > > > 
> > > > > > > > 
> > > > > > > > RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> > > > > > > > rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> > > > > > > > recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> > > > > > > > the condition becomes:
> > > > > > > > 
> > > > > > > >             /* State could have changed before we got the mutex */
> > > > > > > >       if (rproc->state != RPROC_CRASHED &&
> > > > > > > >                 rproc->state != RPROC_RECOVERY_FAILED)
> > > > > > > >               goto unlock_mutex;
> > > > > > > > 
> > > > > > > > Start with that and we can look at corner cases (if some exists) with a fresh
> > > > > > > > patchset.  Note that I have not addressed the attach/detach() scenario in the
> > > > > > > > above.
> > > > > > > 
> > > > > > > If we didn't deal with the recovery failed case with correct pm_relax call,
> > > > > > > it may left the device in a state that cannot enter to suspend state.
> > > > > > 
> > > > > > That is what I am looking for.  We don't want to give the impression that
> > > > > > everything is fine by allowing the device to suspend.  If the remote processor
> > > > > > can't be recovered than it needs to be dealth with.
> > > > > For the normal recovery failed case, it still need to do pm_relax to not
> > > > > prevent the device goes to suspend. It is what in normal recovery failed
> > > > > case we do in rproc_crash_handler_work as well.
> > > > > rproc_crash_handler_work will not check the result of the
> > > > > rproc_trigger_recovery return value, and will always do pm_relax.
> > > > > 
> > > > > For current conconrency cornor case as well, it is better to consistant
> > > > > with the current design of recovery fail senarios in normal cases.
> > > > > 
> > > > > I personally agree that we shouldn't do nothing when it is a
> > > > > RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
> > > > > check, because it maybe crash happened when it is trying to do the recovery.
> > > > > So I suggested to do a continue try of trigger recovery again instead of
> > > > > doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
> > > > > 
> > > > > > 
> > > > > > > Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> > > > > > > before the second crash handler call pm_stay_awake or not.
> > > > > > > 
> > > > > > 
> > > > > > I've been thinking about that part.  I don't think adding a wake_count to
> > > > > > control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> > > > > > similar count happening in the PM runtime subsystem and that is what we should
> > > > > > be using.  I have asked a power management expert at Linaro for guidance with
> > > > > > this matter.  I should be able to get back to you with a way forward by the end
> > > > > > of next week.
> > > > > > 
> > > > > Thx for the specific date provided as well. I will wait until your reply
> > > > > for next patchset then.
> > > > > 
> > > > > > > So, What about the atomic count along with pm_relax and pm_stay_awake ?
> > > > > > > 
> > > > > > > struct rproc{
> > > > > > > ...
> > > > > > > atomic_t wake_count;
> > > > > > > ...
> > > > > > > }
> > > > > > > 
> > > > > > > rproc_pm_stay_awake()
> > > > > > > {
> > > > > > >        atomic_inc(&wake_count);
> > > > > > >        pm_stay_awake();
> > > > > > > }
> > > > > > > 
> > > > > > > rproc_pm_relax()
> > > > > > > {
> > > > > > >        if (atomic_dec_return(&wake_count) == 0)
> > > > > > >                pm_stay_awake();
> > > > > > > }
> > > > > > > 
> > > > > > > can refer code like:
> > > > > > > 
> > > > > > > rproc_report_crash()
> > > > > > > {
> > > > > > >        ...
> > > > > > >        rproc_pm_stay_awake();
> > > > > > >        queue_work();
> > > > > > >        ...
> > > > > > > }
> > > > > > > 
> > > > > > > rproc_crash_handler_work()
> > > > > > > {
> > > > > > >        ...
> > > > > > >        if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> > > > > > >                      /* We have raced with rproc_shutdown() */
> > > > > > >                      rproc_pm_relax();
> > > > > > >                mutex_unlock(&rproc->lock);
> > > > > > >                return;
> > > > > > >              }
> > > > > > >        ...
> > > > > > > }
> > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Mathieu
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > recovery fail case 1:
> > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > | second crashed interrupt issued      | rproc_report_crash()
> > > > > > > > > | rproc_report_crash()                 |          pm_stay_awake()
> > > > > > > > > |          pm_stay_awake()             |          queue_work()
> > > > > > > > > |          queue_work()                |rproc_crash_handler_work()
> > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > |mutex_lock(&rproc->lock);             |pm_relax()
> > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > |                                      |
> > > > > > > > > |                                      |
> > > > > > > > > |                                      |
> > > > > > > > > 
> > > > > > > > > recovery fail case 2:
> > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > |                                      | rproc_report_crash()
> > > > > > > > > |                                      |          pm_stay_awake()
> > > > > > > > > |                                      |          queue_work()
> > > > > > > > > |                                      |rproc_crash_handler_work()
> > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > |                                      |rproc->state = RPROC_OFFLINE;
> > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > |                                      |pm_relax()
> > > > > > > > > |
> > > > > > > > > | second crashed interrupt issued      |
> > > > > > > > > | rproc_report_crash()                 |
> > > > > > > > > |          pm_stay_awake()             |
> > > > > > > > > |          queue_work()                |
> > > > > > > > > |pm_stay_awake()
> > > > > > > > > |mutex_lock(&rproc->lock);
> > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > |                                      |
> > > > > > > > > |                                      |
> > > > > > > > > |                                      |
> > > > > > > > > 
> > > > > > > > > Maybe I can have:
> > > > > > > > > 1. the pm_stay_awake and pm_relax with count based and call with paired for
> > > > > > > > > fix current concurency issue.
> > > > > > > > > 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> > > > > > > > > work.
> > > > > > > > > 3. handle RPROC_DETACHED case.
> > > > > > > > > 
> > > > > > > > > On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > > > > > > > > > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > > > > > > > > > <arnaud.pouliquen@foss.st.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > > > > > Hi Mathieu,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > > > > > > > > > do that
> > > > > > > > > > > > > > > > > > on the mailing list.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > > > went wrong.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > > > > > > > > > returning,
> > > > > > > > > > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > > > > > > > > > is set
> > > > > > > > > > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > > > > > > > > > things along.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > > > > > > > > > finished.
> > > > > > > > > > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > > > > > > > > > recovering
> > > > > > > > > > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > rproc_report_crash()
> > > > > > > > > > > > > > > >                pm_stay_awake()
> > > > > > > > > > > > > > > >                queue_work() // current thread is suspended
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > rproc_shutdown()
> > > > > > > > > > > > > > > >                rproc_stop()
> > > > > > > > > > > > > > > >                        rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > > > > > > > >                if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > >                        return // pm_relax() is not called
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > > > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > > > > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > forward if that is the case.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > > > > > > > > > the same time shutdown is triggered.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Unfortunately this is not enough details to address a problem as
> > > > > > > > > > > > > complex as this one.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > And the device cannto goes to suspend state after that.
> > > > > > > > > > > > > > the subsystem can still be start normally after this.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > > > > > > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > > > > > > > > > confirms my suspicions we will have to think about a better solution.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Mathiew,
> > > > > > > > > > > > 
> > > > > > > > > > > > Could you pls have more details of any side effects other then power issue of
> > > > > > > > > > > > the current senario?
> > > > > > > > > > > > Why the current patch is not sufficient pls?
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Have the current senario in details with rproc->lock information in details:
> > > > > > > > > > > > 
> > > > > > > > > > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > > > > > > > > > | rproc_report_crash()                    |
> > > > > > > > > > > > |          pm_stay_awake()                |
> > > > > > > > > > > > |          queue_work()                   |
> > > > > > > > > > > > |                                         |rproc_shutdown
> > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > |                                         |rproc_stop()
> > > > > > > > > > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > |mutex_lock(&rproc->lock);                |
> > > > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > > > > > > > > > |return // pm_relax() is not called       |rproc_boot
> > > > > > > > > > > > |mutex_unlock(&rproc->lock);              |
> > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > |                                         |rproc_start()
> > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Agree with Mathieu, this is not so simple.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Thanks for looking into this.
> > > > > > > > > > 
> > > > > > > > > > > Here is my view  hoping I haven't missed a point in your discussion or
> > > > > > > > > > > an other corner cases.
> > > > > > > > > > > 
> > > > > > > > > > > I tried to analyze the issues (in what follows, the term "condition" means
> > > > > > > > > > > the "if" condition in which Aiqun proposes to add the fix) :
> > > > > > > > > > > 
> > > > > > > > > > > I can see 4 use cases with race condition
> > > > > > > > > > > 
> > > > > > > > > > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > > > > > > > > > >           => not a real use case as if the remote processor is crashed we
> > > > > > > > > > >                    should not have a second crash report
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > That part is of great concern to me.  *Theoretically* we should not
> > > > > > > > > > get a new crash report while one has already been dispatched but the
> > > > > > > > > > current code accounts for this scenario and as such the possibility
> > > > > > > > > > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > > > > > > > > > to be called multiple times before a single instance of
> > > > > > > > > > rproc_boot_recovery() is scheduled.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > > > > > > > > > >        rproc_crash_handler_work
> > > > > > > > > > >         => rproc->state = RPROC_OFFLINE
> > > > > > > > > > >         => we enter in the "condition" and the pm_relax has to be called
> > > > > > > > > > >         => This commit fix should solve this use case
> > > > > > > > > > > 
> > > > > > > > > > > 3) rproc_detach executed between the queue of the crash work and the call of
> > > > > > > > > > >        rproc_crash_handler_work
> > > > > > > > > > >         => rproc->state = RPROC_DETACHED;
> > > > > > > > > > >         => we don't go in "the condition" and issue because the recovery reattach
> > > > > > > > > > >            to the remote processor
> > > > > > > > > > >         => but pm_relax is called
> > > > > > > > > > >         => probably need an extra fix to avoid to re-attach
> > > > > > > > > > > 
> > > > > > > > > > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > > > > > > > > > >         this one corresponds to an auto reboot of the remote processor, with a
> > > > > > > > > > >         new crash
> > > > > > > > > > >         => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > > > > > > > > > >         4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > > > > > > > > > >              => should call pm_relax if rproc->recovery_disabled = true
> > > > > > > > > > >              => commit does not work for this use case
> > > > > > > > > > > 
> > > > > > > > > > >         4)b) rproc->state = RPROC_DETACHED if recovery fails
> > > > > > > > > > >             => error case with an unstable state
> > > > > > > > > > >             => how to differentiate it from the use case 3) ?
> > > > > > > > > > >             => introduce a RPROC_RECOVERY_FAIL state?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > The case where a recovery fails needs to be considered and is the
> > > > > > > > > > reason the original patch doesn't work.  Right now in
> > > > > > > > > > rproc_crash_handler_work(), it is not possible to differentiate
> > > > > > > > > > between a legitimate shutdown request (scenario #2 above) and a
> > > > > > > > > > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > > > > > > > > > would greatly simplify things.
> > > > > > > > > > 
> > > > > > > > > > My initial evaluation had not considered the attach/detach scenarios -
> > > > > > > > > > thanks for adding that in the mix.
> > > > > > > > > > 
> > > > > > > > > > Aiqun, please send a new patchset that adds a new remote processor
> > > > > > > > > > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > > > > > > > > > that set that takes attach/detach scenarios into account.  The code
> > > > > > > > > > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > > > > > > > > > sure to properly rebase.
> > > > > > > > > > 
> > > > > > > > > I will try.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Then pm_stay_awake is called when the crash work is queued.
> > > > > > > > > > > It seems to me coherent to call the pm_relax in the work handler.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Here is a quick and dirty patch (not tested) that should take into account the
> > > > > > > > > > > main use cases ( except 1) and 4)b) )
> > > > > > > > > > > 
> > > > > > > > > > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > > > > > > > > > 
> > > > > > > > > > >              mutex_lock(&rproc->lock);
> > > > > > > > > > > 
> > > > > > > > > > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > > > > > > > > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > > > > > > > > > +           rproc->state == RPROC_DETACHED) {
> > > > > > > > > > >                      /* handle only the first crash detected */
> > > > > > > > > > > +
> > > > > > > > > > > +               /*
> > > > > > > > > > > +                * call pm-relax in following use cases:
> > > > > > > > > > > +                * - the remote processor has been stopped by the user
> > > > > > > > > > > +                * - the remote processor is detached
> > > > > > > > > > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > > > > > > > > > +                */
> > > > > > > > > > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > > > > > > > > > +                       pm_relax(rproc->dev.parent);
> > > > > > > > > > >                      mutex_unlock(&rproc->lock);
> > > > > > > > > > >                      return;
> > > > > > > > > > >              }
> > > > > > > > > > > 
> > > > > > > > > > > Regards,
> > > > > > > > > > > Arnaud
> > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > Mathieu
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >          drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > > > > > > > > > >          1 file changed, 11 insertions(+)
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > > > > > > > > > work_struct *work)
> > > > > > > > > > > > > > > > > > >                 if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > > > > > > > > > >                         /* handle only the first crash detected */
> > > > > > > > > > > > > > > > > > >                         mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > > > > > +         /*
> > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > > > > > > > > > +          * process.
> > > > > > > > > > > > > > > > > > > +          */
> > > > > > > > > > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > > > > > > > > > >                         return;
> > > > > > > > > > > > > > > > > > >                 }
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > 2.7.4
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > Thx and BRs,
> > > > > > > > > Aiqun(Maria) Yu
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Thx and BRs,
> > > > > > > Aiqun(Maria) Yu
> > > > > 
> > > > > 
> > > > > --
> > > > > Thx and BRs,
> > > > > Aiqun(Maria) Yu
> > > 
> > > 
> > > -- 
> > > Thx and BRs,
> > > Aiqun(Maria) Yu
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 21, 2022, 1:43 a.m. UTC | #22
On 11/19/2022 2:52 AM, Mathieu Poirier wrote:
> On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
>> Hi,
>>
>> On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
>>> On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
>>>> On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
>>>>> I had a couple of good discussions with our power management expert
>>>>> and even then, the way forward isn't as clear as I would have liked.
>>>>> I am currently travelling and as such don't have the required time to
>>>>> go into greater details, something I will be doing next week.
>>>>>
>>>> Thx Mathieu for the info updated.
>>>> I'll wait for your update next week then.
>>>> let me know any initial questions that you have, perhaps I can also discuss
>>>> that with our power team at the same time.
>>>
>>> The problem is to determine exactly what the WQ_FREEZABLE flag does to the
>>> rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
>>> drained before the system is suspended.  What I understand from this is that if
>>> two work items are queued and one is executing at the time a system suspend is
>>> initiated, all three items will be executed before the system is allowed to be
>>> suspended.  _If_ that is the case, there would not be a need to call
>>> pm_stay_awake() and pm_relax() at all.
>>>
>>> On the other hand, the PM resource I spoke to thought that in reality things
>>> don't happen that way.  Taking the same above scenario where 2 work items are
>>> queued and one is executing at the time of the suspend, only the work item that
>>> is executing will be allowed to execute to completion before the system is
>>> suspended.  The remaining two items that were queued will not execute.
>>>
>>> If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
>>> find another strategy to fix this situation.
>>>
>>> Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
>>> to move forward with this patchset.  Unfortunately, I currently do not have the
>>> time to look into this.
>> I had a check on the WQ_FREEZABLE flag, here is my understanding:
>>
>> when the interrupt happened, it still need pm_stay_awake to make sure
>> queue_work action can active the work instead of susepend the device.
>>
>> 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
>> work items is set to 0.
>>
>> [1].
>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
>>
>> 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
>> freeze_workqueues_busy.
>> [2].
>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
>>
>> 3. When in queue_work, if max_active is 0, when do queue_work it will not
>> actually active the work.
>> [3].
>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
>>
>> for the current issue, the work is already complete and forget to set
>> pm_relax in some condition that make the system cannot be suspended.
> 
> What you have above does not describe what happens to work already queue to a
> WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
> important thing.
> 
> I ended up doing that investigation myself and found the documentation on
> WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
> power management team and will get back to you when we reach a conclusion.
> 
Glad that you have it clear.
And wait for your update on this.

here is my understandings maybe you can reference when to discuss with 
power teams:

1. pm_stay_awake in interrupt handler is needed.
when there is a subsystem crash interrupt, the remote proc driver want 
to keep the system awake until the crash is handled with subysystem 
state transfered to shutdown/restarted.

2. as long as the subsystem state is changed to RPROC_CRASHED or 
RPROC_OFFLINE state, it will be safe to do pm_relax and release the 
driver's wakelock at that point.

because 2 conditions can be met:
a. inactive work exist:
if there is an inactive work triggered by other crash interrupt, even if 
it become active again, it will bail out because of the subsystem state 
and not try to do any actual recovery work.
So no wakelock needed to be hold at this condition.

And if the subsystem resume again, and the inactive work become active 
and do the pm_relax, it will be fine if the device resumed with the 
state, the extra pm_relax won't make any difference. Or even if there is 
another crash handler happened at that time, the inactive work will try 
to do the recovery work directly.

b. active work exist:
WQ_FREEZABLE flag will ensure the active work being finished before 
going to suspend.


3. pm_relax can be called multi times. even if there is no pm_stay_awake 
called before this.


Feel free to comment if there is any understanding not correct.

>>
>>>
>>> If you want to take on this investigation, keep in mind that any conclusion will
>>> need to be backed by a proof.  That can be debug messages on a console output or
>>> a code reference in the workqueue core.
>>>
>>> [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
>>>
>>>
>>>>> On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
>>>>>>> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Let me think about this carefully.
>>>>>>>>>>
>>>>>>>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>>>>>>>> or just leave the pm_relax?
>>>>>>>>>
>>>>>>>>> Neither.
>>>>>>>>>
>>>>>>>>> When a recovery fail we don't want to call pm_relax().  The code in
>>>>>>>>> rproc_crash_handler_work() becomes:
>>>>>>>>>
>>>>>>>>>        if (rproc->state == RPROC_OFFLINE) {
>>>>>>>>>                      /* We have raced with rproc_shutdown() */
>>>>>>>>>                      pm_relax()
>>>>>>>>>                mutex_unlock(&rproc->lock);
>>>>>>>>>                return;
>>>>>>>>>              }
>>>>>>>>>
>>>>>>>>>        if (rproc->state == RPROC_CRASHED ||
>>>>>>>>>                  rproc->state == RPROC_RECOVERY_FAILED) {
>>>>>>>>>                /* handle only the first crash detected */
>>>>>>>>>                mutex_unlock(&rproc->lock);
>>>>>>>>>                return;
>>>>>>>>>        }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>>>>>>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>>>>>>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>>>>>>>> the condition becomes:
>>>>>>>>>
>>>>>>>>>              /* State could have changed before we got the mutex */
>>>>>>>>>        if (rproc->state != RPROC_CRASHED &&
>>>>>>>>>                  rproc->state != RPROC_RECOVERY_FAILED)
>>>>>>>>>                goto unlock_mutex;
>>>>>>>>>
>>>>>>>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>>>>>>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>>>>>>>> above.
>>>>>>>>
>>>>>>>> If we didn't deal with the recovery failed case with correct pm_relax call,
>>>>>>>> it may left the device in a state that cannot enter to suspend state.
>>>>>>>
>>>>>>> That is what I am looking for.  We don't want to give the impression that
>>>>>>> everything is fine by allowing the device to suspend.  If the remote processor
>>>>>>> can't be recovered than it needs to be dealth with.
>>>>>> For the normal recovery failed case, it still need to do pm_relax to not
>>>>>> prevent the device goes to suspend. It is what in normal recovery failed
>>>>>> case we do in rproc_crash_handler_work as well.
>>>>>> rproc_crash_handler_work will not check the result of the
>>>>>> rproc_trigger_recovery return value, and will always do pm_relax.
>>>>>>
>>>>>> For current conconrency cornor case as well, it is better to consistant
>>>>>> with the current design of recovery fail senarios in normal cases.
>>>>>>
>>>>>> I personally agree that we shouldn't do nothing when it is a
>>>>>> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
>>>>>> check, because it maybe crash happened when it is trying to do the recovery.
>>>>>> So I suggested to do a continue try of trigger recovery again instead of
>>>>>> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>>>>>>
>>>>>>>
>>>>>>>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>>>>>>>> before the second crash handler call pm_stay_awake or not.
>>>>>>>>
>>>>>>>
>>>>>>> I've been thinking about that part.  I don't think adding a wake_count to
>>>>>>> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
>>>>>>> similar count happening in the PM runtime subsystem and that is what we should
>>>>>>> be using.  I have asked a power management expert at Linaro for guidance with
>>>>>>> this matter.  I should be able to get back to you with a way forward by the end
>>>>>>> of next week.
>>>>>>>
>>>>>> Thx for the specific date provided as well. I will wait until your reply
>>>>>> for next patchset then.
>>>>>>
>>>>>>>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>>>>>>>
>>>>>>>> struct rproc{
>>>>>>>> ...
>>>>>>>> atomic_t wake_count;
>>>>>>>> ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> rproc_pm_stay_awake()
>>>>>>>> {
>>>>>>>>         atomic_inc(&wake_count);
>>>>>>>>         pm_stay_awake();
>>>>>>>> }
>>>>>>>>
>>>>>>>> rproc_pm_relax()
>>>>>>>> {
>>>>>>>>         if (atomic_dec_return(&wake_count) == 0)
>>>>>>>>                 pm_stay_awake();
>>>>>>>> }
>>>>>>>>
>>>>>>>> can refer code like:
>>>>>>>>
>>>>>>>> rproc_report_crash()
>>>>>>>> {
>>>>>>>>         ...
>>>>>>>>         rproc_pm_stay_awake();
>>>>>>>>         queue_work();
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> rproc_crash_handler_work()
>>>>>>>> {
>>>>>>>>         ...
>>>>>>>>         if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>>>>>>>                       /* We have raced with rproc_shutdown() */
>>>>>>>>                       rproc_pm_relax();
>>>>>>>>                 mutex_unlock(&rproc->lock);
>>>>>>>>                 return;
>>>>>>>>               }
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mathieu
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> recovery fail case 1:
>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>>>>>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>>>>>>>> |          pm_stay_awake()             |          queue_work()
>>>>>>>>>> |          queue_work()                |rproc_crash_handler_work()
>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>> |                                      |
>>>>>>>>>> |                                      |
>>>>>>>>>> |                                      |
>>>>>>>>>>
>>>>>>>>>> recovery fail case 2:
>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>> |                                      | rproc_report_crash()
>>>>>>>>>> |                                      |          pm_stay_awake()
>>>>>>>>>> |                                      |          queue_work()
>>>>>>>>>> |                                      |rproc_crash_handler_work()
>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>> |                                      |pm_relax()
>>>>>>>>>> |
>>>>>>>>>> | second crashed interrupt issued      |
>>>>>>>>>> | rproc_report_crash()                 |
>>>>>>>>>> |          pm_stay_awake()             |
>>>>>>>>>> |          queue_work()                |
>>>>>>>>>> |pm_stay_awake()
>>>>>>>>>> |mutex_lock(&rproc->lock);
>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>> |                                      |
>>>>>>>>>> |                                      |
>>>>>>>>>> |                                      |
>>>>>>>>>>
>>>>>>>>>> Maybe I can have:
>>>>>>>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>>>>>>>> fix current concurency issue.
>>>>>>>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>>>>>>>> work.
>>>>>>>>>> 3. handle RPROC_DETACHED case.
>>>>>>>>>>
>>>>>>>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>>>>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>>>>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>>>>>>>> do that
>>>>>>>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>>>>>>>> returning,
>>>>>>>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>>>>>>>> is set
>>>>>>>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>>>>>>>> finished.
>>>>>>>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>>>>>>>> recovering
>>>>>>>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>>>>>>>                 pm_stay_awake()
>>>>>>>>>>>>>>>>>                 queue_work() // current thread is suspended
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>>>>>>>                 rproc_stop()
>>>>>>>>>>>>>>>>>                         rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>>>>>>>                 if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>                         return // pm_relax() is not called
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>>>>>>>> move
>>>>>>>>>>>>>>>> forward if that is the case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>>>>>>>> complex as this one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mathiew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>>>>>>>> the current senario?
>>>>>>>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>>>>>>>
>>>>>>>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>>>>>>>> | rproc_report_crash()                    |
>>>>>>>>>>>>> |          pm_stay_awake()                |
>>>>>>>>>>>>> |          queue_work()                   |
>>>>>>>>>>>>> |                                         |rproc_shutdown
>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>> |                                         |rproc_stop()
>>>>>>>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>> |                                         |rproc_start()
>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for looking into this.
>>>>>>>>>>>
>>>>>>>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>>>>>>>> an other corner cases.
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>>>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>>>>>>>
>>>>>>>>>>>> I can see 4 use cases with race condition
>>>>>>>>>>>>
>>>>>>>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>>>>>>>            => not a real use case as if the remote processor is crashed we
>>>>>>>>>>>>                     should not have a second crash report
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>>>>>>>> get a new crash report while one has already been dispatched but the
>>>>>>>>>>> current code accounts for this scenario and as such the possibility
>>>>>>>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>>>>>>>> to be called multiple times before a single instance of
>>>>>>>>>>> rproc_boot_recovery() is scheduled.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>>>>>>>         rproc_crash_handler_work
>>>>>>>>>>>>          => rproc->state = RPROC_OFFLINE
>>>>>>>>>>>>          => we enter in the "condition" and the pm_relax has to be called
>>>>>>>>>>>>          => This commit fix should solve this use case
>>>>>>>>>>>>
>>>>>>>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>>>>>>>         rproc_crash_handler_work
>>>>>>>>>>>>          => rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>          => we don't go in "the condition" and issue because the recovery reattach
>>>>>>>>>>>>             to the remote processor
>>>>>>>>>>>>          => but pm_relax is called
>>>>>>>>>>>>          => probably need an extra fix to avoid to re-attach
>>>>>>>>>>>>
>>>>>>>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>>>>>>>          this one corresponds to an auto reboot of the remote processor, with a
>>>>>>>>>>>>          new crash
>>>>>>>>>>>>          => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>          4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>>>>>>>               => should call pm_relax if rproc->recovery_disabled = true
>>>>>>>>>>>>               => commit does not work for this use case
>>>>>>>>>>>>
>>>>>>>>>>>>          4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>>>>>>>              => error case with an unstable state
>>>>>>>>>>>>              => how to differentiate it from the use case 3) ?
>>>>>>>>>>>>              => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The case where a recovery fails needs to be considered and is the
>>>>>>>>>>> reason the original patch doesn't work.  Right now in
>>>>>>>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>>>>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>>>>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>>>>>>>> would greatly simplify things.
>>>>>>>>>>>
>>>>>>>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>>>>>>>> thanks for adding that in the mix.
>>>>>>>>>>>
>>>>>>>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>>>>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>>>>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>>>>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>>>>>>>> sure to properly rebase.
>>>>>>>>>>>
>>>>>>>>>> I will try.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>>>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>>>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>>>>>>
>>>>>>>>>>>>               mutex_lock(&rproc->lock);
>>>>>>>>>>>>
>>>>>>>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>>>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>>>>>>>                       /* handle only the first crash detected */
>>>>>>>>>>>> +
>>>>>>>>>>>> +               /*
>>>>>>>>>>>> +                * call pm-relax in following use cases:
>>>>>>>>>>>> +                * - the remote processor has been stopped by the user
>>>>>>>>>>>> +                * - the remote processor is detached
>>>>>>>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>>>>>>>> +                */
>>>>>>>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>>>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>>>>>>>                       mutex_unlock(&rproc->lock);
>>>>>>>>>>>>                       return;
>>>>>>>>>>>>               }
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Arnaud
>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>           drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>>>>>>>           1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>                  if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>>>>>>>                          /* handle only the first crash detected */
>>>>>>>>>>>>>>>>>>>>                          mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>>>>>>                          return;
>>>>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thx and BRs,
>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thx and BRs,
>>>>>>>> Aiqun(Maria) Yu
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thx and BRs,
>>>>>> Aiqun(Maria) Yu
>>>>
>>>>
>>>> -- 
>>>> Thx and BRs,
>>>> Aiqun(Maria) Yu
>>
>>
>> -- 
>> Thx and BRs,
>> Aiqun(Maria) Yu
Mathieu Poirier Nov. 25, 2022, 6:37 p.m. UTC | #23
On Fri, Nov 18, 2022 at 11:52:46AM -0700, Mathieu Poirier wrote:
> On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
> > Hi,
> > 
> > On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
> > > On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
> > > > On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
> > > > > I had a couple of good discussions with our power management expert
> > > > > and even then, the way forward isn't as clear as I would have liked.
> > > > > I am currently travelling and as such don't have the required time to
> > > > > go into greater details, something I will be doing next week.
> > > > > 
> > > > Thx Mathieu for the info updated.
> > > > I'll wait for your update next week then.
> > > > let me know any initial questions that you have, perhaps I can also discuss
> > > > that with our power team at the same time.
> > > 
> > > The problem is to determine exactly what the WQ_FREEZABLE flag does to the
> > > rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
> > > drained before the system is suspended.  What I understand from this is that if
> > > two work items are queued and one is executing at the time a system suspend is
> > > initiated, all three items will be executed before the system is allowed to be
> > > suspended.  _If_ that is the case, there would not be a need to call
> > > pm_stay_awake() and pm_relax() at all.
> > > 
> > > On the other hand, the PM resource I spoke to thought that in reality things
> > > don't happen that way.  Taking the same above scenario where 2 work items are
> > > queued and one is executing at the time of the suspend, only the work item that
> > > is executing will be allowed to execute to completion before the system is
> > > suspended.  The remaining two items that were queued will not execute.
> > > 
> > > If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
> > > find another strategy to fix this situation.
> > > 
> > > Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
> > > to move forward with this patchset.  Unfortunately, I currently do not have the
> > > time to look into this.
> > I had a check on the WQ_FREEZABLE flag, here is my understanding:
> > 
> > when the interrupt happened, it still need pm_stay_awake to make sure
> > queue_work action can active the work instead of susepend the device.
> > 
> > 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
> > work items is set to 0.
> > 
> > [1].
> > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
> > 
> > 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
> > freeze_workqueues_busy.
> > [2].
> > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
> > 
> > 3. When in queue_work, if max_active is 0, when do queue_work it will not
> > actually active the work.
> > [3].
> > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
> > 
> > for the current issue, the work is already complete and forget to set
> > pm_relax in some condition that make the system cannot be suspended.
> 
> What you have above does not describe what happens to work already queue to a
> WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
> important thing.  
> 
> I ended up doing that investigation myself and found the documentation on
> WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
> power management team and will get back to you when we reach a conclusion.
>

I finally have time to get back to this problem...

As I reported above I went down to the bottom of how WQ_FREEZABLE workqueues
work.  The conclusion of that research is that _any_ work, queued or currently
executing, will finish before the system is suspended.  As such, as soon as
queue_work() returns in rproc_report_crash(), you can be guaranteed what was
just queued will be executed.  Therefore, calling pm_stay_awake() in
rproc_report_crash() and pm_relax() in rproc_crash_handler_work() should not be
needed because the system will not suspend for as long as there is work to be
done by the freezable worqueues.

I don't have details on what Rishabh - the author of the patch that added the
calls - was facing but based on the my understanding of freezable workqueues, it
seem to have been the wrong solution.

Until we have a clear vision on what Rishabh was trying to fix, nothing in that
area will be changing.  I suggest you touch base with him and sort it out.  If
that is not possible then calls to pm_stay_awake() and pm_relax() should be
removed.

> > 
> > > 
> > > If you want to take on this investigation, keep in mind that any conclusion will
> > > need to be backed by a proof.  That can be debug messages on a console output or
> > > a code reference in the workqueue core.
> > > 
> > > [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
> > > 
> > > 
> > > > > On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > > > > > > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > Let me think about this carefully.
> > > > > > > > > > 
> > > > > > > > > > When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> > > > > > > > > > or just leave the pm_relax?
> > > > > > > > > 
> > > > > > > > > Neither.
> > > > > > > > > 
> > > > > > > > > When a recovery fail we don't want to call pm_relax().  The code in
> > > > > > > > > rproc_crash_handler_work() becomes:
> > > > > > > > > 
> > > > > > > > >       if (rproc->state == RPROC_OFFLINE) {
> > > > > > > > >                     /* We have raced with rproc_shutdown() */
> > > > > > > > >                     pm_relax()
> > > > > > > > >               mutex_unlock(&rproc->lock);
> > > > > > > > >               return;
> > > > > > > > >             }
> > > > > > > > > 
> > > > > > > > >       if (rproc->state == RPROC_CRASHED ||
> > > > > > > > >                 rproc->state == RPROC_RECOVERY_FAILED) {
> > > > > > > > >               /* handle only the first crash detected */
> > > > > > > > >               mutex_unlock(&rproc->lock);
> > > > > > > > >               return;
> > > > > > > > >       }
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> > > > > > > > > rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> > > > > > > > > recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> > > > > > > > > the condition becomes:
> > > > > > > > > 
> > > > > > > > >             /* State could have changed before we got the mutex */
> > > > > > > > >       if (rproc->state != RPROC_CRASHED &&
> > > > > > > > >                 rproc->state != RPROC_RECOVERY_FAILED)
> > > > > > > > >               goto unlock_mutex;
> > > > > > > > > 
> > > > > > > > > Start with that and we can look at corner cases (if some exists) with a fresh
> > > > > > > > > patchset.  Note that I have not addressed the attach/detach() scenario in the
> > > > > > > > > above.
> > > > > > > > 
> > > > > > > > If we didn't deal with the recovery failed case with correct pm_relax call,
> > > > > > > > it may left the device in a state that cannot enter to suspend state.
> > > > > > > 
> > > > > > > That is what I am looking for.  We don't want to give the impression that
> > > > > > > everything is fine by allowing the device to suspend.  If the remote processor
> > > > > > > can't be recovered than it needs to be dealth with.
> > > > > > For the normal recovery failed case, it still need to do pm_relax to not
> > > > > > prevent the device goes to suspend. It is what in normal recovery failed
> > > > > > case we do in rproc_crash_handler_work as well.
> > > > > > rproc_crash_handler_work will not check the result of the
> > > > > > rproc_trigger_recovery return value, and will always do pm_relax.
> > > > > > 
> > > > > > For current conconrency cornor case as well, it is better to consistant
> > > > > > with the current design of recovery fail senarios in normal cases.
> > > > > > 
> > > > > > I personally agree that we shouldn't do nothing when it is a
> > > > > > RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
> > > > > > check, because it maybe crash happened when it is trying to do the recovery.
> > > > > > So I suggested to do a continue try of trigger recovery again instead of
> > > > > > doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
> > > > > > 
> > > > > > > 
> > > > > > > > Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> > > > > > > > before the second crash handler call pm_stay_awake or not.
> > > > > > > > 
> > > > > > > 
> > > > > > > I've been thinking about that part.  I don't think adding a wake_count to
> > > > > > > control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> > > > > > > similar count happening in the PM runtime subsystem and that is what we should
> > > > > > > be using.  I have asked a power management expert at Linaro for guidance with
> > > > > > > this matter.  I should be able to get back to you with a way forward by the end
> > > > > > > of next week.
> > > > > > > 
> > > > > > Thx for the specific date provided as well. I will wait until your reply
> > > > > > for next patchset then.
> > > > > > 
> > > > > > > > So, What about the atomic count along with pm_relax and pm_stay_awake ?
> > > > > > > > 
> > > > > > > > struct rproc{
> > > > > > > > ...
> > > > > > > > atomic_t wake_count;
> > > > > > > > ...
> > > > > > > > }
> > > > > > > > 
> > > > > > > > rproc_pm_stay_awake()
> > > > > > > > {
> > > > > > > >        atomic_inc(&wake_count);
> > > > > > > >        pm_stay_awake();
> > > > > > > > }
> > > > > > > > 
> > > > > > > > rproc_pm_relax()
> > > > > > > > {
> > > > > > > >        if (atomic_dec_return(&wake_count) == 0)
> > > > > > > >                pm_stay_awake();
> > > > > > > > }
> > > > > > > > 
> > > > > > > > can refer code like:
> > > > > > > > 
> > > > > > > > rproc_report_crash()
> > > > > > > > {
> > > > > > > >        ...
> > > > > > > >        rproc_pm_stay_awake();
> > > > > > > >        queue_work();
> > > > > > > >        ...
> > > > > > > > }
> > > > > > > > 
> > > > > > > > rproc_crash_handler_work()
> > > > > > > > {
> > > > > > > >        ...
> > > > > > > >        if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> > > > > > > >                      /* We have raced with rproc_shutdown() */
> > > > > > > >                      rproc_pm_relax();
> > > > > > > >                mutex_unlock(&rproc->lock);
> > > > > > > >                return;
> > > > > > > >              }
> > > > > > > >        ...
> > > > > > > > }
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Mathieu
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > recovery fail case 1:
> > > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > > | second crashed interrupt issued      | rproc_report_crash()
> > > > > > > > > > | rproc_report_crash()                 |          pm_stay_awake()
> > > > > > > > > > |          pm_stay_awake()             |          queue_work()
> > > > > > > > > > |          queue_work()                |rproc_crash_handler_work()
> > > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > > |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > > |mutex_lock(&rproc->lock);             |pm_relax()
> > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > > |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > > |                                      |
> > > > > > > > > > |                                      |
> > > > > > > > > > |                                      |
> > > > > > > > > > 
> > > > > > > > > > recovery fail case 2:
> > > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > > |                                      | rproc_report_crash()
> > > > > > > > > > |                                      |          pm_stay_awake()
> > > > > > > > > > |                                      |          queue_work()
> > > > > > > > > > |                                      |rproc_crash_handler_work()
> > > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > > |                                      |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > > |                                      |pm_relax()
> > > > > > > > > > |
> > > > > > > > > > | second crashed interrupt issued      |
> > > > > > > > > > | rproc_report_crash()                 |
> > > > > > > > > > |          pm_stay_awake()             |
> > > > > > > > > > |          queue_work()                |
> > > > > > > > > > |pm_stay_awake()
> > > > > > > > > > |mutex_lock(&rproc->lock);
> > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > > |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > > |                                      |
> > > > > > > > > > |                                      |
> > > > > > > > > > |                                      |
> > > > > > > > > > 
> > > > > > > > > > Maybe I can have:
> > > > > > > > > > 1. the pm_stay_awake and pm_relax with count based and call with paired for
> > > > > > > > > > fix current concurency issue.
> > > > > > > > > > 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> > > > > > > > > > work.
> > > > > > > > > > 3. handle RPROC_DETACHED case.
> > > > > > > > > > 
> > > > > > > > > > On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > > > > > > > > > > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > > > > > > > > > > <arnaud.pouliquen@foss.st.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > > > > > > Hi Mathieu,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > > > > > > > > > > do that
> > > > > > > > > > > > > > > > > > > on the mailing list.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > > > > went wrong.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > > > > > > > > > > returning,
> > > > > > > > > > > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > > > > > > > > > > is set
> > > > > > > > > > > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > > > > > > > > > > things along.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > > > > > > > > > > finished.
> > > > > > > > > > > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > > > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > > > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > > > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > > > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > > > > > > > > > > recovering
> > > > > > > > > > > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > > > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > > > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > rproc_report_crash()
> > > > > > > > > > > > > > > > >                pm_stay_awake()
> > > > > > > > > > > > > > > > >                queue_work() // current thread is suspended
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > rproc_shutdown()
> > > > > > > > > > > > > > > > >                rproc_stop()
> > > > > > > > > > > > > > > > >                        rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > > > > > > > > >                if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > >                        return // pm_relax() is not called
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > > > > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > > > > > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > > forward if that is the case.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > > > > > > > > > > the same time shutdown is triggered.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Unfortunately this is not enough details to address a problem as
> > > > > > > > > > > > > > complex as this one.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > And the device cannto goes to suspend state after that.
> > > > > > > > > > > > > > > the subsystem can still be start normally after this.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > > > > > > > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > > > > > > > > > > confirms my suspicions we will have to think about a better solution.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Mathiew,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Could you pls have more details of any side effects other then power issue of
> > > > > > > > > > > > > the current senario?
> > > > > > > > > > > > > Why the current patch is not sufficient pls?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Have the current senario in details with rproc->lock information in details:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > > > > > > > > > > | rproc_report_crash()                    |
> > > > > > > > > > > > > |          pm_stay_awake()                |
> > > > > > > > > > > > > |          queue_work()                   |
> > > > > > > > > > > > > |                                         |rproc_shutdown
> > > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > > |                                         |rproc_stop()
> > > > > > > > > > > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > |mutex_lock(&rproc->lock);                |
> > > > > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > > > > > > > > > > |return // pm_relax() is not called       |rproc_boot
> > > > > > > > > > > > > |mutex_unlock(&rproc->lock);              |
> > > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > > |                                         |rproc_start()
> > > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Agree with Mathieu, this is not so simple.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Thanks for looking into this.
> > > > > > > > > > > 
> > > > > > > > > > > > Here is my view  hoping I haven't missed a point in your discussion or
> > > > > > > > > > > > an other corner cases.
> > > > > > > > > > > > 
> > > > > > > > > > > > I tried to analyze the issues (in what follows, the term "condition" means
> > > > > > > > > > > > the "if" condition in which Aiqun proposes to add the fix) :
> > > > > > > > > > > > 
> > > > > > > > > > > > I can see 4 use cases with race condition
> > > > > > > > > > > > 
> > > > > > > > > > > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > > > > > > > > > > >           => not a real use case as if the remote processor is crashed we
> > > > > > > > > > > >                    should not have a second crash report
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > That part is of great concern to me.  *Theoretically* we should not
> > > > > > > > > > > get a new crash report while one has already been dispatched but the
> > > > > > > > > > > current code accounts for this scenario and as such the possibility
> > > > > > > > > > > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > > > > > > > > > > to be called multiple times before a single instance of
> > > > > > > > > > > rproc_boot_recovery() is scheduled.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > > > > > > > > > > >        rproc_crash_handler_work
> > > > > > > > > > > >         => rproc->state = RPROC_OFFLINE
> > > > > > > > > > > >         => we enter in the "condition" and the pm_relax has to be called
> > > > > > > > > > > >         => This commit fix should solve this use case
> > > > > > > > > > > > 
> > > > > > > > > > > > 3) rproc_detach executed between the queue of the crash work and the call of
> > > > > > > > > > > >        rproc_crash_handler_work
> > > > > > > > > > > >         => rproc->state = RPROC_DETACHED;
> > > > > > > > > > > >         => we don't go in "the condition" and issue because the recovery reattach
> > > > > > > > > > > >            to the remote processor
> > > > > > > > > > > >         => but pm_relax is called
> > > > > > > > > > > >         => probably need an extra fix to avoid to re-attach
> > > > > > > > > > > > 
> > > > > > > > > > > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > > > > > > > > > > >         this one corresponds to an auto reboot of the remote processor, with a
> > > > > > > > > > > >         new crash
> > > > > > > > > > > >         => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > > > > > > > > > > >         4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > > > > > > > > > > >              => should call pm_relax if rproc->recovery_disabled = true
> > > > > > > > > > > >              => commit does not work for this use case
> > > > > > > > > > > > 
> > > > > > > > > > > >         4)b) rproc->state = RPROC_DETACHED if recovery fails
> > > > > > > > > > > >             => error case with an unstable state
> > > > > > > > > > > >             => how to differentiate it from the use case 3) ?
> > > > > > > > > > > >             => introduce a RPROC_RECOVERY_FAIL state?
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > The case where a recovery fails needs to be considered and is the
> > > > > > > > > > > reason the original patch doesn't work.  Right now in
> > > > > > > > > > > rproc_crash_handler_work(), it is not possible to differentiate
> > > > > > > > > > > between a legitimate shutdown request (scenario #2 above) and a
> > > > > > > > > > > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > > > > > > > > > > would greatly simplify things.
> > > > > > > > > > > 
> > > > > > > > > > > My initial evaluation had not considered the attach/detach scenarios -
> > > > > > > > > > > thanks for adding that in the mix.
> > > > > > > > > > > 
> > > > > > > > > > > Aiqun, please send a new patchset that adds a new remote processor
> > > > > > > > > > > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > > > > > > > > > > that set that takes attach/detach scenarios into account.  The code
> > > > > > > > > > > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > > > > > > > > > > sure to properly rebase.
> > > > > > > > > > > 
> > > > > > > > > > I will try.
> > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Then pm_stay_awake is called when the crash work is queued.
> > > > > > > > > > > > It seems to me coherent to call the pm_relax in the work handler.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Here is a quick and dirty patch (not tested) that should take into account the
> > > > > > > > > > > > main use cases ( except 1) and 4)b) )
> > > > > > > > > > > > 
> > > > > > > > > > > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > > > > > > > > > > 
> > > > > > > > > > > >              mutex_lock(&rproc->lock);
> > > > > > > > > > > > 
> > > > > > > > > > > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > > > > > > > > > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > > > > > > > > > > +           rproc->state == RPROC_DETACHED) {
> > > > > > > > > > > >                      /* handle only the first crash detected */
> > > > > > > > > > > > +
> > > > > > > > > > > > +               /*
> > > > > > > > > > > > +                * call pm-relax in following use cases:
> > > > > > > > > > > > +                * - the remote processor has been stopped by the user
> > > > > > > > > > > > +                * - the remote processor is detached
> > > > > > > > > > > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > > > > > > > > > > +                */
> > > > > > > > > > > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > > > > > > > > > > +                       pm_relax(rproc->dev.parent);
> > > > > > > > > > > >                      mutex_unlock(&rproc->lock);
> > > > > > > > > > > >                      return;
> > > > > > > > > > > >              }
> > > > > > > > > > > > 
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Arnaud
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > Mathieu
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > >          drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > > > > > > > > > > >          1 file changed, 11 insertions(+)
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > > > > > > > > > > work_struct *work)
> > > > > > > > > > > > > > > > > > > >                 if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > > > > > > > > > > >                         /* handle only the first crash detected */
> > > > > > > > > > > > > > > > > > > >                         mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > > > > > > +         /*
> > > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > > > > > > > > > > +          * process.
> > > > > > > > > > > > > > > > > > > > +          */
> > > > > > > > > > > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > > > > > > > > > > >                         return;
> > > > > > > > > > > > > > > > > > > >                 }
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > 2.7.4
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Thx and BRs,
> > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Thx and BRs,
> > > > > > > > Aiqun(Maria) Yu
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Thx and BRs,
> > > > > > Aiqun(Maria) Yu
> > > > 
> > > > 
> > > > -- 
> > > > Thx and BRs,
> > > > Aiqun(Maria) Yu
> > 
> > 
> > -- 
> > Thx and BRs,
> > Aiqun(Maria) Yu
Aiqun Yu (Maria) Nov. 28, 2022, 2:18 a.m. UTC | #24
On 11/26/2022 2:37 AM, Mathieu Poirier wrote:
> On Fri, Nov 18, 2022 at 11:52:46AM -0700, Mathieu Poirier wrote:
>> On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
>>> Hi,
>>>
>>> On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
>>>> On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
>>>>> On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
>>>>>> I had a couple of good discussions with our power management expert
>>>>>> and even then, the way forward isn't as clear as I would have liked.
>>>>>> I am currently travelling and as such don't have the required time to
>>>>>> go into greater details, something I will be doing next week.
>>>>>>
>>>>> Thx Mathieu for the info updated.
>>>>> I'll wait for your update next week then.
>>>>> let me know any initial questions that you have, perhaps I can also discuss
>>>>> that with our power team at the same time.
>>>>
>>>> The problem is to determine exactly what the WQ_FREEZABLE flag does to the
>>>> rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
>>>> drained before the system is suspended.  What I understand from this is that if
>>>> two work items are queued and one is executing at the time a system suspend is
>>>> initiated, all three items will be executed before the system is allowed to be
>>>> suspended.  _If_ that is the case, there would not be a need to call
>>>> pm_stay_awake() and pm_relax() at all.
>>>>
>>>> On the other hand, the PM resource I spoke to thought that in reality things
>>>> don't happen that way.  Taking the same above scenario where 2 work items are
>>>> queued and one is executing at the time of the suspend, only the work item that
>>>> is executing will be allowed to execute to completion before the system is
>>>> suspended.  The remaining two items that were queued will not execute.
>>>>
>>>> If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
>>>> find another strategy to fix this situation.
>>>>
>>>> Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
>>>> to move forward with this patchset.  Unfortunately, I currently do not have the
>>>> time to look into this.
>>> I had a check on the WQ_FREEZABLE flag, here is my understanding:
>>>
>>> when the interrupt happened, it still need pm_stay_awake to make sure
>>> queue_work action can active the work instead of susepend the device.
>>>
>>> 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
>>> work items is set to 0.
>>>
>>> [1].
>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
>>>
>>> 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
>>> freeze_workqueues_busy.
>>> [2].
>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
>>>
>>> 3. When in queue_work, if max_active is 0, when do queue_work it will not
>>> actually active the work.
>>> [3].
>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
>>>
>>> for the current issue, the work is already complete and forget to set
>>> pm_relax in some condition that make the system cannot be suspended.
>>
>> What you have above does not describe what happens to work already queue to a
>> WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
>> important thing.
>>
>> I ended up doing that investigation myself and found the documentation on
>> WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
>> power management team and will get back to you when we reach a conclusion.
>>
> 
> I finally have time to get back to this problem...
> 
> As I reported above I went down to the bottom of how WQ_FREEZABLE workqueues
> work.  The conclusion of that research is that _any_ work, queued or currently
> executing, will finish before the system is suspended.  As such, as soon as
> queue_work() returns in rproc_report_crash(), you can be guaranteed what was
> just queued will be executed.  Therefore, calling pm_stay_awake() in
> rproc_report_crash() and pm_relax() in rproc_crash_handler_work() should not be
> needed because the system will not suspend for as long as there is work to be
> done by the freezable worqueues.
> 
The problem is when crash interrupt happened in a later point(after 
freeze_kernel_thread), the queue_work will result with the state of 
WORK_STRUCT_INACTIVE and insert to inactive_works.
And this is the senario that pm_stay_awake wanted to address inside the 
interrupt handler call.

refer to:
[1] 
https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1512

the current suspend simple flow can be:

         +-+-------------------+
         |   freeze_processes  |
         +----------+----------+
                    |
                    |
        +-----------|-----------+
        | freeze_kernel_thread  |
        +-----------+-----------+
                    |
         +-+--------|----------+
         |   device_suspend    |
         +----------|----------+
                    |
                    |
        +-----------+-----------+
        | pm_wakeup_pending     |
        +-----------|-----------+
                    |
                    |
  +-----------------+-------------------+
  | suspend_ops->enter (machine suspend)|
  +-------------------------------------+


refer to:
suspend_prepare -> suspend_freeze_processes -->freeze_processes and 
freeze_kernel_threads



refer to:

suspend_devices_and_enter -->
dpm_suspend_start -> dpm_suspend --> device_suspend --> device suspend 
callbacks
suspend_enter -->dpm_suspend_late-->device_suspend_late

[2] 
https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/power/suspend.c#L582

> I don't have details on what Rishabh - the author of the patch that added the
> calls - was facing but based on the my understanding of freezable workqueues, it
> seem to have been the wrong solution.
In my opinion, the freezable workqueues are in the right solution. It 
only protect for the system freeze state.
While if we want the crash interrupt to be handled to prevent current 
suspend process, it still need to deal with other suspend state to 
machin suspend senarios.
Typical example can be before disable the device interrupt in device 
suspend callbacks, it will flush the necessary workqueues.

while in our case, since the crash handler will invoke a subsystem 
restart process which reply on user thread uevent firmware loaders, so 
it need to abort the suspend process with pm_stay_awake called.

So in a word, pm_stay_awake is necessary per my understanding.

> 
> Until we have a clear vision on what Rishabh was trying to fix, nothing in that
> area will be changing.  I suggest you touch base with him and sort it out.  If
> that is not possible then calls to pm_stay_awake() and pm_relax() should be
> removed.
> 
>>>
>>>>
>>>> If you want to take on this investigation, keep in mind that any conclusion will
>>>> need to be backed by a proof.  That can be debug messages on a console output or
>>>> a code reference in the workqueue core.
>>>>
>>>> [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
>>>>
>>>>
>>>>>> On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
>>>>>>>> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Let me think about this carefully.
>>>>>>>>>>>
>>>>>>>>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>>>>>>>>> or just leave the pm_relax?
>>>>>>>>>>
>>>>>>>>>> Neither.
>>>>>>>>>>
>>>>>>>>>> When a recovery fail we don't want to call pm_relax().  The code in
>>>>>>>>>> rproc_crash_handler_work() becomes:
>>>>>>>>>>
>>>>>>>>>>        if (rproc->state == RPROC_OFFLINE) {
>>>>>>>>>>                      /* We have raced with rproc_shutdown() */
>>>>>>>>>>                      pm_relax()
>>>>>>>>>>                mutex_unlock(&rproc->lock);
>>>>>>>>>>                return;
>>>>>>>>>>              }
>>>>>>>>>>
>>>>>>>>>>        if (rproc->state == RPROC_CRASHED ||
>>>>>>>>>>                  rproc->state == RPROC_RECOVERY_FAILED) {
>>>>>>>>>>                /* handle only the first crash detected */
>>>>>>>>>>                mutex_unlock(&rproc->lock);
>>>>>>>>>>                return;
>>>>>>>>>>        }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>>>>>>>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>>>>>>>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>>>>>>>>> the condition becomes:
>>>>>>>>>>
>>>>>>>>>>              /* State could have changed before we got the mutex */
>>>>>>>>>>        if (rproc->state != RPROC_CRASHED &&
>>>>>>>>>>                  rproc->state != RPROC_RECOVERY_FAILED)
>>>>>>>>>>                goto unlock_mutex;
>>>>>>>>>>
>>>>>>>>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>>>>>>>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>>>>>>>>> above.
>>>>>>>>>
>>>>>>>>> If we didn't deal with the recovery failed case with correct pm_relax call,
>>>>>>>>> it may left the device in a state that cannot enter to suspend state.
>>>>>>>>
>>>>>>>> That is what I am looking for.  We don't want to give the impression that
>>>>>>>> everything is fine by allowing the device to suspend.  If the remote processor
>>>>>>>> can't be recovered than it needs to be dealth with.
>>>>>>> For the normal recovery failed case, it still need to do pm_relax to not
>>>>>>> prevent the device goes to suspend. It is what in normal recovery failed
>>>>>>> case we do in rproc_crash_handler_work as well.
>>>>>>> rproc_crash_handler_work will not check the result of the
>>>>>>> rproc_trigger_recovery return value, and will always do pm_relax.
>>>>>>>
>>>>>>> For current conconrency cornor case as well, it is better to consistant
>>>>>>> with the current design of recovery fail senarios in normal cases.
>>>>>>>
>>>>>>> I personally agree that we shouldn't do nothing when it is a
>>>>>>> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
>>>>>>> check, because it maybe crash happened when it is trying to do the recovery.
>>>>>>> So I suggested to do a continue try of trigger recovery again instead of
>>>>>>> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>>>>>>>
>>>>>>>>
>>>>>>>>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>>>>>>>>> before the second crash handler call pm_stay_awake or not.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I've been thinking about that part.  I don't think adding a wake_count to
>>>>>>>> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
>>>>>>>> similar count happening in the PM runtime subsystem and that is what we should
>>>>>>>> be using.  I have asked a power management expert at Linaro for guidance with
>>>>>>>> this matter.  I should be able to get back to you with a way forward by the end
>>>>>>>> of next week.
>>>>>>>>
>>>>>>> Thx for the specific date provided as well. I will wait until your reply
>>>>>>> for next patchset then.
>>>>>>>
>>>>>>>>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>>>>>>>>
>>>>>>>>> struct rproc{
>>>>>>>>> ...
>>>>>>>>> atomic_t wake_count;
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> rproc_pm_stay_awake()
>>>>>>>>> {
>>>>>>>>>         atomic_inc(&wake_count);
>>>>>>>>>         pm_stay_awake();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> rproc_pm_relax()
>>>>>>>>> {
>>>>>>>>>         if (atomic_dec_return(&wake_count) == 0)
>>>>>>>>>                 pm_stay_awake();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> can refer code like:
>>>>>>>>>
>>>>>>>>> rproc_report_crash()
>>>>>>>>> {
>>>>>>>>>         ...
>>>>>>>>>         rproc_pm_stay_awake();
>>>>>>>>>         queue_work();
>>>>>>>>>         ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>> {
>>>>>>>>>         ...
>>>>>>>>>         if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>>>>>>>>                       /* We have raced with rproc_shutdown() */
>>>>>>>>>                       rproc_pm_relax();
>>>>>>>>>                 mutex_unlock(&rproc->lock);
>>>>>>>>>                 return;
>>>>>>>>>               }
>>>>>>>>>         ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Mathieu
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> recovery fail case 1:
>>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>>>>>>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>>>>>>>>> |          pm_stay_awake()             |          queue_work()
>>>>>>>>>>> |          queue_work()                |rproc_crash_handler_work()
>>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>>> |                                      |
>>>>>>>>>>> |                                      |
>>>>>>>>>>> |                                      |
>>>>>>>>>>>
>>>>>>>>>>> recovery fail case 2:
>>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>>> |                                      | rproc_report_crash()
>>>>>>>>>>> |                                      |          pm_stay_awake()
>>>>>>>>>>> |                                      |          queue_work()
>>>>>>>>>>> |                                      |rproc_crash_handler_work()
>>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>>> |                                      |pm_relax()
>>>>>>>>>>> |
>>>>>>>>>>> | second crashed interrupt issued      |
>>>>>>>>>>> | rproc_report_crash()                 |
>>>>>>>>>>> |          pm_stay_awake()             |
>>>>>>>>>>> |          queue_work()                |
>>>>>>>>>>> |pm_stay_awake()
>>>>>>>>>>> |mutex_lock(&rproc->lock);
>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>>> |                                      |
>>>>>>>>>>> |                                      |
>>>>>>>>>>> |                                      |
>>>>>>>>>>>
>>>>>>>>>>> Maybe I can have:
>>>>>>>>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>>>>>>>>> fix current concurency issue.
>>>>>>>>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>>>>>>>>> work.
>>>>>>>>>>> 3. handle RPROC_DETACHED case.
>>>>>>>>>>>
>>>>>>>>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>>>>>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>>>>>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>>>>>>>>> do that
>>>>>>>>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>>>>>>>>> returning,
>>>>>>>>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>>>>>>>>> is set
>>>>>>>>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>>>>>>>>> finished.
>>>>>>>>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>>>>>>>>> recovering
>>>>>>>>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>>>>>>>>                 pm_stay_awake()
>>>>>>>>>>>>>>>>>>                 queue_work() // current thread is suspended
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>>>>>>>>                 rproc_stop()
>>>>>>>>>>>>>>>>>>                         rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>>>>>>>>                 if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>>                         return // pm_relax() is not called
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>>>>>>>>> move
>>>>>>>>>>>>>>>>> forward if that is the case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>>>>>>>>> complex as this one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mathiew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>>>>>>>>> the current senario?
>>>>>>>>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>>>>>>>>> | rproc_report_crash()                    |
>>>>>>>>>>>>>> |          pm_stay_awake()                |
>>>>>>>>>>>>>> |          queue_work()                   |
>>>>>>>>>>>>>> |                                         |rproc_shutdown
>>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>>> |                                         |rproc_stop()
>>>>>>>>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>>>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>>>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>>> |                                         |rproc_start()
>>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for looking into this.
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>>>>>>>>> an other corner cases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>>>>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can see 4 use cases with race condition
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>>>>>>>>            => not a real use case as if the remote processor is crashed we
>>>>>>>>>>>>>                     should not have a second crash report
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>>>>>>>>> get a new crash report while one has already been dispatched but the
>>>>>>>>>>>> current code accounts for this scenario and as such the possibility
>>>>>>>>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>>>>>>>>> to be called multiple times before a single instance of
>>>>>>>>>>>> rproc_boot_recovery() is scheduled.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>>>>>>>>         rproc_crash_handler_work
>>>>>>>>>>>>>          => rproc->state = RPROC_OFFLINE
>>>>>>>>>>>>>          => we enter in the "condition" and the pm_relax has to be called
>>>>>>>>>>>>>          => This commit fix should solve this use case
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>>>>>>>>         rproc_crash_handler_work
>>>>>>>>>>>>>          => rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>>          => we don't go in "the condition" and issue because the recovery reattach
>>>>>>>>>>>>>             to the remote processor
>>>>>>>>>>>>>          => but pm_relax is called
>>>>>>>>>>>>>          => probably need an extra fix to avoid to re-attach
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>>>>>>>>          this one corresponds to an auto reboot of the remote processor, with a
>>>>>>>>>>>>>          new crash
>>>>>>>>>>>>>          => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>>          4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>>>>>>>>               => should call pm_relax if rproc->recovery_disabled = true
>>>>>>>>>>>>>               => commit does not work for this use case
>>>>>>>>>>>>>
>>>>>>>>>>>>>          4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>>>>>>>>              => error case with an unstable state
>>>>>>>>>>>>>              => how to differentiate it from the use case 3) ?
>>>>>>>>>>>>>              => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The case where a recovery fails needs to be considered and is the
>>>>>>>>>>>> reason the original patch doesn't work.  Right now in
>>>>>>>>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>>>>>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>>>>>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>>>>>>>>> would greatly simplify things.
>>>>>>>>>>>>
>>>>>>>>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>>>>>>>>> thanks for adding that in the mix.
>>>>>>>>>>>>
>>>>>>>>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>>>>>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>>>>>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>>>>>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>>>>>>>>> sure to properly rebase.
>>>>>>>>>>>>
>>>>>>>>>>> I will try.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>>>>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>>>>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>>>>>>>
>>>>>>>>>>>>>               mutex_lock(&rproc->lock);
>>>>>>>>>>>>>
>>>>>>>>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>>>>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>>>>>>>>                       /* handle only the first crash detected */
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +               /*
>>>>>>>>>>>>> +                * call pm-relax in following use cases:
>>>>>>>>>>>>> +                * - the remote processor has been stopped by the user
>>>>>>>>>>>>> +                * - the remote processor is detached
>>>>>>>>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>>>>>>>>> +                */
>>>>>>>>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>>>>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>                       mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>                       return;
>>>>>>>>>>>>>               }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Arnaud
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>           drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>>>>>>>>           1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>                  if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>>>>>>>>                          /* handle only the first crash detected */
>>>>>>>>>>>>>>>>>>>>>                          mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>>>>>>>                          return;
>>>>>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thx and BRs,
>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thx and BRs,
>>>>>>> Aiqun(Maria) Yu
>>>>>
>>>>>
>>>>> -- 
>>>>> Thx and BRs,
>>>>> Aiqun(Maria) Yu
>>>
>>>
>>> -- 
>>> Thx and BRs,
>>> Aiqun(Maria) Yu
Mathieu Poirier Dec. 1, 2022, 11 p.m. UTC | #25
On Mon, Nov 28, 2022 at 10:18:28AM +0800, Aiqun(Maria) Yu wrote:
> On 11/26/2022 2:37 AM, Mathieu Poirier wrote:
> > On Fri, Nov 18, 2022 at 11:52:46AM -0700, Mathieu Poirier wrote:
> > > On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
> > > > Hi,
> > > > 
> > > > On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
> > > > > On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
> > > > > > > I had a couple of good discussions with our power management expert
> > > > > > > and even then, the way forward isn't as clear as I would have liked.
> > > > > > > I am currently travelling and as such don't have the required time to
> > > > > > > go into greater details, something I will be doing next week.
> > > > > > > 
> > > > > > Thx Mathieu for the info updated.
> > > > > > I'll wait for your update next week then.
> > > > > > let me know any initial questions that you have, perhaps I can also discuss
> > > > > > that with our power team at the same time.
> > > > > 
> > > > > The problem is to determine exactly what the WQ_FREEZABLE flag does to the
> > > > > rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
> > > > > drained before the system is suspended.  What I understand from this is that if
> > > > > two work items are queued and one is executing at the time a system suspend is
> > > > > initiated, all three items will be executed before the system is allowed to be
> > > > > suspended.  _If_ that is the case, there would not be a need to call
> > > > > pm_stay_awake() and pm_relax() at all.
> > > > > 
> > > > > On the other hand, the PM resource I spoke to thought that in reality things
> > > > > don't happen that way.  Taking the same above scenario where 2 work items are
> > > > > queued and one is executing at the time of the suspend, only the work item that
> > > > > is executing will be allowed to execute to completion before the system is
> > > > > suspended.  The remaining two items that were queued will not execute.
> > > > > 
> > > > > If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
> > > > > find another strategy to fix this situation.
> > > > > 
> > > > > Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
> > > > > to move forward with this patchset.  Unfortunately, I currently do not have the
> > > > > time to look into this.
> > > > I had a check on the WQ_FREEZABLE flag, here is my understanding:
> > > > 
> > > > when the interrupt happened, it still need pm_stay_awake to make sure
> > > > queue_work action can active the work instead of susepend the device.
> > > > 
> > > > 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
> > > > work items is set to 0.
> > > > 
> > > > [1].
> > > > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
> > > > 
> > > > 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
> > > > freeze_workqueues_busy.
> > > > [2].
> > > > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
> > > > 
> > > > 3. When in queue_work, if max_active is 0, when do queue_work it will not
> > > > actually active the work.
> > > > [3].
> > > > https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
> > > > 
> > > > for the current issue, the work is already complete and forget to set
> > > > pm_relax in some condition that make the system cannot be suspended.
> > > 
> > > What you have above does not describe what happens to work already queue to a
> > > WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
> > > important thing.
> > > 
> > > I ended up doing that investigation myself and found the documentation on
> > > WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
> > > power management team and will get back to you when we reach a conclusion.
> > > 
> > 
> > I finally have time to get back to this problem...
> > 
> > As I reported above I went down to the bottom of how WQ_FREEZABLE workqueues
> > work.  The conclusion of that research is that _any_ work, queued or currently
> > executing, will finish before the system is suspended.  As such, as soon as
> > queue_work() returns in rproc_report_crash(), you can be guaranteed what was
> > just queued will be executed.  Therefore, calling pm_stay_awake() in
> > rproc_report_crash() and pm_relax() in rproc_crash_handler_work() should not be
> > needed because the system will not suspend for as long as there is work to be
> > done by the freezable worqueues.
> > 
> The problem is when crash interrupt happened in a later point(after
> freeze_kernel_thread), the queue_work will result with the state of
> WORK_STRUCT_INACTIVE and insert to inactive_works.
> And this is the senario that pm_stay_awake wanted to address inside the
> interrupt handler call.
> 
> refer to:
> [1]
> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1512
> 
> the current suspend simple flow can be:
> 
>         +-+-------------------+
>         |   freeze_processes  |
>         +----------+----------+
>                    |
>                    |
>        +-----------|-----------+
>        | freeze_kernel_thread  |
>        +-----------+-----------+
>                    |
>         +-+--------|----------+
>         |   device_suspend    |
>         +----------|----------+
>                    |
>                    |
>        +-----------+-----------+
>        | pm_wakeup_pending     |
>        +-----------|-----------+
>                    |
>                    |
>  +-----------------+-------------------+
>  | suspend_ops->enter (machine suspend)|
>  +-------------------------------------+
> 
> 
> refer to:
> suspend_prepare -> suspend_freeze_processes -->freeze_processes and
> freeze_kernel_threads
> 
> 
> 
> refer to:
> 
> suspend_devices_and_enter -->
> dpm_suspend_start -> dpm_suspend --> device_suspend --> device suspend
> callbacks
> suspend_enter -->dpm_suspend_late-->device_suspend_late
> 
> [2]
> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/power/suspend.c#L582
> 
> > I don't have details on what Rishabh - the author of the patch that added the
> > calls - was facing but based on the my understanding of freezable workqueues, it
> > seem to have been the wrong solution.
> In my opinion, the freezable workqueues are in the right solution. It only
> protect for the system freeze state.
> While if we want the crash interrupt to be handled to prevent current
> suspend process, it still need to deal with other suspend state to machin
> suspend senarios.
> Typical example can be before disable the device interrupt in device suspend
> callbacks, it will flush the necessary workqueues.
> 
> while in our case, since the crash handler will invoke a subsystem restart
> process which reply on user thread uevent firmware loaders, so it need to
> abort the suspend process with pm_stay_awake called.
> 
> So in a word, pm_stay_awake is necessary per my understanding.
> 

I see your point - thanks for being perseverant.  What is important to
understand is that pm_stay_awake() can stop and reverse the suspend process that
is currently underway.

With the code you had for V4 in mind, I suggest to modify
rproc_crash_handler_work() to be:

...
...
...

if (rproc->state == RPROC_CRASHED) {
        /* handle only the first crash detected */
        mutex_unlock(&rproc->lock);
        return;
}

if (rproc->state == RPROC_OFFLINE) {
        /* The remote processor is offline, no need to recover anything */
        mutex_unlock(&rproc->lock);
        goto out;
}

...
...
...

out: 
        pm_relax();

Do you think that will work for you?


> > 
> > Until we have a clear vision on what Rishabh was trying to fix, nothing in that
> > area will be changing.  I suggest you touch base with him and sort it out.  If
> > that is not possible then calls to pm_stay_awake() and pm_relax() should be
> > removed.
> > 
> > > > 
> > > > > 
> > > > > If you want to take on this investigation, keep in mind that any conclusion will
> > > > > need to be backed by a proof.  That can be debug messages on a console output or
> > > > > a code reference in the workqueue core.
> > > > > 
> > > > > [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
> > > > > 
> > > > > 
> > > > > > > On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > > > > > > > > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > > > On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > Let me think about this carefully.
> > > > > > > > > > > > 
> > > > > > > > > > > > When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
> > > > > > > > > > > > or just leave the pm_relax?
> > > > > > > > > > > 
> > > > > > > > > > > Neither.
> > > > > > > > > > > 
> > > > > > > > > > > When a recovery fail we don't want to call pm_relax().  The code in
> > > > > > > > > > > rproc_crash_handler_work() becomes:
> > > > > > > > > > > 
> > > > > > > > > > >        if (rproc->state == RPROC_OFFLINE) {
> > > > > > > > > > >                      /* We have raced with rproc_shutdown() */
> > > > > > > > > > >                      pm_relax()
> > > > > > > > > > >                mutex_unlock(&rproc->lock);
> > > > > > > > > > >                return;
> > > > > > > > > > >              }
> > > > > > > > > > > 
> > > > > > > > > > >        if (rproc->state == RPROC_CRASHED ||
> > > > > > > > > > >                  rproc->state == RPROC_RECOVERY_FAILED) {
> > > > > > > > > > >                /* handle only the first crash detected */
> > > > > > > > > > >                mutex_unlock(&rproc->lock);
> > > > > > > > > > >                return;
> > > > > > > > > > >        }
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
> > > > > > > > > > > rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
> > > > > > > > > > > recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
> > > > > > > > > > > the condition becomes:
> > > > > > > > > > > 
> > > > > > > > > > >              /* State could have changed before we got the mutex */
> > > > > > > > > > >        if (rproc->state != RPROC_CRASHED &&
> > > > > > > > > > >                  rproc->state != RPROC_RECOVERY_FAILED)
> > > > > > > > > > >                goto unlock_mutex;
> > > > > > > > > > > 
> > > > > > > > > > > Start with that and we can look at corner cases (if some exists) with a fresh
> > > > > > > > > > > patchset.  Note that I have not addressed the attach/detach() scenario in the
> > > > > > > > > > > above.
> > > > > > > > > > 
> > > > > > > > > > If we didn't deal with the recovery failed case with correct pm_relax call,
> > > > > > > > > > it may left the device in a state that cannot enter to suspend state.
> > > > > > > > > 
> > > > > > > > > That is what I am looking for.  We don't want to give the impression that
> > > > > > > > > everything is fine by allowing the device to suspend.  If the remote processor
> > > > > > > > > can't be recovered than it needs to be dealth with.
> > > > > > > > For the normal recovery failed case, it still need to do pm_relax to not
> > > > > > > > prevent the device goes to suspend. It is what in normal recovery failed
> > > > > > > > case we do in rproc_crash_handler_work as well.
> > > > > > > > rproc_crash_handler_work will not check the result of the
> > > > > > > > rproc_trigger_recovery return value, and will always do pm_relax.
> > > > > > > > 
> > > > > > > > For current conconrency cornor case as well, it is better to consistant
> > > > > > > > with the current design of recovery fail senarios in normal cases.
> > > > > > > > 
> > > > > > > > I personally agree that we shouldn't do nothing when it is a
> > > > > > > > RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
> > > > > > > > check, because it maybe crash happened when it is trying to do the recovery.
> > > > > > > > So I suggested to do a continue try of trigger recovery again instead of
> > > > > > > > doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
> > > > > > > > > > before the second crash handler call pm_stay_awake or not.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I've been thinking about that part.  I don't think adding a wake_count to
> > > > > > > > > control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
> > > > > > > > > similar count happening in the PM runtime subsystem and that is what we should
> > > > > > > > > be using.  I have asked a power management expert at Linaro for guidance with
> > > > > > > > > this matter.  I should be able to get back to you with a way forward by the end
> > > > > > > > > of next week.
> > > > > > > > > 
> > > > > > > > Thx for the specific date provided as well. I will wait until your reply
> > > > > > > > for next patchset then.
> > > > > > > > 
> > > > > > > > > > So, What about the atomic count along with pm_relax and pm_stay_awake ?
> > > > > > > > > > 
> > > > > > > > > > struct rproc{
> > > > > > > > > > ...
> > > > > > > > > > atomic_t wake_count;
> > > > > > > > > > ...
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > rproc_pm_stay_awake()
> > > > > > > > > > {
> > > > > > > > > >         atomic_inc(&wake_count);
> > > > > > > > > >         pm_stay_awake();
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > rproc_pm_relax()
> > > > > > > > > > {
> > > > > > > > > >         if (atomic_dec_return(&wake_count) == 0)
> > > > > > > > > >                 pm_stay_awake();
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > can refer code like:
> > > > > > > > > > 
> > > > > > > > > > rproc_report_crash()
> > > > > > > > > > {
> > > > > > > > > >         ...
> > > > > > > > > >         rproc_pm_stay_awake();
> > > > > > > > > >         queue_work();
> > > > > > > > > >         ...
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > > {
> > > > > > > > > >         ...
> > > > > > > > > >         if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> > > > > > > > > >                       /* We have raced with rproc_shutdown() */
> > > > > > > > > >                       rproc_pm_relax();
> > > > > > > > > >                 mutex_unlock(&rproc->lock);
> > > > > > > > > >                 return;
> > > > > > > > > >               }
> > > > > > > > > >         ...
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Mathieu
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > recovery fail case 1:
> > > > > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > > > > | second crashed interrupt issued      | rproc_report_crash()
> > > > > > > > > > > > | rproc_report_crash()                 |          pm_stay_awake()
> > > > > > > > > > > > |          pm_stay_awake()             |          queue_work()
> > > > > > > > > > > > |          queue_work()                |rproc_crash_handler_work()
> > > > > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > > > > |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > |mutex_lock(&rproc->lock);             |pm_relax()
> > > > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > > > > |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > 
> > > > > > > > > > > > recovery fail case 2:
> > > > > > > > > > > > |                                      |firstcrash interrupt issued
> > > > > > > > > > > > |                                      | rproc_report_crash()
> > > > > > > > > > > > |                                      |          pm_stay_awake()
> > > > > > > > > > > > |                                      |          queue_work()
> > > > > > > > > > > > |                                      |rproc_crash_handler_work()
> > > > > > > > > > > > |                                      |mutex_lock(&rproc->lock);
> > > > > > > > > > > > |                                      |rproc_stop()
> > > > > > > > > > > > |                                      |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > |                                      |RPROC_RECOVERY_FAIL //new
> > > > > > > > > > > > |                                      |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > |                                      |pm_relax()
> > > > > > > > > > > > |
> > > > > > > > > > > > | second crashed interrupt issued      |
> > > > > > > > > > > > | rproc_report_crash()                 |
> > > > > > > > > > > > |          pm_stay_awake()             |
> > > > > > > > > > > > |          queue_work()                |
> > > > > > > > > > > > |pm_stay_awake()
> > > > > > > > > > > > |mutex_lock(&rproc->lock);
> > > > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)    |
> > > > > > > > > > > > |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
> > > > > > > > > > > > |mutex_unlock(&rproc->lock);           |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > |                                      |
> > > > > > > > > > > > 
> > > > > > > > > > > > Maybe I can have:
> > > > > > > > > > > > 1. the pm_stay_awake and pm_relax with count based and call with paired for
> > > > > > > > > > > > fix current concurency issue.
> > > > > > > > > > > > 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
> > > > > > > > > > > > work.
> > > > > > > > > > > > 3. handle RPROC_DETACHED case.
> > > > > > > > > > > > 
> > > > > > > > > > > > On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
> > > > > > > > > > > > > <arnaud.pouliquen@foss.st.com> wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > > > On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > > > On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> > > > > > > > > > > > > > > > > > > > Hi Mathieu,
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> > > > > > > > > > > > > > > > > > > > > Please add what has changed from one version to another, either in a cover
> > > > > > > > > > > > > > > > > > > > > letter or after the "Signed-off-by".  There are many examples on how to
> > > > > > > > > > > > > > > > > > > > > do that
> > > > > > > > > > > > > > > > > > > > > on the mailing list.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Thx for the information, will take a note and benefit for next time.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> > > > > > > > > > > > > > > > > > > > > > RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > > > > is in progress and no chance to do the pm_relax.
> > > > > > > > > > > > > > > > > > > > > > Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > > > > state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > > > > and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > You are correct - because the lock is held rproc->state should be set to
> > > > > > > > > > > > > > > > > > > > > RPROC_RUNNING
> > > > > > > > > > > > > > > > > > > > > when rproc_trigger_recovery() returns.  If that is not the case then
> > > > > > > > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > > > > > > went wrong.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
> > > > > > > > > > > > > > > > > > > > > returning,
> > > > > > > > > > > > > > > > > > > > > so we know the remote processor was stopped.  Therefore if rproc->state
> > > > > > > > > > > > > > > > > > > > > is set
> > > > > > > > > > > > > > > > > > > > > to RPROC_OFFLINE something went wrong in either request_firmware() or
> > > > > > > > > > > > > > > > > > > > > rproc_start().  Either way the remote processor is offline and the system
> > > > > > > > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > > > > > > in an unknown/unstable.  As such I don't see how calling pm_relax() can help
> > > > > > > > > > > > > > > > > > > > > things along.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
> > > > > > > > > > > > > > > > > > > > finished.
> > > > > > > > > > > > > > > > > > > > Even if it is multi crash rproc_crash_handler_work contention issue, and
> > > > > > > > > > > > > > > > > > > > last rproc_trigger_recovery bailed out with only
> > > > > > > > > > > > > > > > > > > > rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
> > > > > > > > > > > > > > > > > > > > Since the subsystem may still can be recovered with customer's next trigger
> > > > > > > > > > > > > > > > > > > > of rproc_start, and we can make each error out path clean with pm resources.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > I suggest spending time understanding what leads to the failure when
> > > > > > > > > > > > > > > > > > > > > recovering
> > > > > > > > > > > > > > > > > > > > > from a crash and address that problem(s).
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > In current case, the customer's information is that the issue happened when
> > > > > > > > > > > > > > > > > > > > rproc_shutdown is triggered at similar time. So not an issue from error out
> > > > > > > > > > > > > > > > > > > > of rproc_trigger_recovery.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > That is a very important element to consider and should have been mentioned
> > > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > > the beginning.  What I see happening is the following:
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > rproc_report_crash()
> > > > > > > > > > > > > > > > > > >                 pm_stay_awake()
> > > > > > > > > > > > > > > > > > >                 queue_work() // current thread is suspended
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > rproc_shutdown()
> > > > > > > > > > > > > > > > > > >                 rproc_stop()
> > > > > > > > > > > > > > > > > > >                         rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > rproc_crash_handler_work()
> > > > > > > > > > > > > > > > > > >                 if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > > > >                         return // pm_relax() is not called
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> > > > > > > > > > > > > > > > > > > rproc_detach(), along with a very descriptive comment as to why it is needed.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thinking about this further there are more ramifications to consider.  Please
> > > > > > > > > > > > > > > > > > confirm the above scenario is what you are facing.  I will advise on how to
> > > > > > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > > > > forward if that is the case.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Not sure if the situation is clear or not. So resend the email again.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The above senario is what customer is facing. crash hanppened while at
> > > > > > > > > > > > > > > > > the same time shutdown is triggered.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Unfortunately this is not enough details to address a problem as
> > > > > > > > > > > > > > > > complex as this one.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > And the device cannto goes to suspend state after that.
> > > > > > > > > > > > > > > > > the subsystem can still be start normally after this.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > If the code flow I pasted above reflects the problem at hand, the
> > > > > > > > > > > > > > > > current patch will not be sufficient to address the issue.  If Arnaud
> > > > > > > > > > > > > > > > confirms my suspicions we will have to think about a better solution.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Mathiew,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Could you pls have more details of any side effects other then power issue of
> > > > > > > > > > > > > > > the current senario?
> > > > > > > > > > > > > > > Why the current patch is not sufficient pls?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Have the current senario in details with rproc->lock information in details:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > | subsystem crashed interrupt issued      | user trigger shutdown
> > > > > > > > > > > > > > > | rproc_report_crash()                    |
> > > > > > > > > > > > > > > |          pm_stay_awake()                |
> > > > > > > > > > > > > > > |          queue_work()                   |
> > > > > > > > > > > > > > > |                                         |rproc_shutdown
> > > > > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > > > > |                                         |rproc_stop()
> > > > > > > > > > > > > > > |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> > > > > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > |mutex_lock(&rproc->lock);                |
> > > > > > > > > > > > > > > |if (rproc->state == RPROC_OFFLINE)       |
> > > > > > > > > > > > > > > |return // pm_relax() is not called       |rproc_boot
> > > > > > > > > > > > > > > |mutex_unlock(&rproc->lock);              |
> > > > > > > > > > > > > > > |                                         |mutex_lock(&rproc->lock);
> > > > > > > > > > > > > > > |                                         |rproc_start()
> > > > > > > > > > > > > > > |                                         |mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Agree with Mathieu, this is not so simple.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks for looking into this.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Here is my view  hoping I haven't missed a point in your discussion or
> > > > > > > > > > > > > > an other corner cases.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I tried to analyze the issues (in what follows, the term "condition" means
> > > > > > > > > > > > > > the "if" condition in which Aiqun proposes to add the fix) :
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I can see 4 use cases with race condition
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 1) crash report while already one is treated (rproc_boot_recovery called)
> > > > > > > > > > > > > >            => not a real use case as if the remote processor is crashed we
> > > > > > > > > > > > > >                     should not have a second crash report
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > That part is of great concern to me.  *Theoretically* we should not
> > > > > > > > > > > > > get a new crash report while one has already been dispatched but the
> > > > > > > > > > > > > current code accounts for this scenario and as such the possibility
> > > > > > > > > > > > > can't be dismissed.  Therefore we need to expect rproc_report_crash()
> > > > > > > > > > > > > to be called multiple times before a single instance of
> > > > > > > > > > > > > rproc_boot_recovery() is scheduled.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 2) rproc_stop executed between the queuing of the crash work and the call of
> > > > > > > > > > > > > >         rproc_crash_handler_work
> > > > > > > > > > > > > >          => rproc->state = RPROC_OFFLINE
> > > > > > > > > > > > > >          => we enter in the "condition" and the pm_relax has to be called
> > > > > > > > > > > > > >          => This commit fix should solve this use case
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 3) rproc_detach executed between the queue of the crash work and the call of
> > > > > > > > > > > > > >         rproc_crash_handler_work
> > > > > > > > > > > > > >          => rproc->state = RPROC_DETACHED;
> > > > > > > > > > > > > >          => we don't go in "the condition" and issue because the recovery reattach
> > > > > > > > > > > > > >             to the remote processor
> > > > > > > > > > > > > >          => but pm_relax is called
> > > > > > > > > > > > > >          => probably need an extra fix to avoid to re-attach
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 4) crash report while already one is treated (rproc_attach_recovery called)
> > > > > > > > > > > > > >          this one corresponds to an auto reboot of the remote processor, with a
> > > > > > > > > > > > > >          new crash
> > > > > > > > > > > > > >          => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> > > > > > > > > > > > > >          4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> > > > > > > > > > > > > >               => should call pm_relax if rproc->recovery_disabled = true
> > > > > > > > > > > > > >               => commit does not work for this use case
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          4)b) rproc->state = RPROC_DETACHED if recovery fails
> > > > > > > > > > > > > >              => error case with an unstable state
> > > > > > > > > > > > > >              => how to differentiate it from the use case 3) ?
> > > > > > > > > > > > > >              => introduce a RPROC_RECOVERY_FAIL state?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The case where a recovery fails needs to be considered and is the
> > > > > > > > > > > > > reason the original patch doesn't work.  Right now in
> > > > > > > > > > > > > rproc_crash_handler_work(), it is not possible to differentiate
> > > > > > > > > > > > > between a legitimate shutdown request (scenario #2 above) and a
> > > > > > > > > > > > > recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
> > > > > > > > > > > > > would greatly simplify things.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > My initial evaluation had not considered the attach/detach scenarios -
> > > > > > > > > > > > > thanks for adding that in the mix.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Aiqun, please send a new patchset that adds a new remote processor
> > > > > > > > > > > > > state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
> > > > > > > > > > > > > that set that takes attach/detach scenarios into account.  The code
> > > > > > > > > > > > > between the v6.0 and v6.1 cycle has changed a lot in that area so make
> > > > > > > > > > > > > sure to properly rebase.
> > > > > > > > > > > > > 
> > > > > > > > > > > > I will try.
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Then pm_stay_awake is called when the crash work is queued.
> > > > > > > > > > > > > > It seems to me coherent to call the pm_relax in the work handler.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Here is a quick and dirty patch (not tested) that should take into account the
> > > > > > > > > > > > > > main use cases ( except 1) and 4)b) )
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >               mutex_lock(&rproc->lock);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> > > > > > > > > > > > > > +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> > > > > > > > > > > > > > +           rproc->state == RPROC_DETACHED) {
> > > > > > > > > > > > > >                       /* handle only the first crash detected */
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +               /*
> > > > > > > > > > > > > > +                * call pm-relax in following use cases:
> > > > > > > > > > > > > > +                * - the remote processor has been stopped by the user
> > > > > > > > > > > > > > +                * - the remote processor is detached
> > > > > > > > > > > > > > +                + - the remote proc has an autonomous reset but recovery_disabled is true.
> > > > > > > > > > > > > > +                */
> > > > > > > > > > > > > > +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> > > > > > > > > > > > > > +                       pm_relax(rproc->dev.parent);
> > > > > > > > > > > > > >                       mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > >                       return;
> > > > > > > > > > > > > >               }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Arnaud
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > Mathieu
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > When the state is in RPROC_OFFLINE it means separate request
> > > > > > > > > > > > > > > > > > > > > > of rproc_stop was done and no need to hold the wakeup source
> > > > > > > > > > > > > > > > > > > > > > in crash handler to recover any more.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
> > > > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > > > >           drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> > > > > > > > > > > > > > > > > > > > > >           1 file changed, 11 insertions(+)
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > > > b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > > > index e5279ed9a8d7..6bc7b8b7d01e 100644
> > > > > > > > > > > > > > > > > > > > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > > > > > > > > > > > > > > > > > > > @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
> > > > > > > > > > > > > > > > > > > > > > work_struct *work)
> > > > > > > > > > > > > > > > > > > > > >                  if (rproc->state == RPROC_CRASHED || rproc->state ==
> > > > > > > > > > > > > > > > > > > > > > RPROC_OFFLINE) {
> > > > > > > > > > > > > > > > > > > > > >                          /* handle only the first crash detected */
> > > > > > > > > > > > > > > > > > > > > >                          mutex_unlock(&rproc->lock);
> > > > > > > > > > > > > > > > > > > > > > +         /*
> > > > > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE state indicate there is no recovery process
> > > > > > > > > > > > > > > > > > > > > > +          * is in progress and no chance to have pm_relax in place.
> > > > > > > > > > > > > > > > > > > > > > +          * Because when recovering from crash, rproc->lock is held and
> > > > > > > > > > > > > > > > > > > > > > +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> > > > > > > > > > > > > > > > > > > > > > +          * and then unlock rproc->lock.
> > > > > > > > > > > > > > > > > > > > > > +          * RPROC_OFFLINE is only an intermediate state in recovery
> > > > > > > > > > > > > > > > > > > > > > +          * process.
> > > > > > > > > > > > > > > > > > > > > > +          */
> > > > > > > > > > > > > > > > > > > > > > +         if (rproc->state == RPROC_OFFLINE)
> > > > > > > > > > > > > > > > > > > > > > +                 pm_relax(rproc->dev.parent);
> > > > > > > > > > > > > > > > > > > > > >                          return;
> > > > > > > > > > > > > > > > > > > > > >                  }
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > 2.7.4
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > Thx and BRs,
> > > > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Thx and BRs,
> > > > > > > > > > Aiqun(Maria) Yu
> > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Thx and BRs,
> > > > > > > > Aiqun(Maria) Yu
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Thx and BRs,
> > > > > > Aiqun(Maria) Yu
> > > > 
> > > > 
> > > > -- 
> > > > Thx and BRs,
> > > > Aiqun(Maria) Yu
> 
> 
> -- 
> Thx and BRs,
> Aiqun(Maria) Yu
Aiqun Yu (Maria) Dec. 2, 2022, 1:17 a.m. UTC | #26
On 12/2/2022 7:00 AM, Mathieu Poirier wrote:
> On Mon, Nov 28, 2022 at 10:18:28AM +0800, Aiqun(Maria) Yu wrote:
>> On 11/26/2022 2:37 AM, Mathieu Poirier wrote:
>>> On Fri, Nov 18, 2022 at 11:52:46AM -0700, Mathieu Poirier wrote:
>>>> On Tue, Nov 15, 2022 at 09:30:58AM +0800, Aiqun(Maria) Yu wrote:
>>>>> Hi,
>>>>>
>>>>> On 11/15/2022 5:18 AM, Mathieu Poirier wrote:
>>>>>> On Fri, Nov 11, 2022 at 08:52:11AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>> On 11/11/2022 4:50 AM, Mathieu Poirier wrote:
>>>>>>>> I had a couple of good discussions with our power management expert
>>>>>>>> and even then, the way forward isn't as clear as I would have liked.
>>>>>>>> I am currently travelling and as such don't have the required time to
>>>>>>>> go into greater details, something I will be doing next week.
>>>>>>>>
>>>>>>> Thx Mathieu for the info updated.
>>>>>>> I'll wait for your update next week then.
>>>>>>> let me know any initial questions that you have, perhaps I can also discuss
>>>>>>> that with our power team at the same time.
>>>>>>
>>>>>> The problem is to determine exactly what the WQ_FREEZABLE flag does to the
>>>>>> rproc_recovery_wq workueue.  The documentation [1] indicate that work items on the WQ are
>>>>>> drained before the system is suspended.  What I understand from this is that if
>>>>>> two work items are queued and one is executing at the time a system suspend is
>>>>>> initiated, all three items will be executed before the system is allowed to be
>>>>>> suspended.  _If_ that is the case, there would not be a need to call
>>>>>> pm_stay_awake() and pm_relax() at all.
>>>>>>
>>>>>> On the other hand, the PM resource I spoke to thought that in reality things
>>>>>> don't happen that way.  Taking the same above scenario where 2 work items are
>>>>>> queued and one is executing at the time of the suspend, only the work item that
>>>>>> is executing will be allowed to execute to completion before the system is
>>>>>> suspended.  The remaining two items that were queued will not execute.
>>>>>>
>>>>>> If that is the case then we do need to call pm_stay_awake() and pm_relax(), and
>>>>>> find another strategy to fix this situation.
>>>>>>
>>>>>> Until we have a clear view of how the WQ_FREEZABLE flag works, we won't be able
>>>>>> to move forward with this patchset.  Unfortunately, I currently do not have the
>>>>>> time to look into this.
>>>>> I had a check on the WQ_FREEZABLE flag, here is my understanding:
>>>>>
>>>>> when the interrupt happened, it still need pm_stay_awake to make sure
>>>>> queue_work action can active the work instead of susepend the device.
>>>>>
>>>>> 1. If  WQ_FREEZABLE, pwq->max_active = 0; // maximum number of in-flight
>>>>> work items is set to 0.
>>>>>
>>>>> [1].
>>>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L3748
>>>>>
>>>>> 2. If  WQ_FREEZABLE, will only check pwq->nr_active to see if there is still
>>>>> freeze_workqueues_busy.
>>>>> [2].
>>>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L5270
>>>>>
>>>>> 3. When in queue_work, if max_active is 0, when do queue_work it will not
>>>>> actually active the work.
>>>>> [3].
>>>>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1418
>>>>>
>>>>> for the current issue, the work is already complete and forget to set
>>>>> pm_relax in some condition that make the system cannot be suspended.
>>>>
>>>> What you have above does not describe what happens to work already queue to a
>>>> WQ_FREEZABLE workqueue when the system gets suspended, and that is the most
>>>> important thing.
>>>>
>>>> I ended up doing that investigation myself and found the documentation on
>>>> WQ_FREEZABLE seems to be accurate.  I am still discussing this with the Linaro
>>>> power management team and will get back to you when we reach a conclusion.
>>>>
>>>
>>> I finally have time to get back to this problem...
>>>
>>> As I reported above I went down to the bottom of how WQ_FREEZABLE workqueues
>>> work.  The conclusion of that research is that _any_ work, queued or currently
>>> executing, will finish before the system is suspended.  As such, as soon as
>>> queue_work() returns in rproc_report_crash(), you can be guaranteed what was
>>> just queued will be executed.  Therefore, calling pm_stay_awake() in
>>> rproc_report_crash() and pm_relax() in rproc_crash_handler_work() should not be
>>> needed because the system will not suspend for as long as there is work to be
>>> done by the freezable worqueues.
>>>
>> The problem is when crash interrupt happened in a later point(after
>> freeze_kernel_thread), the queue_work will result with the state of
>> WORK_STRUCT_INACTIVE and insert to inactive_works.
>> And this is the senario that pm_stay_awake wanted to address inside the
>> interrupt handler call.
>>
>> refer to:
>> [1]
>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/workqueue.c#L1512
>>
>> the current suspend simple flow can be:
>>
>>          +-+-------------------+
>>          |   freeze_processes  |
>>          +----------+----------+
>>                     |
>>                     |
>>         +-----------|-----------+
>>         | freeze_kernel_thread  |
>>         +-----------+-----------+
>>                     |
>>          +-+--------|----------+
>>          |   device_suspend    |
>>          +----------|----------+
>>                     |
>>                     |
>>         +-----------+-----------+
>>         | pm_wakeup_pending     |
>>         +-----------|-----------+
>>                     |
>>                     |
>>   +-----------------+-------------------+
>>   | suspend_ops->enter (machine suspend)|
>>   +-------------------------------------+
>>
>>
>> refer to:
>> suspend_prepare -> suspend_freeze_processes -->freeze_processes and
>> freeze_kernel_threads
>>
>>
>>
>> refer to:
>>
>> suspend_devices_and_enter -->
>> dpm_suspend_start -> dpm_suspend --> device_suspend --> device suspend
>> callbacks
>> suspend_enter -->dpm_suspend_late-->device_suspend_late
>>
>> [2]
>> https://elixir.bootlin.com/linux/v6.1-rc4/source/kernel/power/suspend.c#L582
>>
>>> I don't have details on what Rishabh - the author of the patch that added the
>>> calls - was facing but based on the my understanding of freezable workqueues, it
>>> seem to have been the wrong solution.
>> In my opinion, the freezable workqueues are in the right solution. It only
>> protect for the system freeze state.
>> While if we want the crash interrupt to be handled to prevent current
>> suspend process, it still need to deal with other suspend state to machin
>> suspend senarios.
>> Typical example can be before disable the device interrupt in device suspend
>> callbacks, it will flush the necessary workqueues.
>>
>> while in our case, since the crash handler will invoke a subsystem restart
>> process which reply on user thread uevent firmware loaders, so it need to
>> abort the suspend process with pm_stay_awake called.
>>
>> So in a word, pm_stay_awake is necessary per my understanding.
>>
> 
> I see your point - thanks for being perseverant.  What is important to
> understand is that pm_stay_awake() can stop and reverse the suspend process that
> is currently underway.
Thx for the understanding as well.
Perseverant in most of time can be a win win result.
> 
> With the code you had for V4 in mind, I suggest to modify
> rproc_crash_handler_work() to be:
> 
> ...
> ...
> ...
> 
> if (rproc->state == RPROC_CRASHED) {
>          /* handle only the first crash detected */
>          mutex_unlock(&rproc->lock);
>          return;
> }
> 
> if (rproc->state == RPROC_OFFLINE) {
>          /* The remote processor is offline, no need to recover anything */
>          mutex_unlock(&rproc->lock);
>          goto out;
> }
> 
> ...
> ...
> ...
> 
> out:
>          pm_relax();
> > Do you think that will work for you?
> 
This is good.

Also, alloc_ordered_workqueue instead of alloc_workqueue is suggested in 
the same patch.
Because with alloc_ordered_workqueue, only 1 work will be actively 
executing. It is to avoid unnecessary contentions which have multi crash 
work executing at the same time.

I will upload new patchset for discussion.
> 
>>>
>>> Until we have a clear vision on what Rishabh was trying to fix, nothing in that
>>> area will be changing.  I suggest you touch base with him and sort it out.  If
>>> that is not possible then calls to pm_stay_awake() and pm_relax() should be
>>> removed.
>>>
>>>>>
>>>>>>
>>>>>> If you want to take on this investigation, keep in mind that any conclusion will
>>>>>> need to be backed by a proof.  That can be debug messages on a console output or
>>>>>> a code reference in the workqueue core.
>>>>>>
>>>>>> [1]. https://elixir.bootlin.com/linux/v6.1-rc4/source/Documentation/core-api/workqueue.rst#L184
>>>>>>
>>>>>>
>>>>>>>> On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
>>>>>>>>>> On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me think about this carefully.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When in RPROC_RECOVERY_FAIL case we want to re-do the recovery process again
>>>>>>>>>>>>> or just leave the pm_relax?
>>>>>>>>>>>>
>>>>>>>>>>>> Neither.
>>>>>>>>>>>>
>>>>>>>>>>>> When a recovery fail we don't want to call pm_relax().  The code in
>>>>>>>>>>>> rproc_crash_handler_work() becomes:
>>>>>>>>>>>>
>>>>>>>>>>>>         if (rproc->state == RPROC_OFFLINE) {
>>>>>>>>>>>>                       /* We have raced with rproc_shutdown() */
>>>>>>>>>>>>                       pm_relax()
>>>>>>>>>>>>                 mutex_unlock(&rproc->lock);
>>>>>>>>>>>>                 return;
>>>>>>>>>>>>               }
>>>>>>>>>>>>
>>>>>>>>>>>>         if (rproc->state == RPROC_CRASHED ||
>>>>>>>>>>>>                   rproc->state == RPROC_RECOVERY_FAILED) {
>>>>>>>>>>>>                 /* handle only the first crash detected */
>>>>>>>>>>>>                 mutex_unlock(&rproc->lock);
>>>>>>>>>>>>                 return;
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or
>>>>>>>>>>>> rproc_start() fail.  Function rproc_trigger_recovery() needs to allow for the
>>>>>>>>>>>> recovery the the remote processor is in RPROC_RECOVERY_FAILED state.  As such
>>>>>>>>>>>> the condition becomes:
>>>>>>>>>>>>
>>>>>>>>>>>>               /* State could have changed before we got the mutex */
>>>>>>>>>>>>         if (rproc->state != RPROC_CRASHED &&
>>>>>>>>>>>>                   rproc->state != RPROC_RECOVERY_FAILED)
>>>>>>>>>>>>                 goto unlock_mutex;
>>>>>>>>>>>>
>>>>>>>>>>>> Start with that and we can look at corner cases (if some exists) with a fresh
>>>>>>>>>>>> patchset.  Note that I have not addressed the attach/detach() scenario in the
>>>>>>>>>>>> above.
>>>>>>>>>>>
>>>>>>>>>>> If we didn't deal with the recovery failed case with correct pm_relax call,
>>>>>>>>>>> it may left the device in a state that cannot enter to suspend state.
>>>>>>>>>>
>>>>>>>>>> That is what I am looking for.  We don't want to give the impression that
>>>>>>>>>> everything is fine by allowing the device to suspend.  If the remote processor
>>>>>>>>>> can't be recovered than it needs to be dealth with.
>>>>>>>>> For the normal recovery failed case, it still need to do pm_relax to not
>>>>>>>>> prevent the device goes to suspend. It is what in normal recovery failed
>>>>>>>>> case we do in rproc_crash_handler_work as well.
>>>>>>>>> rproc_crash_handler_work will not check the result of the
>>>>>>>>> rproc_trigger_recovery return value, and will always do pm_relax.
>>>>>>>>>
>>>>>>>>> For current conconrency cornor case as well, it is better to consistant
>>>>>>>>> with the current design of recovery fail senarios in normal cases.
>>>>>>>>>
>>>>>>>>> I personally agree that we shouldn't do nothing when it is a
>>>>>>>>> RPROC_RECOVERY_FAILED senario when it is in rproc_crash_handler_work
>>>>>>>>> check, because it maybe crash happened when it is trying to do the recovery.
>>>>>>>>> So I suggested to do a continue try of trigger recovery again instead of
>>>>>>>>> doing nothing and bail out if it is a RPROC_RECOVERY_FAILED state.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Because first PROC_RECOVERY_FAIL case cannot ensure it have pm_relax called
>>>>>>>>>>> before the second crash handler call pm_stay_awake or not.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've been thinking about that part.  I don't think adding a wake_count to
>>>>>>>>>> control calls to pm_stay_awake()/pm_relax() is the best way to go.  There is a
>>>>>>>>>> similar count happening in the PM runtime subsystem and that is what we should
>>>>>>>>>> be using.  I have asked a power management expert at Linaro for guidance with
>>>>>>>>>> this matter.  I should be able to get back to you with a way forward by the end
>>>>>>>>>> of next week.
>>>>>>>>>>
>>>>>>>>> Thx for the specific date provided as well. I will wait until your reply
>>>>>>>>> for next patchset then.
>>>>>>>>>
>>>>>>>>>>> So, What about the atomic count along with pm_relax and pm_stay_awake ?
>>>>>>>>>>>
>>>>>>>>>>> struct rproc{
>>>>>>>>>>> ...
>>>>>>>>>>> atomic_t wake_count;
>>>>>>>>>>> ...
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> rproc_pm_stay_awake()
>>>>>>>>>>> {
>>>>>>>>>>>          atomic_inc(&wake_count);
>>>>>>>>>>>          pm_stay_awake();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> rproc_pm_relax()
>>>>>>>>>>> {
>>>>>>>>>>>          if (atomic_dec_return(&wake_count) == 0)
>>>>>>>>>>>                  pm_stay_awake();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> can refer code like:
>>>>>>>>>>>
>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>> {
>>>>>>>>>>>          ...
>>>>>>>>>>>          rproc_pm_stay_awake();
>>>>>>>>>>>          queue_work();
>>>>>>>>>>>          ...
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>> {
>>>>>>>>>>>          ...
>>>>>>>>>>>          if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
>>>>>>>>>>>                        /* We have raced with rproc_shutdown() */
>>>>>>>>>>>                        rproc_pm_relax();
>>>>>>>>>>>                  mutex_unlock(&rproc->lock);
>>>>>>>>>>>                  return;
>>>>>>>>>>>                }
>>>>>>>>>>>          ...
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> recovery fail case 1:
>>>>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>>>>> | second crashed interrupt issued      | rproc_report_crash()
>>>>>>>>>>>>> | rproc_report_crash()                 |          pm_stay_awake()
>>>>>>>>>>>>> |          pm_stay_awake()             |          queue_work()
>>>>>>>>>>>>> |          queue_work()                |rproc_crash_handler_work()
>>>>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>>>>> |rproc_crash_handler_work()            |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>> |mutex_lock(&rproc->lock);             |pm_relax()
>>>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>>>>> |return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>>
>>>>>>>>>>>>> recovery fail case 2:
>>>>>>>>>>>>> |                                      |firstcrash interrupt issued
>>>>>>>>>>>>> |                                      | rproc_report_crash()
>>>>>>>>>>>>> |                                      |          pm_stay_awake()
>>>>>>>>>>>>> |                                      |          queue_work()
>>>>>>>>>>>>> |                                      |rproc_crash_handler_work()
>>>>>>>>>>>>> |                                      |mutex_lock(&rproc->lock);
>>>>>>>>>>>>> |                                      |rproc_stop()
>>>>>>>>>>>>> |                                      |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>> |                                      |RPROC_RECOVERY_FAIL //new
>>>>>>>>>>>>> |                                      |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>> |                                      |pm_relax()
>>>>>>>>>>>>> |
>>>>>>>>>>>>> | second crashed interrupt issued      |
>>>>>>>>>>>>> | rproc_report_crash()                 |
>>>>>>>>>>>>> |          pm_stay_awake()             |
>>>>>>>>>>>>> |          queue_work()                |
>>>>>>>>>>>>> |pm_stay_awake()
>>>>>>>>>>>>> |mutex_lock(&rproc->lock);
>>>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)    |
>>>>>>>>>>>>> |return // still need do pm_relax if RPROC_RECOVERY_FAIL?  |
>>>>>>>>>>>>> |mutex_unlock(&rproc->lock);           |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>> |                                      |
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I can have:
>>>>>>>>>>>>> 1. the pm_stay_awake and pm_relax with count based and call with paired for
>>>>>>>>>>>>> fix current concurency issue.
>>>>>>>>>>>>> 2. RPROC_RECOVERY_FAIL can be another patch for continue try to do recovery
>>>>>>>>>>>>> work.
>>>>>>>>>>>>> 3. handle RPROC_DETACHED case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN
>>>>>>>>>>>>>> <arnaud.pouliquen@foss.st.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@quicinc.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Mathieu,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>>>>>>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>>>>>>>>>>>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>>>>>>>>>>>>>>>>> do that
>>>>>>>>>>>>>>>>>>>>>> on the mailing list.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>>>>>>>>>>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>>>>> and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>>>>>>>>>>>>>>>>> RPROC_RUNNING
>>>>>>>>>>>>>>>>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>> went wrong.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>>>>>>>>>>>>>>>>> returning,
>>>>>>>>>>>>>>>>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>>>>>>>>>>>>>>>>> is set
>>>>>>>>>>>>>>>>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>>>>>>>>>>>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>>>>>>>>>>>>>>>>> probably
>>>>>>>>>>>>>>>>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>>>>>>>>>>>>>>>>> things along.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>>>>>>>>>>>>>>>>> finished.
>>>>>>>>>>>>>>>>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>>>>>>>>>>>>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>>>>>>>>>>>>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>>>>>>>>>>>>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>>>>>>>>>>>>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>>>>>>>>>>>>>>>>> recovering
>>>>>>>>>>>>>>>>>>>>>> from a crash and address that problem(s).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In current case, the customer's information is that the issue happened when
>>>>>>>>>>>>>>>>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>>>>>>>>>>>>>>>>> of rproc_trigger_recovery.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> That is a very important element to consider and should have been mentioned
>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>> the beginning.  What I see happening is the following:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> rproc_report_crash()
>>>>>>>>>>>>>>>>>>>>                  pm_stay_awake()
>>>>>>>>>>>>>>>>>>>>                  queue_work() // current thread is suspended
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> rproc_shutdown()
>>>>>>>>>>>>>>>>>>>>                  rproc_stop()
>>>>>>>>>>>>>>>>>>>>                          rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> rproc_crash_handler_work()
>>>>>>>>>>>>>>>>>>>>                  if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>>>>                          return // pm_relax() is not called
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>>>>>>>>>>>>>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thinking about this further there are more ramifications to consider.  Please
>>>>>>>>>>>>>>>>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>>>>>>>>>>>>>>>>> move
>>>>>>>>>>>>>>>>>>> forward if that is the case.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Not sure if the situation is clear or not. So resend the email again.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The above senario is what customer is facing. crash hanppened while at
>>>>>>>>>>>>>>>>>> the same time shutdown is triggered.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Unfortunately this is not enough details to address a problem as
>>>>>>>>>>>>>>>>> complex as this one.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And the device cannto goes to suspend state after that.
>>>>>>>>>>>>>>>>>> the subsystem can still be start normally after this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the code flow I pasted above reflects the problem at hand, the
>>>>>>>>>>>>>>>>> current patch will not be sufficient to address the issue.  If Arnaud
>>>>>>>>>>>>>>>>> confirms my suspicions we will have to think about a better solution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Mathiew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you pls have more details of any side effects other then power issue of
>>>>>>>>>>>>>>>> the current senario?
>>>>>>>>>>>>>>>> Why the current patch is not sufficient pls?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Have the current senario in details with rproc->lock information in details:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> | subsystem crashed interrupt issued      | user trigger shutdown
>>>>>>>>>>>>>>>> | rproc_report_crash()                    |
>>>>>>>>>>>>>>>> |          pm_stay_awake()                |
>>>>>>>>>>>>>>>> |          queue_work()                   |
>>>>>>>>>>>>>>>> |                                         |rproc_shutdown
>>>>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>>>>> |                                         |rproc_stop()
>>>>>>>>>>>>>>>> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
>>>>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>> |mutex_lock(&rproc->lock);                |
>>>>>>>>>>>>>>>> |if (rproc->state == RPROC_OFFLINE)       |
>>>>>>>>>>>>>>>> |return // pm_relax() is not called       |rproc_boot
>>>>>>>>>>>>>>>> |mutex_unlock(&rproc->lock);              |
>>>>>>>>>>>>>>>> |                                         |mutex_lock(&rproc->lock);
>>>>>>>>>>>>>>>> |                                         |rproc_start()
>>>>>>>>>>>>>>>> |                                         |mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Agree with Mathieu, this is not so simple.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for looking into this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is my view  hoping I haven't missed a point in your discussion or
>>>>>>>>>>>>>>> an other corner cases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried to analyze the issues (in what follows, the term "condition" means
>>>>>>>>>>>>>>> the "if" condition in which Aiqun proposes to add the fix) :
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can see 4 use cases with race condition
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) crash report while already one is treated (rproc_boot_recovery called)
>>>>>>>>>>>>>>>             => not a real use case as if the remote processor is crashed we
>>>>>>>>>>>>>>>                      should not have a second crash report
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That part is of great concern to me.  *Theoretically* we should not
>>>>>>>>>>>>>> get a new crash report while one has already been dispatched but the
>>>>>>>>>>>>>> current code accounts for this scenario and as such the possibility
>>>>>>>>>>>>>> can't be dismissed.  Therefore we need to expect rproc_report_crash()
>>>>>>>>>>>>>> to be called multiple times before a single instance of
>>>>>>>>>>>>>> rproc_boot_recovery() is scheduled.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of
>>>>>>>>>>>>>>>          rproc_crash_handler_work
>>>>>>>>>>>>>>>           => rproc->state = RPROC_OFFLINE
>>>>>>>>>>>>>>>           => we enter in the "condition" and the pm_relax has to be called
>>>>>>>>>>>>>>>           => This commit fix should solve this use case
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) rproc_detach executed between the queue of the crash work and the call of
>>>>>>>>>>>>>>>          rproc_crash_handler_work
>>>>>>>>>>>>>>>           => rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>>>>           => we don't go in "the condition" and issue because the recovery reattach
>>>>>>>>>>>>>>>              to the remote processor
>>>>>>>>>>>>>>>           => but pm_relax is called
>>>>>>>>>>>>>>>           => probably need an extra fix to avoid to re-attach
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4) crash report while already one is treated (rproc_attach_recovery called)
>>>>>>>>>>>>>>>           this one corresponds to an auto reboot of the remote processor, with a
>>>>>>>>>>>>>>>           new crash
>>>>>>>>>>>>>>>           => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
>>>>>>>>>>>>>>>           4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
>>>>>>>>>>>>>>>                => should call pm_relax if rproc->recovery_disabled = true
>>>>>>>>>>>>>>>                => commit does not work for this use case
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           4)b) rproc->state = RPROC_DETACHED if recovery fails
>>>>>>>>>>>>>>>               => error case with an unstable state
>>>>>>>>>>>>>>>               => how to differentiate it from the use case 3) ?
>>>>>>>>>>>>>>>               => introduce a RPROC_RECOVERY_FAIL state?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The case where a recovery fails needs to be considered and is the
>>>>>>>>>>>>>> reason the original patch doesn't work.  Right now in
>>>>>>>>>>>>>> rproc_crash_handler_work(), it is not possible to differentiate
>>>>>>>>>>>>>> between a legitimate shutdown request (scenario #2 above) and a
>>>>>>>>>>>>>> recovery that went wrong.  I think introducing RPROC_RECOVERY_FAIL
>>>>>>>>>>>>>> would greatly simplify things.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My initial evaluation had not considered the attach/detach scenarios -
>>>>>>>>>>>>>> thanks for adding that in the mix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Aiqun, please send a new patchset that adds a new remote processor
>>>>>>>>>>>>>> state, i.e RPROC_RECOVERY_FAIL.  There should also be another patch in
>>>>>>>>>>>>>> that set that takes attach/detach scenarios into account.  The code
>>>>>>>>>>>>>> between the v6.0 and v6.1 cycle has changed a lot in that area so make
>>>>>>>>>>>>>> sure to properly rebase.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I will try.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then pm_stay_awake is called when the crash work is queued.
>>>>>>>>>>>>>>> It seems to me coherent to call the pm_relax in the work handler.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is a quick and dirty patch (not tested) that should take into account the
>>>>>>>>>>>>>>> main use cases ( except 1) and 4)b) )
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                mutex_lock(&rproc->lock);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
>>>>>>>>>>>>>>> +       if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
>>>>>>>>>>>>>>> +           rproc->state == RPROC_DETACHED) {
>>>>>>>>>>>>>>>                        /* handle only the first crash detected */
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +               /*
>>>>>>>>>>>>>>> +                * call pm-relax in following use cases:
>>>>>>>>>>>>>>> +                * - the remote processor has been stopped by the user
>>>>>>>>>>>>>>> +                * - the remote processor is detached
>>>>>>>>>>>>>>> +                + - the remote proc has an autonomous reset but recovery_disabled is true.
>>>>>>>>>>>>>>> +                */
>>>>>>>>>>>>>>> +               if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
>>>>>>>>>>>>>>> +                       pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>                        mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>                        return;
>>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Arnaud
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Mathieu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>>>>>>>>>>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>>>>>>>>>>>>>>>>> in crash handler to recover any more.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>            drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>>>>>>>>>>>>>>>>            1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>>                   if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>>>>>>>>>>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>>>>>>>>>>>>>>>>                           /* handle only the first crash detected */
>>>>>>>>>>>>>>>>>>>>>>>                           mutex_unlock(&rproc->lock);
>>>>>>>>>>>>>>>>>>>>>>> +         /*
>>>>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>>>>>>>>>>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>>>>>>>>>>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>>>>>>>>>>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>>>>>>>>>>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>>>>>>>>>>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>>>>>>>>>>>>>>>>> +          * process.
>>>>>>>>>>>>>>>>>>>>>>> +          */
>>>>>>>>>>>>>>>>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>>>>>>>>>>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>>>>>>>>>>>>>>>>                           return;
>>>>>>>>>>>>>>>>>>>>>>>                   }
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thx and BRs,
>>>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thx and BRs,
>>>>>>>>> Aiqun(Maria) Yu
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Thx and BRs,
>>>>>>> Aiqun(Maria) Yu
>>>>>
>>>>>
>>>>> -- 
>>>>> Thx and BRs,
>>>>> Aiqun(Maria) Yu
>>
>>
>> -- 
>> Thx and BRs,
>> Aiqun(Maria) Yu
diff mbox series

Patch

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index e5279ed9a8d7..6bc7b8b7d01e 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1956,6 +1956,17 @@  static void rproc_crash_handler_work(struct work_struct *work)
 	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
 		/* handle only the first crash detected */
 		mutex_unlock(&rproc->lock);
+		/*
+		 * RPROC_OFFLINE state indicate there is no recovery process
+		 * is in progress and no chance to have pm_relax in place.
+		 * Because when recovering from crash, rproc->lock is held and
+		 * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
+		 * and then unlock rproc->lock.
+		 * RPROC_OFFLINE is only an intermediate state in recovery
+		 * process.
+		 */
+		if (rproc->state == RPROC_OFFLINE)
+			pm_relax(rproc->dev.parent);
 		return;
 	}