rpmsg: mtk_rpmsg: Fix circular locking dependency

Message ID	20220114144737.375621-1-angelogioacchino.delregno@collabora.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> sender: kholk11) with ESMTPSA id A6C881F46A72 From: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> To: bjorn.andersson@linaro.org Cc: mathieu.poirier@linaro.org, matthias.bgg@gmail.com, pihsun@chromium.org, linux-remoteproc@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-mediatek@lists.infradead.org, linux-kernel@vger.kernel.org, kernel@collabora.com, AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Subject: [PATCH] rpmsg: mtk_rpmsg: Fix circular locking dependency Date: Fri, 14 Jan 2022 15:47:37 +0100 Message-Id: <20220114144737.375621-1-angelogioacchino.delregno@collabora.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	rpmsg: mtk_rpmsg: Fix circular locking dependency \| expand rpmsg: mtk_rpmsg: Fix circular locking dependency

AngeloGioacchino Del Regno Jan. 14, 2022, 2:47 p.m. UTC

During execution of the worker that's used to register rpmsg devices
we are safely locking the channels mutex but, when creating a new
endpoint for such devices, we are registering a IPI on the SCP, which
then makes the SCP to trigger an interrupt, lock its own mutex and in
turn register more subdevices.
This creates a circular locking dependency situation, as the mtk_rpmsg
channels_lock will then depend on the SCP IPI lock.

[   18.014514]  Possible unsafe locking scenario:
[   18.014515]        CPU0                    CPU1
[   18.014517]        ----                    ----
[   18.045467]   lock(&mtk_subdev->channels_lock);
[   18.045474]                                lock(&scp->ipi_desc[i].lock);
[   18.228399]                                lock(&mtk_subdev->channels_lock);
[   18.228405]   lock(&scp->ipi_desc[i].lock);
[   18.264405]

To solve this, simply unlock the channels_lock mutex before calling
mtk_rpmsg_register_device() and relock it right after, as safety is
still ensured by the locking mechanism that happens right after
through SCP.
Notably, mtk_rpmsg_register_device() does not even require locking.

Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
---
 drivers/rpmsg/mtk_rpmsg.c | 2 ++
 1 file changed, 2 insertions(+)

AngeloGioacchino Del Regno Feb. 16, 2022, 4:06 p.m. UTC | #1

Il 14/01/22 15:47, AngeloGioacchino Del Regno ha scritto:
> During execution of the worker that's used to register rpmsg devices
> we are safely locking the channels mutex but, when creating a new
> endpoint for such devices, we are registering a IPI on the SCP, which
> then makes the SCP to trigger an interrupt, lock its own mutex and in
> turn register more subdevices.
> This creates a circular locking dependency situation, as the mtk_rpmsg
> channels_lock will then depend on the SCP IPI lock.
> 
> [   18.014514]  Possible unsafe locking scenario:
> [   18.014515]        CPU0                    CPU1
> [   18.014517]        ----                    ----
> [   18.045467]   lock(&mtk_subdev->channels_lock);
> [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> [   18.228399]                                lock(&mtk_subdev->channels_lock);
> [   18.228405]   lock(&scp->ipi_desc[i].lock);
> [   18.264405]
> 
> To solve this, simply unlock the channels_lock mutex before calling
> mtk_rpmsg_register_device() and relock it right after, as safety is
> still ensured by the locking mechanism that happens right after
> through SCP.
> Notably, mtk_rpmsg_register_device() does not even require locking.
> 
> Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>

Friendly ping for an important fix... :)

> ---
>   drivers/rpmsg/mtk_rpmsg.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> index 5b4404b8be4c..d1213c33da20 100644
> --- a/drivers/rpmsg/mtk_rpmsg.c
> +++ b/drivers/rpmsg/mtk_rpmsg.c
> @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
>   		if (info->registered)
>   			continue;
>   
> +		mutex_unlock(&subdev->channels_lock);
>   		ret = mtk_rpmsg_register_device(subdev, &info->info);
> +		mutex_lock(&subdev->channels_lock);
>   		if (ret) {
>   			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
>   			continue;

Mathieu Poirier Feb. 17, 2022, 4:08 p.m. UTC | #2

On Wed, Feb 16, 2022 at 05:06:22PM +0100, AngeloGioacchino Del Regno wrote:
> Il 14/01/22 15:47, AngeloGioacchino Del Regno ha scritto:
> > During execution of the worker that's used to register rpmsg devices
> > we are safely locking the channels mutex but, when creating a new
> > endpoint for such devices, we are registering a IPI on the SCP, which
> > then makes the SCP to trigger an interrupt, lock its own mutex and in
> > turn register more subdevices.
> > This creates a circular locking dependency situation, as the mtk_rpmsg
> > channels_lock will then depend on the SCP IPI lock.
> > 
> > [   18.014514]  Possible unsafe locking scenario:
> > [   18.014515]        CPU0                    CPU1
> > [   18.014517]        ----                    ----
> > [   18.045467]   lock(&mtk_subdev->channels_lock);
> > [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> > [   18.228399]                                lock(&mtk_subdev->channels_lock);
> > [   18.228405]   lock(&scp->ipi_desc[i].lock);
> > [   18.264405]
> > 
> > To solve this, simply unlock the channels_lock mutex before calling
> > mtk_rpmsg_register_device() and relock it right after, as safety is
> > still ensured by the locking mechanism that happens right after
> > through SCP.
> > Notably, mtk_rpmsg_register_device() does not even require locking.
> > 
> > Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> > Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> 
> Friendly ping for an important fix... :)
>

This seems to have slipped through the cracks...  I will take a look later
today.

Thanks,
Mathieu

> > ---
> >   drivers/rpmsg/mtk_rpmsg.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> > index 5b4404b8be4c..d1213c33da20 100644
> > --- a/drivers/rpmsg/mtk_rpmsg.c
> > +++ b/drivers/rpmsg/mtk_rpmsg.c
> > @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
> >   		if (info->registered)
> >   			continue;
> > +		mutex_unlock(&subdev->channels_lock);
> >   		ret = mtk_rpmsg_register_device(subdev, &info->info);
> > +		mutex_lock(&subdev->channels_lock);
> >   		if (ret) {
> >   			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
> >   			continue;
> 
>

Mathieu Poirier Feb. 17, 2022, 7:03 p.m. UTC | #3

Hi Angelo,

On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
> During execution of the worker that's used to register rpmsg devices
> we are safely locking the channels mutex but, when creating a new
> endpoint for such devices, we are registering a IPI on the SCP, which
> then makes the SCP to trigger an interrupt, lock its own mutex and in
> turn register more subdevices.
> This creates a circular locking dependency situation, as the mtk_rpmsg
> channels_lock will then depend on the SCP IPI lock.
> 
> [   18.014514]  Possible unsafe locking scenario:
> [   18.014515]        CPU0                    CPU1
> [   18.014517]        ----                    ----
> [   18.045467]   lock(&mtk_subdev->channels_lock);
> [   18.045474]                                lock(&scp->ipi_desc[i].lock);

I spent well over an hour tracing through the meanders of the code to end up in
scp_ipi_register() which, I think, leads to the above.  But from there I don't
see how an IPI can come in and that tells me my assumption is wrong.

Can you give more details on the events that lead to the above?  I'm not saying
there is no problem, I just need to understand it.

Thanks,
Mathieu

> [   18.228399]                                lock(&mtk_subdev->channels_lock);
> [   18.228405]   lock(&scp->ipi_desc[i].lock);
> [   18.264405]
> 
> To solve this, simply unlock the channels_lock mutex before calling
> mtk_rpmsg_register_device() and relock it right after, as safety is
> still ensured by the locking mechanism that happens right after
> through SCP.
> Notably, mtk_rpmsg_register_device() does not even require locking.
> 
> Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> ---
>  drivers/rpmsg/mtk_rpmsg.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> index 5b4404b8be4c..d1213c33da20 100644
> --- a/drivers/rpmsg/mtk_rpmsg.c
> +++ b/drivers/rpmsg/mtk_rpmsg.c
> @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
>  		if (info->registered)
>  			continue;
>  
> +		mutex_unlock(&subdev->channels_lock);
>  		ret = mtk_rpmsg_register_device(subdev, &info->info);
> +		mutex_lock(&subdev->channels_lock);
>  		if (ret) {
>  			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
>  			continue;
> -- 
> 2.33.1
>

AngeloGioacchino Del Regno Feb. 18, 2022, 9:16 a.m. UTC | #4

Il 17/02/22 20:03, Mathieu Poirier ha scritto:
> Hi Angelo,
> 
> On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
>> During execution of the worker that's used to register rpmsg devices
>> we are safely locking the channels mutex but, when creating a new
>> endpoint for such devices, we are registering a IPI on the SCP, which
>> then makes the SCP to trigger an interrupt, lock its own mutex and in
>> turn register more subdevices.
>> This creates a circular locking dependency situation, as the mtk_rpmsg
>> channels_lock will then depend on the SCP IPI lock.
>>
>> [   18.014514]  Possible unsafe locking scenario:
>> [   18.014515]        CPU0                    CPU1
>> [   18.014517]        ----                    ----
>> [   18.045467]   lock(&mtk_subdev->channels_lock);
>> [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> 
> I spent well over an hour tracing through the meanders of the code to end up in
> scp_ipi_register() which, I think, leads to the above.  But from there I don't
> see how an IPI can come in and that tells me my assumption is wrong.
> 
> Can you give more details on the events that lead to the above?  I'm not saying
> there is no problem, I just need to understand it.
> 

Hi Mathieu,

I understand that following this flow without the assistance of the actual
hardware may be a little confusing, so, no worries.

drivers/remoteproc/mtk_scp.c - this driver manages the SCP (obviously, a
remote processor)
drivers/remoteproc/mtk_scp_ipi.c - public functions for kernel SCP IPC

Flow:
- MediaTek SCP gets probed
- RPMSG starts, we start probing "something", like google,cros-ec-rpmsg
- mtk_rpmsg: creates endpoint; IPI handler is registered here.

          ( more flow )

- mtk_rpmsg: mtk_rpmsg_ns_cb() -> mtk_rpmsg_create_device(), channel is
              added to the channels list, worker gets scheduled


Now for the part that produces the real issue:

label_a:

*** RPMSG MUTEX LOCK ***
- mtk_rpmsg: ## Go through multiple channels ##, call mtk_rpmsg_register_device()

- Registered device tries to communicate through RPMSG
- .send() or .trysend() (depending on the device) is called: send_ipi()
     *** SCP MUTEX LOCK ***
    - mtk_scp_ipi: Data written, ACK? ok -> return 0
     *** SCP MUTEX UNLOCK ***

- mtk_scp_ipi: **** INTERRUPT!!! **** New RPMSG NS available? -> create channel
           goto label_a;

*** RPMSG MUTEX UNLOCK ***


Pardon me for keeping some things in this flow implicit, but that was done to
simplify it as much as possible as to try to make you understand the situation.

Cheers,
Angelo

> Thanks,
> Mathieu
> 
>> [   18.228399]                                lock(&mtk_subdev->channels_lock);
>> [   18.228405]   lock(&scp->ipi_desc[i].lock);
>> [   18.264405]
>>
>> To solve this, simply unlock the channels_lock mutex before calling
>> mtk_rpmsg_register_device() and relock it right after, as safety is
>> still ensured by the locking mechanism that happens right after
>> through SCP.
>> Notably, mtk_rpmsg_register_device() does not even require locking.
>>
>> Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>> ---
>>   drivers/rpmsg/mtk_rpmsg.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
>> index 5b4404b8be4c..d1213c33da20 100644
>> --- a/drivers/rpmsg/mtk_rpmsg.c
>> +++ b/drivers/rpmsg/mtk_rpmsg.c
>> @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
>>   		if (info->registered)
>>   			continue;
>>   
>> +		mutex_unlock(&subdev->channels_lock);
>>   		ret = mtk_rpmsg_register_device(subdev, &info->info);
>> +		mutex_lock(&subdev->channels_lock);
>>   		if (ret) {
>>   			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
>>   			continue;
>> -- 
>> 2.33.1
>>

Mathieu Poirier Feb. 18, 2022, 6:01 p.m. UTC | #5

On Fri, Feb 18, 2022 at 10:16:51AM +0100, AngeloGioacchino Del Regno wrote:
> Il 17/02/22 20:03, Mathieu Poirier ha scritto:
> > Hi Angelo,
> > 
> > On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
> > > During execution of the worker that's used to register rpmsg devices
> > > we are safely locking the channels mutex but, when creating a new
> > > endpoint for such devices, we are registering a IPI on the SCP, which
> > > then makes the SCP to trigger an interrupt, lock its own mutex and in
> > > turn register more subdevices.
> > > This creates a circular locking dependency situation, as the mtk_rpmsg
> > > channels_lock will then depend on the SCP IPI lock.
> > > 
> > > [   18.014514]  Possible unsafe locking scenario:
> > > [   18.014515]        CPU0                    CPU1
> > > [   18.014517]        ----                    ----
> > > [   18.045467]   lock(&mtk_subdev->channels_lock);
> > > [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> > 
> > I spent well over an hour tracing through the meanders of the code to end up in
> > scp_ipi_register() which, I think, leads to the above.  But from there I don't
> > see how an IPI can come in and that tells me my assumption is wrong.
> > 
> > Can you give more details on the events that lead to the above?  I'm not saying
> > there is no problem, I just need to understand it.
> > 
> 
> Hi Mathieu,
> 
> I understand that following this flow without the assistance of the actual
> hardware may be a little confusing, so, no worries.
> 
> drivers/remoteproc/mtk_scp.c - this driver manages the SCP (obviously, a
> remote processor)
> drivers/remoteproc/mtk_scp_ipi.c - public functions for kernel SCP IPC
> 
> Flow:
> - MediaTek SCP gets probed
> - RPMSG starts, we start probing "something", like google,cros-ec-rpmsg
> - mtk_rpmsg: creates endpoint; IPI handler is registered here.
> 
>          ( more flow )
> 
> - mtk_rpmsg: mtk_rpmsg_ns_cb() -> mtk_rpmsg_create_device(), channel is
>              added to the channels list, worker gets scheduled

To me the above is out of order.  The name space endpoint is registered as part
of the remote processor start sequence.  From there an IPI with ns_ipi_id comes in
and then cros_ec_rpmsg_probe() is called.  The above seems to imply the
opposite.

>
> 
> Now for the part that produces the real issue:
> 
> label_a:
> 
> *** RPMSG MUTEX LOCK ***

By this I take you mean the subdev->channels_lock mutex.

> - mtk_rpmsg: ## Go through multiple channels ##, call mtk_rpmsg_register_device()
> 
> - Registered device tries to communicate through RPMSG
> - .send() or .trysend() (depending on the device) is called: send_ipi()
>     *** SCP MUTEX LOCK ***

And this one is either scp->send_lock or scp->ipi_desc[i].lock.

>    - mtk_scp_ipi: Data written, ACK? ok -> return 0
>     *** SCP MUTEX UNLOCK ***
> 
> - mtk_scp_ipi: **** INTERRUPT!!! **** New RPMSG NS available? -> create channel
>           goto label_a;
> 
> *** RPMSG MUTEX UNLOCK ***
> 
> 
> Pardon me for keeping some things in this flow implicit, but that was done to
> simplify it as much as possible as to try to make you understand the situation.

I certainly appreciate the effort but the above does not provide me with a clear
path that causes the lock to happen.  As I said in my last reply I don't doubt
there is a lock contention but the provided information doesn't allow to
understand how it happens. 

All I am looking for is one scenario with all mutexes and functions calls
involved.

Thanks,
Mathieu

> 
> Cheers,
> Angelo
> 
> > Thanks,
> > Mathieu
> > 
> > > [   18.228399]                                lock(&mtk_subdev->channels_lock);
> > > [   18.228405]   lock(&scp->ipi_desc[i].lock);
> > > [   18.264405]
> > > 
> > > To solve this, simply unlock the channels_lock mutex before calling
> > > mtk_rpmsg_register_device() and relock it right after, as safety is
> > > still ensured by the locking mechanism that happens right after
> > > through SCP.
> > > Notably, mtk_rpmsg_register_device() does not even require locking.
> > > 
> > > Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> > > Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> > > ---
> > >   drivers/rpmsg/mtk_rpmsg.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> > > index 5b4404b8be4c..d1213c33da20 100644
> > > --- a/drivers/rpmsg/mtk_rpmsg.c
> > > +++ b/drivers/rpmsg/mtk_rpmsg.c
> > > @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
> > >   		if (info->registered)
> > >   			continue;
> > > +		mutex_unlock(&subdev->channels_lock);
> > >   		ret = mtk_rpmsg_register_device(subdev, &info->info);
> > > +		mutex_lock(&subdev->channels_lock);
> > >   		if (ret) {
> > >   			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
> > >   			continue;
> > > -- 
> > > 2.33.1
> > > 
> 
> 
>

AngeloGioacchino Del Regno March 29, 2022, 10:55 a.m. UTC | #6

Il 18/02/22 19:01, Mathieu Poirier ha scritto:
> On Fri, Feb 18, 2022 at 10:16:51AM +0100, AngeloGioacchino Del Regno wrote:
>> Il 17/02/22 20:03, Mathieu Poirier ha scritto:
>>> Hi Angelo,
>>>
>>> On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
>>>> During execution of the worker that's used to register rpmsg devices
>>>> we are safely locking the channels mutex but, when creating a new
>>>> endpoint for such devices, we are registering a IPI on the SCP, which
>>>> then makes the SCP to trigger an interrupt, lock its own mutex and in
>>>> turn register more subdevices.
>>>> This creates a circular locking dependency situation, as the mtk_rpmsg
>>>> channels_lock will then depend on the SCP IPI lock.
>>>>
>>>> [   18.014514]  Possible unsafe locking scenario:
>>>> [   18.014515]        CPU0                    CPU1
>>>> [   18.014517]        ----                    ----
>>>> [   18.045467]   lock(&mtk_subdev->channels_lock);
>>>> [   18.045474]                                lock(&scp->ipi_desc[i].lock);
>>>
>>> I spent well over an hour tracing through the meanders of the code to end up in
>>> scp_ipi_register() which, I think, leads to the above.  But from there I don't
>>> see how an IPI can come in and that tells me my assumption is wrong.
>>>
>>> Can you give more details on the events that lead to the above?  I'm not saying
>>> there is no problem, I just need to understand it.
>>>
>>
>> Hi Mathieu,
>>
>> I understand that following this flow without the assistance of the actual
>> hardware may be a little confusing, so, no worries.
>>
>> drivers/remoteproc/mtk_scp.c - this driver manages the SCP (obviously, a
>> remote processor)
>> drivers/remoteproc/mtk_scp_ipi.c - public functions for kernel SCP IPC
>>
>> Flow:
>> - MediaTek SCP gets probed
>> - RPMSG starts, we start probing "something", like google,cros-ec-rpmsg
>> - mtk_rpmsg: creates endpoint; IPI handler is registered here.
>>
>>           ( more flow )
>>
>> - mtk_rpmsg: mtk_rpmsg_ns_cb() -> mtk_rpmsg_create_device(), channel is
>>               added to the channels list, worker gets scheduled
> 
> To me the above is out of order.  The name space endpoint is registered as part
> of the remote processor start sequence.  From there an IPI with ns_ipi_id comes in
> and then cros_ec_rpmsg_probe() is called.  The above seems to imply the
> opposite.
> 
>>
>>
>> Now for the part that produces the real issue:
>>
>> label_a:
>>
>> *** RPMSG MUTEX LOCK ***
> 
> By this I take you mean the subdev->channels_lock mutex.
> 
>> - mtk_rpmsg: ## Go through multiple channels ##, call mtk_rpmsg_register_device()
>>
>> - Registered device tries to communicate through RPMSG
>> - .send() or .trysend() (depending on the device) is called: send_ipi()
>>      *** SCP MUTEX LOCK ***
> 
> And this one is either scp->send_lock or scp->ipi_desc[i].lock.
> 
>>     - mtk_scp_ipi: Data written, ACK? ok -> return 0
>>      *** SCP MUTEX UNLOCK ***
>>
>> - mtk_scp_ipi: **** INTERRUPT!!! **** New RPMSG NS available? -> create channel
>>            goto label_a;
>>
>> *** RPMSG MUTEX UNLOCK ***
>>
>>
>> Pardon me for keeping some things in this flow implicit, but that was done to
>> simplify it as much as possible as to try to make you understand the situation.
> 
> I certainly appreciate the effort but the above does not provide me with a clear
> path that causes the lock to happen.  As I said in my last reply I don't doubt
> there is a lock contention but the provided information doesn't allow to
> understand how it happens.
> 
> All I am looking for is one scenario with all mutexes and functions calls
> involved.
> 

Hello Mathieu,
I'm sorry for leaving this unresolved for a long time, had to work on other
things in the meanwhile.

I'm not sure what you need, can you please help me to give you the
issue background that you require?

In the meanwhile, here's full debugging info coming from the kmsg:

https://paste.debian.net/1235967/

Thanks,
Angelo

> Thanks,
> Mathieu
> 
>>
>> Cheers,
>> Angelo
>>
>>> Thanks,
>>> Mathieu
>>>
>>>> [   18.228399]                                lock(&mtk_subdev->channels_lock);
>>>> [   18.228405]   lock(&scp->ipi_desc[i].lock);
>>>> [   18.264405]
>>>>
>>>> To solve this, simply unlock the channels_lock mutex before calling
>>>> mtk_rpmsg_register_device() and relock it right after, as safety is
>>>> still ensured by the locking mechanism that happens right after
>>>> through SCP.
>>>> Notably, mtk_rpmsg_register_device() does not even require locking.
>>>>
>>>> Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>>> ---
>>>>    drivers/rpmsg/mtk_rpmsg.c | 2 ++
>>>>    1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
>>>> index 5b4404b8be4c..d1213c33da20 100644
>>>> --- a/drivers/rpmsg/mtk_rpmsg.c
>>>> +++ b/drivers/rpmsg/mtk_rpmsg.c
>>>> @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
>>>>    		if (info->registered)
>>>>    			continue;
>>>> +		mutex_unlock(&subdev->channels_lock);
>>>>    		ret = mtk_rpmsg_register_device(subdev, &info->info);
>>>> +		mutex_lock(&subdev->channels_lock);
>>>>    		if (ret) {
>>>>    			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
>>>>    			continue;
>>>> -- 
>>>> 2.33.1
>>>>
>>
>>
>>

Mathieu Poirier March 29, 2022, 3:08 p.m. UTC | #7

On Tue, Mar 29, 2022 at 12:55:33PM +0200, AngeloGioacchino Del Regno wrote:
> Il 18/02/22 19:01, Mathieu Poirier ha scritto:
> > On Fri, Feb 18, 2022 at 10:16:51AM +0100, AngeloGioacchino Del Regno wrote:
> > > Il 17/02/22 20:03, Mathieu Poirier ha scritto:
> > > > Hi Angelo,
> > > > 
> > > > On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
> > > > > During execution of the worker that's used to register rpmsg devices
> > > > > we are safely locking the channels mutex but, when creating a new
> > > > > endpoint for such devices, we are registering a IPI on the SCP, which
> > > > > then makes the SCP to trigger an interrupt, lock its own mutex and in
> > > > > turn register more subdevices.
> > > > > This creates a circular locking dependency situation, as the mtk_rpmsg
> > > > > channels_lock will then depend on the SCP IPI lock.
> > > > > 
> > > > > [   18.014514]  Possible unsafe locking scenario:
> > > > > [   18.014515]        CPU0                    CPU1
> > > > > [   18.014517]        ----                    ----
> > > > > [   18.045467]   lock(&mtk_subdev->channels_lock);
> > > > > [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> > > > 
> > > > I spent well over an hour tracing through the meanders of the code to end up in
> > > > scp_ipi_register() which, I think, leads to the above.  But from there I don't
> > > > see how an IPI can come in and that tells me my assumption is wrong.
> > > > 
> > > > Can you give more details on the events that lead to the above?  I'm not saying
> > > > there is no problem, I just need to understand it.
> > > > 
> > > 
> > > Hi Mathieu,
> > > 
> > > I understand that following this flow without the assistance of the actual
> > > hardware may be a little confusing, so, no worries.
> > > 
> > > drivers/remoteproc/mtk_scp.c - this driver manages the SCP (obviously, a
> > > remote processor)
> > > drivers/remoteproc/mtk_scp_ipi.c - public functions for kernel SCP IPC
> > > 
> > > Flow:
> > > - MediaTek SCP gets probed
> > > - RPMSG starts, we start probing "something", like google,cros-ec-rpmsg
> > > - mtk_rpmsg: creates endpoint; IPI handler is registered here.
> > > 
> > >           ( more flow )
> > > 
> > > - mtk_rpmsg: mtk_rpmsg_ns_cb() -> mtk_rpmsg_create_device(), channel is
> > >               added to the channels list, worker gets scheduled
> > 
> > To me the above is out of order.  The name space endpoint is registered as part
> > of the remote processor start sequence.  From there an IPI with ns_ipi_id comes in
> > and then cros_ec_rpmsg_probe() is called.  The above seems to imply the
> > opposite.
> > 
> > > 
> > > 
> > > Now for the part that produces the real issue:
> > > 
> > > label_a:
> > > 
> > > *** RPMSG MUTEX LOCK ***
> > 
> > By this I take you mean the subdev->channels_lock mutex.
> > 
> > > - mtk_rpmsg: ## Go through multiple channels ##, call mtk_rpmsg_register_device()
> > > 
> > > - Registered device tries to communicate through RPMSG
> > > - .send() or .trysend() (depending on the device) is called: send_ipi()
> > >      *** SCP MUTEX LOCK ***
> > 
> > And this one is either scp->send_lock or scp->ipi_desc[i].lock.
> > 
> > >     - mtk_scp_ipi: Data written, ACK? ok -> return 0
> > >      *** SCP MUTEX UNLOCK ***
> > > 
> > > - mtk_scp_ipi: **** INTERRUPT!!! **** New RPMSG NS available? -> create channel
> > >            goto label_a;
> > > 
> > > *** RPMSG MUTEX UNLOCK ***
> > > 
> > > 
> > > Pardon me for keeping some things in this flow implicit, but that was done to
> > > simplify it as much as possible as to try to make you understand the situation.
> > 
> > I certainly appreciate the effort but the above does not provide me with a clear
> > path that causes the lock to happen.  As I said in my last reply I don't doubt
> > there is a lock contention but the provided information doesn't allow to
> > understand how it happens.
> > 
> > All I am looking for is one scenario with all mutexes and functions calls
> > involved.
> > 
> 
> Hello Mathieu,
> I'm sorry for leaving this unresolved for a long time, had to work on other
> things in the meanwhile.
>

I also had to move on and as such details related to this patch have mostly
vanished from my memory.

> I'm not sure what you need, can you please help me to give you the
> issue background that you require?
> 
> In the meanwhile, here's full debugging info coming from the kmsg:
> 
> https://paste.debian.net/1235967/

I will review this pastebin along with the explanation you had provided earlier
in this thread - maybe I'll get it this time... But that will take some time as
I have 5 other patchsets to review before looking at this again.

> 
> Thanks,
> Angelo
> 
> > Thanks,
> > Mathieu
> > 
> > > 
> > > Cheers,
> > > Angelo
> > > 
> > > > Thanks,
> > > > Mathieu
> > > > 
> > > > > [   18.228399]                                lock(&mtk_subdev->channels_lock);
> > > > > [   18.228405]   lock(&scp->ipi_desc[i].lock);
> > > > > [   18.264405]
> > > > > 
> > > > > To solve this, simply unlock the channels_lock mutex before calling
> > > > > mtk_rpmsg_register_device() and relock it right after, as safety is
> > > > > still ensured by the locking mechanism that happens right after
> > > > > through SCP.
> > > > > Notably, mtk_rpmsg_register_device() does not even require locking.
> > > > > 
> > > > > Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> > > > > Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> > > > > ---
> > > > >    drivers/rpmsg/mtk_rpmsg.c | 2 ++
> > > > >    1 file changed, 2 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> > > > > index 5b4404b8be4c..d1213c33da20 100644
> > > > > --- a/drivers/rpmsg/mtk_rpmsg.c
> > > > > +++ b/drivers/rpmsg/mtk_rpmsg.c
> > > > > @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
> > > > >    		if (info->registered)
> > > > >    			continue;
> > > > > +		mutex_unlock(&subdev->channels_lock);
> > > > >    		ret = mtk_rpmsg_register_device(subdev, &info->info);
> > > > > +		mutex_lock(&subdev->channels_lock);
> > > > >    		if (ret) {
> > > > >    			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
> > > > >    			continue;
> > > > > -- 
> > > > > 2.33.1
> > > > > 
> > > 
> > > 
> > >

Mathieu Poirier April 28, 2022, 5:31 p.m. UTC | #8

On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
> During execution of the worker that's used to register rpmsg devices
> we are safely locking the channels mutex but, when creating a new
> endpoint for such devices, we are registering a IPI on the SCP, which
> then makes the SCP to trigger an interrupt, lock its own mutex and in
> turn register more subdevices.
> This creates a circular locking dependency situation, as the mtk_rpmsg
> channels_lock will then depend on the SCP IPI lock.
> 
> [   18.014514]  Possible unsafe locking scenario:
> [   18.014515]        CPU0                    CPU1
> [   18.014517]        ----                    ----
> [   18.045467]   lock(&mtk_subdev->channels_lock);
> [   18.045474]                                lock(&scp->ipi_desc[i].lock);
> [   18.228399]                                lock(&mtk_subdev->channels_lock);
> [   18.228405]   lock(&scp->ipi_desc[i].lock);
> [   18.264405]

I finally understand the problem, something that would have been impossible
without the pastebin you provided in your latest email.  Please add the content
of that pastebin to the changelog and send another revision (checkpatch
warnings can be ignored).

> 
> To solve this, simply unlock the channels_lock mutex before calling
> mtk_rpmsg_register_device() and relock it right after, as safety is
> still ensured by the locking mechanism that happens right after
> through SCP.

The integrity of the subdev->channels list is guaranteed by relocking the
mutex, I'm not sure what "through SCP" adds to the sentence. 

> Notably, mtk_rpmsg_register_device() does not even require locking.
>

I don't agree with the above sentence - if locking doesn't happen in
mtk_rpmsg_create_device(), there can be two CPUs accessing the list at the same
time.

Thanks,
Mathieu

> Fixes: 7017996951fd ("rpmsg: add rpmsg support for mt8183 SCP.")
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> ---
>  drivers/rpmsg/mtk_rpmsg.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/rpmsg/mtk_rpmsg.c b/drivers/rpmsg/mtk_rpmsg.c
> index 5b4404b8be4c..d1213c33da20 100644
> --- a/drivers/rpmsg/mtk_rpmsg.c
> +++ b/drivers/rpmsg/mtk_rpmsg.c
> @@ -234,7 +234,9 @@ static void mtk_register_device_work_function(struct work_struct *register_work)
>  		if (info->registered)
>  			continue;
>  
> +		mutex_unlock(&subdev->channels_lock);
>  		ret = mtk_rpmsg_register_device(subdev, &info->info);
> +		mutex_lock(&subdev->channels_lock);
>  		if (ret) {
>  			dev_err(&pdev->dev, "Can't create rpmsg_device\n");
>  			continue;
> -- 
> 2.33.1
>

AngeloGioacchino Del Regno April 29, 2022, 1:37 p.m. UTC | #9

Il 28/04/22 19:31, Mathieu Poirier ha scritto:
> On Fri, Jan 14, 2022 at 03:47:37PM +0100, AngeloGioacchino Del Regno wrote:
>> During execution of the worker that's used to register rpmsg devices
>> we are safely locking the channels mutex but, when creating a new
>> endpoint for such devices, we are registering a IPI on the SCP, which
>> then makes the SCP to trigger an interrupt, lock its own mutex and in
>> turn register more subdevices.
>> This creates a circular locking dependency situation, as the mtk_rpmsg
>> channels_lock will then depend on the SCP IPI lock.
>>
>> [   18.014514]  Possible unsafe locking scenario:
>> [   18.014515]        CPU0                    CPU1
>> [   18.014517]        ----                    ----
>> [   18.045467]   lock(&mtk_subdev->channels_lock);
>> [   18.045474]                                lock(&scp->ipi_desc[i].lock);
>> [   18.228399]                                lock(&mtk_subdev->channels_lock);
>> [   18.228405]   lock(&scp->ipi_desc[i].lock);
>> [   18.264405]
> 
> I finally understand the problem, something that would have been impossible
> without the pastebin you provided in your latest email.  Please add the content
> of that pastebin to the changelog and send another revision (checkpatch
> warnings can be ignored).
> 

Thanks for giving it another look... I'll add a cover letter with the contents
of the pastebin to avoid possible confusion for anyone looking at the patch.

>>
>> To solve this, simply unlock the channels_lock mutex before calling
>> mtk_rpmsg_register_device() and relock it right after, as safety is
>> still ensured by the locking mechanism that happens right after
>> through SCP.
> 
> The integrity of the subdev->channels list is guaranteed by relocking the
> mutex, I'm not sure what "through SCP" adds to the sentence.

I'll clarify the commit message.

> 
>> Notably, mtk_rpmsg_register_device() does not even require locking.
>>
> 
> I don't agree with the above sentence - if locking doesn't happen in
> mtk_rpmsg_create_device(), there can be two CPUs accessing the list at the same
> time.
> 

That's right, I have largely underestimated that for some reason, sorry about that.

Regards,
Angelo

> Thanks,
> Mathieu
>

rpmsg: mtk_rpmsg: Fix circular locking dependency

Commit Message

Comments

Patch