diff mbox series

usb: typec: ucsi: Clear pending after acking connector change

Message ID 20210516040953.622409-1-bjorn.andersson@linaro.org (mailing list archive)
State Accepted
Commit 8c9b3caab3ac26db1da00b8117901640c55a69dd
Headers show
Series usb: typec: ucsi: Clear pending after acking connector change | expand

Commit Message

Bjorn Andersson May 16, 2021, 4:09 a.m. UTC
It's possible that the interrupt handler for the UCSI driver signals a
connector changes after the handler clears the PENDING bit, but before
it has sent the acknowledge request. The result is that the handler is
invoked yet again, to ack the same connector change.

At least some versions of the Qualcomm UCSI firmware will not handle the
second - "spurious" - acknowledgment gracefully. So make sure to not
clear the pending flag until the change is acknowledged.

Any connector changes coming in after the acknowledgment, that would
have the pending flag incorrectly cleared, would afaict be covered by
the subsequent connector status check.

Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing change information")
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
---
 drivers/usb/typec/ucsi/ucsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Heikki Krogerus May 17, 2021, 10:03 a.m. UTC | #1
On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> It's possible that the interrupt handler for the UCSI driver signals a
> connector changes after the handler clears the PENDING bit, but before
> it has sent the acknowledge request. The result is that the handler is
> invoked yet again, to ack the same connector change.
> 
> At least some versions of the Qualcomm UCSI firmware will not handle the
> second - "spurious" - acknowledgment gracefully. So make sure to not
> clear the pending flag until the change is acknowledged.
> 
> Any connector changes coming in after the acknowledgment, that would
> have the pending flag incorrectly cleared, would afaict be covered by
> the subsequent connector status check.
> 
> Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing change information")
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

I'm OK with this if Bejamin does not see any problems with it. I'll
wait for his comments before giving my reviewed-by tag.

That workaround (commit 217504a05532) is unfortunately too fragile.
I'm going to now separate the processing of the connector state from
the event handler (interrupt handler). That way we should be fairly
sure we don't loose any of the connector states even if an event is
generated while we are still in the middle of processing the previous
one(s), and at the same time be sure that we also don't confuse the
firmware.

So the event handler shall after that only read the connector status,
schedule the unique job where it's processed and ACK the event.
Nothing else.

> ---
>  drivers/usb/typec/ucsi/ucsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
> index 282c3c825c13..f451ce0132a9 100644
> --- a/drivers/usb/typec/ucsi/ucsi.c
> +++ b/drivers/usb/typec/ucsi/ucsi.c
> @@ -694,8 +694,8 @@ static void ucsi_handle_connector_change(struct work_struct *work)
>  	ucsi_send_command(con->ucsi, command, NULL, 0);
>  
>  	/* 3. ACK connector change */
> -	clear_bit(EVENT_PENDING, &ucsi->flags);
>  	ret = ucsi_acknowledge_connector_change(ucsi);
> +	clear_bit(EVENT_PENDING, &ucsi->flags);
>  	if (ret) {
>  		dev_err(ucsi->dev, "%s: ACK failed (%d)", __func__, ret);
>  		goto out_unlock;
> -- 
> 2.29.2

thanks,
Heikki Krogerus May 17, 2021, 12:15 p.m. UTC | #2
Hi,

On Mon, May 17, 2021 at 01:03:12PM +0300, Heikki Krogerus wrote:
> On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> > It's possible that the interrupt handler for the UCSI driver signals a
> > connector changes after the handler clears the PENDING bit, but before
> > it has sent the acknowledge request. The result is that the handler is
> > invoked yet again, to ack the same connector change.
> > 
> > At least some versions of the Qualcomm UCSI firmware will not handle the
> > second - "spurious" - acknowledgment gracefully. So make sure to not
> > clear the pending flag until the change is acknowledged.
> > 
> > Any connector changes coming in after the acknowledgment, that would
> > have the pending flag incorrectly cleared, would afaict be covered by
> > the subsequent connector status check.
> > 
> > Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing change information")
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> 
> I'm OK with this if Bejamin does not see any problems with it. I'll
> wait for his comments before giving my reviewed-by tag.
> 
> That workaround (commit 217504a05532) is unfortunately too fragile.
> I'm going to now separate the processing of the connector state from
> the event handler (interrupt handler). That way we should be fairly
> sure we don't loose any of the connector states even if an event is
> generated while we are still in the middle of processing the previous
> one(s), and at the same time be sure that we also don't confuse the
> firmware.
> 
> So the event handler shall after that only read the connector status,
> schedule the unique job where it's processed and ACK the event.
> Nothing else.

Seems to be straightforward to implement. I'm attaching the patch I
made for that. I think it should actually also remove the problem you
are seeing. Can you test it?

thanks,
Benjamin Berg May 17, 2021, 12:36 p.m. UTC | #3
On Mon, 2021-05-17 at 13:03 +0300, Heikki Krogerus wrote:
> On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> > It's possible that the interrupt handler for the UCSI driver
> > signals a
> > connector changes after the handler clears the PENDING bit, but
> > before
> > it has sent the acknowledge request. The result is that the handler
> > is
> > invoked yet again, to ack the same connector change.
> > 
> > At least some versions of the Qualcomm UCSI firmware will not
> > handle the
> > second - "spurious" - acknowledgment gracefully. So make sure to
> > not
> > clear the pending flag untuntil the changeil the change is
> > acknowledged.
> > 
> > Any connector changes coming in after the acknowledgment, that
> > would
> > have the pending flag incorrectly cleared, would afaict be covered
> > by
> > the subsequent connector status check.
> > 
> > Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing
> > change information")
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> 
> I'm OK with this if Bejamin does not see any problems with it. I'll
> wait for his comments before giving my reviewed-by tag.

Interesting problem. I wonder if we can detect this situation and
restart the worker instead of sending the ACK.

As for the patch, I do believe it is safe. I agree with the assessment
in the commit message, that the subsequent connector check will detect
any race condition and recovers from it.

Acked-By: Benjamin Berg <bberg@redhat.com>

Benjamin

> That workaround (commit 217504a05532) is unfortunately too fragile.
> I'm going to now separate the processing of the connector state from
> the event handler (interrupt handler). That way we should be fairly
> sure we don't loose any of the connector states even if an event is
> generated while we are still in the middle of processing the previous
> one(s), and at the same time be sure that we also don't confuse the
> firmware.
> So the event handler shall after that only read the connector status,
> schedule the unique job where it's processed and ACK the event.
> Nothing else.
> 
> > ---
> >  drivers/usb/typec/ucsi/ucsi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/usb/typec/ucsi/ucsi.c
> > b/drivers/usb/typec/ucsi/ucsi.c
> > index 282c3c825c13..f451ce0132a9 100644
> > --- a/drivers/usb/typec/ucsi/ucsi.c
> > +++ b/drivers/usb/typec/ucsi/ucsi.c
> > @@ -694,8 +694,8 @@ static void ucsi_handle_connector_change(struct
> > work_struct *work)
> >         ucsi_send_command(con->ucsi, command, NULL, 0);
> >  
> >         /* 3. ACK connector change */
> > -       clear_bit(EVENT_PENDING, &ucsi->flags);
> >         ret = ucsi_acknowledge_connector_change(ucsi);
> > +       clear_bit(EVENT_PENDING, &ucsi->flags);
> >         if (ret) {
> >                 dev_err(ucsi->dev, "%s: ACK failed (%d)", __func__,
> > ret);
> >                 goto out_unlock;
> > -- 
> > 2.29.2
> 
> thanks,
>
Benjamin Berg May 17, 2021, 12:57 p.m. UTC | #4
Hi Heikki,


On Mon, 2021-05-17 at 15:15 +0300, Heikki Krogerus wrote:
> On Mon, May 17, 2021 at 01:03:12PM +0300, Heikki Krogerus wrote:
> > On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> > > It's possible that the interrupt handler for the UCSI driver
> > > signals a
> > > connector changes after the handler clears the PENDING bit, but
> > > before
> > > it has sent the acknowledge request. The result is that the handler
> > > is
> > > invoked yet again, to ack the same connector change.
> > > 
> > > At least some versions of the Qualcomm UCSI firmware will not
> > > handle the
> > > second - "spurious" - acknowledgment gracefully. So make sure to
> > > not
> > > clear the pending flag until the change is acknowledged.
> > > 
> > > Any connector changes coming in after the acknowledgment, that
> > > would
> > > have the pending flag incorrectly cleared, would afaict be covered
> > > by
> > > the subsequent connector status check.
> > > 
> > > Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing
> > > change information")
> > > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> > 
> > I'm OK with this if Bejamin does not see any problems with it. I'll
> > wait for his comments before giving my reviewed-by tag.
> > 
> > That workaround (commit 217504a05532) is unfortunately too fragile.
> > I'm going to now separate the processing of the connector state from
> > the event handler (interrupt handler). That way we should be fairly
> > sure we don't loose any of the connector states even if an event is
> > generated while we are still in the middle of processing the previous
> > one(s), and at the same time be sure that we also don't confuse the
> > firmware.
> > 
> > So the event handler shall after that only read the connector status,
> > schedule the unique job where it's processed and ACK the event.
> > Nothing else.
> 
> Seems to be straightforward to implement. I'm attaching the patch I
> made for that. I think it should actually also remove the problem you
> are seeing. Can you test it?

Hmm, I am happy to try the patch tomorrow in the scenario where I
observed my race condition.

Unfortunately, I don't feel it'll work. The problem that I was seeing
looked like a race condition in the PPM itself, where the window is the
time between the UCSI_GET_CONNECTOR_STATUS command and the subsequent
ACK.
For such a firmware level bug in the PPM, we need a way to detect the
race condition when it happens (or get a fix for the firmware).

Note that there was also some code to always sent a
UCSI_GET_CAM_SUPPORTED command "to ensure the PPM does not get stuck in
case it assumes we do". I have no idea where this came from or what
PPMs might require this workaround. But I doubt it is a good idea to
drop it.

Benjamin
Heikki Krogerus May 18, 2021, 1:21 p.m. UTC | #5
On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> It's possible that the interrupt handler for the UCSI driver signals a
> connector changes after the handler clears the PENDING bit, but before
> it has sent the acknowledge request. The result is that the handler is
> invoked yet again, to ack the same connector change.
> 
> At least some versions of the Qualcomm UCSI firmware will not handle the
> second - "spurious" - acknowledgment gracefully. So make sure to not
> clear the pending flag until the change is acknowledged.
> 
> Any connector changes coming in after the acknowledgment, that would
> have the pending flag incorrectly cleared, would afaict be covered by
> the subsequent connector status check.
> 
> Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing change information")
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>

> ---
>  drivers/usb/typec/ucsi/ucsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
> index 282c3c825c13..f451ce0132a9 100644
> --- a/drivers/usb/typec/ucsi/ucsi.c
> +++ b/drivers/usb/typec/ucsi/ucsi.c
> @@ -694,8 +694,8 @@ static void ucsi_handle_connector_change(struct work_struct *work)
>  	ucsi_send_command(con->ucsi, command, NULL, 0);
>  
>  	/* 3. ACK connector change */
> -	clear_bit(EVENT_PENDING, &ucsi->flags);
>  	ret = ucsi_acknowledge_connector_change(ucsi);
> +	clear_bit(EVENT_PENDING, &ucsi->flags);
>  	if (ret) {
>  		dev_err(ucsi->dev, "%s: ACK failed (%d)", __func__, ret);
>  		goto out_unlock;
> -- 
> 2.29.2
Heikki Krogerus May 18, 2021, 1:29 p.m. UTC | #6
On Mon, May 17, 2021 at 02:57:28PM +0200, Benjamin Berg wrote:
> Hi Heikki,
> 
> 
> On Mon, 2021-05-17 at 15:15 +0300, Heikki Krogerus wrote:
> > On Mon, May 17, 2021 at 01:03:12PM +0300, Heikki Krogerus wrote:
> > > On Sat, May 15, 2021 at 09:09:53PM -0700, Bjorn Andersson wrote:
> > > > It's possible that the interrupt handler for the UCSI driver
> > > > signals a
> > > > connector changes after the handler clears the PENDING bit, but
> > > > before
> > > > it has sent the acknowledge request. The result is that the handler
> > > > is
> > > > invoked yet again, to ack the same connector change.
> > > > 
> > > > At least some versions of the Qualcomm UCSI firmware will not
> > > > handle the
> > > > second - "spurious" - acknowledgment gracefully. So make sure to
> > > > not
> > > > clear the pending flag until the change is acknowledged.
> > > > 
> > > > Any connector changes coming in after the acknowledgment, that
> > > > would
> > > > have the pending flag incorrectly cleared, would afaict be covered
> > > > by
> > > > the subsequent connector status check.
> > > > 
> > > > Fixes: 217504a05532 ("usb: typec: ucsi: Work around PPM losing
> > > > change information")
> > > > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> > > 
> > > I'm OK with this if Bejamin does not see any problems with it. I'll
> > > wait for his comments before giving my reviewed-by tag.
> > > 
> > > That workaround (commit 217504a05532) is unfortunately too fragile.
> > > I'm going to now separate the processing of the connector state from
> > > the event handler (interrupt handler). That way we should be fairly
> > > sure we don't loose any of the connector states even if an event is
> > > generated while we are still in the middle of processing the previous
> > > one(s), and at the same time be sure that we also don't confuse the
> > > firmware.
> > > 
> > > So the event handler shall after that only read the connector status,
> > > schedule the unique job where it's processed and ACK the event.
> > > Nothing else.
> > 
> > Seems to be straightforward to implement. I'm attaching the patch I
> > made for that. I think it should actually also remove the problem you
> > are seeing. Can you test it?
> 
> Hmm, I am happy to try the patch tomorrow in the scenario where I
> observed my race condition.
> 
> Unfortunately, I don't feel it'll work. The problem that I was seeing
> looked like a race condition in the PPM itself, where the window is the
> time between the UCSI_GET_CONNECTOR_STATUS command and the subsequent
> ACK.
> For such a firmware level bug in the PPM, we need a way to detect the
> race condition when it happens (or get a fix for the firmware).

OK. Let me know does the patch bring the issue back for you.

> Note that there was also some code to always sent a
> UCSI_GET_CAM_SUPPORTED command "to ensure the PPM does not get stuck in
> case it assumes we do". I have no idea where this came from or what
> PPMs might require this workaround. But I doubt it is a good idea to
> drop it.

Sure.

thanks,
Benjamin Berg May 18, 2021, 6:04 p.m. UTC | #7
On Tue, 2021-05-18 at 16:29 +0300, Heikki Krogerus wrote:
> On Mon, May 17, 2021 at 02:57:28PM +0200, Benjamin Berg wrote:
> > 
> > [SNIP]
> > Unfortunately, I don't feel it'll work. The problem that I was
> > seeing
> > looked like a race condition in the PPM itself, where the window is
> > the
> > time between the UCSI_GET_CONNECTOR_STATUS command and the
> > subsequent
> > ACK.
> > For such a firmware level bug in the PPM, we need a way to detect
> > the
> > race condition when it happens (or get a fix for the firmware).
> 
> OK. Let me know does the patch bring the issue back for you.

So, I just tried the patch, and I can occasionally reproduce the issue
where "online" for the ucsi power adapter is stuck at "1" after
unplugging with the patch applied.

Benjamin
Heikki Krogerus May 19, 2021, 11:33 a.m. UTC | #8
On Tue, May 18, 2021 at 08:04:14PM +0200, Benjamin Berg wrote:
> On Tue, 2021-05-18 at 16:29 +0300, Heikki Krogerus wrote:
> > On Mon, May 17, 2021 at 02:57:28PM +0200, Benjamin Berg wrote:
> > > 
> > > [SNIP]
> > > Unfortunately, I don't feel it'll work. The problem that I was
> > > seeing
> > > looked like a race condition in the PPM itself, where the window is
> > > the
> > > time between the UCSI_GET_CONNECTOR_STATUS command and the
> > > subsequent
> > > ACK.
> > > For such a firmware level bug in the PPM, we need a way to detect
> > > the
> > > race condition when it happens (or get a fix for the firmware).
> > 
> > OK. Let me know does the patch bring the issue back for you.
> 
> So, I just tried the patch, and I can occasionally reproduce the issue
> where "online" for the ucsi power adapter is stuck at "1" after
> unplugging with the patch applied.

Thanks for testing it.

I'm still not sure that the PPM is the culprit here. I have a feeling
that the problem you are seeing is caused by the workaround (bad
workaround) that we have for the issue where the EC firmware does not
return with the BUSY bit set in the CCI when it should in many cases.
The UCSI ACPI driver has one minute timeout value for command
completion because of that, which is way too long. So if the EC
firmware decides to take its time before acknowledging command, the
driver is stuck, and we start loosing the events... Well, I guess
technically the PPM would be the culprit in the end in any case, but
I'm just not sure that there is any race like you suspected.

But this is off topic. I'll send you an RFC proposal what I think we
could do about that.


thanks,
Benjamin Berg May 19, 2021, 11:50 a.m. UTC | #9
On Wed, 2021-05-19 at 14:33 +0300, Heikki Krogerus wrote:
> On Tue, May 18, 2021 at 08:04:14PM +0200, Benjamin Berg wrote:
> > On Tue, 2021-05-18 at 16:29 +0300, Heikki Krogerus wrote:
> > > On Mon, May 17, 2021 at 02:57:28PM +0200, Benjamin Berg wrote:
> > > > 
> > > > [SNIP]
> > > > Unfortunately, I don't feel it'll work. The problem that I was
> > > > seeing
> > > > looked like a race condition in the PPM itself, where the
> > > > window is
> > > > the
> > > > time between the UCSI_GET_CONNECTOR_STATUS command and the
> > > > subsequent
> > > > ACK.
> > > > For such a firmware level bug in the PPM, we need a way to
> > > > detect
> > > > the
> > > > race condition when it happens (or get a fix for the firmware).
> > > 
> > > OK. Let me know does the patch bring the issue back for you.
> > 
> > So, I just tried the patch, and I can occasionally reproduce the
> > issue
> > where "online" for the ucsi power adapter is stuck at "1" after
> > unplugging with the patch applied.
> 
> Thanks for testing it.
> 
> I'm still not sure that the PPM is the culprit here. I have a feeling
> that the problem you are seeing is caused by the workaround (bad
> workaround) that we have for the issue where the EC firmware does not
> return with the BUSY bit set in the CCI when it should in many cases.
> The UCSI ACPI driver has one minute timeout value for command
> completion because of that, which is way too long. So if the EC
> firmware decides to take its time before acknowledging command, the
> driver is stuck, and we start loosing the events...

Hmm, interesting, yes. If the PPM is sending a second change event
before or after we ACK'ed the first one, and we loose this event, then
that would absolutely explain the issue I am seeing.

In my case, I was able to considerably increase the probability of the
bug by inserting a sleep just before acknowledging the connector
change. Not sure whether this is meaningful, but maybe that tells us
something about who is at fault here.

Benjamin

> Well, I guess
> technically the PPM would be the culprit in the end in any case, but
> I'm just not sure that there is any race like you suspected.
> 
> But this is off topic. I'll send you an RFC proposal what I think we
> could do about that.
> 
> 
> thanks,
>
diff mbox series

Patch

diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
index 282c3c825c13..f451ce0132a9 100644
--- a/drivers/usb/typec/ucsi/ucsi.c
+++ b/drivers/usb/typec/ucsi/ucsi.c
@@ -694,8 +694,8 @@  static void ucsi_handle_connector_change(struct work_struct *work)
 	ucsi_send_command(con->ucsi, command, NULL, 0);
 
 	/* 3. ACK connector change */
-	clear_bit(EVENT_PENDING, &ucsi->flags);
 	ret = ucsi_acknowledge_connector_change(ucsi);
+	clear_bit(EVENT_PENDING, &ucsi->flags);
 	if (ret) {
 		dev_err(ucsi->dev, "%s: ACK failed (%d)", __func__, ret);
 		goto out_unlock;