diff mbox series

dirvers: scmi: poll again when transfer reach timeout

Message ID 20250123083323.2363749-1-jackhuang021@gmail.com (mailing list archive)
State New
Headers show
Series dirvers: scmi: poll again when transfer reach timeout | expand

Commit Message

jack21 Jan. 23, 2025, 8:33 a.m. UTC
From: Huangjie <huangjie1663@phytium.com.cn>

spin_until_cond() not really hold a spin lock, possible timeout may occur
in preemption kernel when preempted after spin_until_cond().

We check status again when reach timeout is reached to prevent incorrect
jugement of timeout.

Signed-off-by: Huangjie <huangjie1663@phytium.com.cn>
---
 drivers/firmware/arm_scmi/driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Dan Carpenter Jan. 23, 2025, 10:38 a.m. UTC | #1
s/dirvers/drivers/

On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> From: Huangjie <huangjie1663@phytium.com.cn>
> 
> spin_until_cond() not really hold a spin lock, possible timeout may occur
> in preemption kernel when preempted after spin_until_cond().
> 
> We check status again when reach timeout is reached to prevent incorrect
> jugement of timeout.
> 

I suspect the real issue is that we exit the spin loop when
try_wait_for_completion(&xfer->done) is true.  Probably we should add
that as a Fixes tag?:

Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")

Btw, the scmi_xfer_done_no_timeout() has a confusing name, because it
does timeout.  Was the "_no" an accident?

regards,
dan carpenter
Cristian Marussi Jan. 23, 2025, 7:48 p.m. UTC | #2
On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> s/dirvers/drivers/
> 
> On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > From: Huangjie <huangjie1663@phytium.com.cn>
> > 
> > spin_until_cond() not really hold a spin lock, possible timeout may occur
> > in preemption kernel when preempted after spin_until_cond().
> > 
> > We check status again when reach timeout is reached to prevent incorrect
> > jugement of timeout.
> > 

Hi,

probably another not so short email of mine :P ... 

> 
> I suspect the real issue is that we exit the spin loop when
> try_wait_for_completion(&xfer->done) is true.  Probably we should add
> that as a Fixes tag?:
> 

The Kernel SCMI stack, acting as an SCMI agent have to cope with the
possible (even though rare) scenario of receiving Out of Order messages
when dealing with async commands...

...IOW, what to do if, after having issued an AsycnCmd, the Delayed response
is received BEFORE the corresponding immediate reply to the initial request:
such a wicked (and rare) situation could be the result of a misbehaving
platform server OR simply due to parallellization of activies on the A2P
and P2A on some transports, i.e. platform sent reply and delayed_response
in the correct order but the transport delivered those OoO, being
transmitted on 2 discinct physical  channels...hard but not impossible.

Kernel side we address this OoO scenario by assuming that, if a valid
Delayed Response to a in-flight Async-cmd is received on teh P2A chan,
before the immediate A2P reply, the transaction itself is good and we
can progress by just logging and ignoring/swallowing the missing
immediate-reply, that will probably arrive later, and just carry on
processing the Delayed Response.

In order to do that, we maintain, in fact, a per-message state-machine,
and inside scmi_msg_response_validate(), when we detect the OoO condition,
we cause the wait-for-immediate-reply to terminate by issuing forcibly a
complete(xfer->done)....

...this works straight away for the non-polled IRQ transactions since
causes the wait_for_completion() to terminate cleanly, BUT in order to
cut-short also the busy-wait in the polling case we need that additional
try_wait_for_completion()....so as not to spin forever for a message
that we dont care anymore...
[ note that any late arriving immediate-reply will be in teh future
 discarded at this point]

For this reason, I think that this patch, while correctly checking the
poll_done() condition when the spinloop terminates for the reason
explained in the commit-message, it should also check if the loop was not
instead forcibly terminated by the OoO scenario...if not, we will end up
considering the polling to have timeout, while instead it was forcibly
terminated by the OoO state machine complete() and just want to ignore
such missing message...

So what about instead of checking (untested):

+	if (!completion_done(&xfer->done) &&
+	    !info->desc->ops->poll_done(cinfo, xfer))

... so that we determine that the polling has timed out only when:

- the immediate-reply has NOT been received [!poll_done()]

 AND

- we were NOT forcibly make to ignore the missing reply in an OoO scenario
    [!completion_done]

NOTE THAT xfer->done on the polling path is ONLY completed in case of OoO.

> Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")
> 
> Btw, the scmi_xfer_done_no_timeout() has a confusing name, because it
> does timeout.  Was the "_no" an accident?
> 

Agreed. Not sure of the history of the bad naming...

Thans,
Cristian
diff mbox series

Patch

diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
index 1b5fb2c4ce86..10b049fe5fd0 100644
--- a/drivers/firmware/arm_scmi/driver.c
+++ b/drivers/firmware/arm_scmi/driver.c
@@ -1246,7 +1246,8 @@  static int scmi_wait_for_reply(struct device *dev, const struct scmi_desc *desc,
 
 			spin_until_cond(scmi_xfer_done_no_timeout(cinfo,
 								  xfer, stop));
-			if (ktime_after(ktime_get(), stop)) {
+			if (ktime_after(ktime_get(), stop) &&
+			    !info->desc->ops->poll_done(cinfo, xfer)) {
 				dev_err(dev,
 					"timed out in resp(caller: %pS) - polling\n",
 					(void *)_RET_IP_);