Message ID | 20250123083323.2363749-1-jackhuang021@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | dirvers: scmi: poll again when transfer reach timeout | expand |
s/dirvers/drivers/ On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote: > From: Huangjie <huangjie1663@phytium.com.cn> > > spin_until_cond() not really hold a spin lock, possible timeout may occur > in preemption kernel when preempted after spin_until_cond(). > > We check status again when reach timeout is reached to prevent incorrect > jugement of timeout. > I suspect the real issue is that we exit the spin loop when try_wait_for_completion(&xfer->done) is true. Probably we should add that as a Fixes tag?: Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages") Btw, the scmi_xfer_done_no_timeout() has a confusing name, because it does timeout. Was the "_no" an accident? regards, dan carpenter
On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote: > s/dirvers/drivers/ > > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote: > > From: Huangjie <huangjie1663@phytium.com.cn> > > > > spin_until_cond() not really hold a spin lock, possible timeout may occur > > in preemption kernel when preempted after spin_until_cond(). > > > > We check status again when reach timeout is reached to prevent incorrect > > jugement of timeout. > > Hi, probably another not so short email of mine :P ... > > I suspect the real issue is that we exit the spin loop when > try_wait_for_completion(&xfer->done) is true. Probably we should add > that as a Fixes tag?: > The Kernel SCMI stack, acting as an SCMI agent have to cope with the possible (even though rare) scenario of receiving Out of Order messages when dealing with async commands... ...IOW, what to do if, after having issued an AsycnCmd, the Delayed response is received BEFORE the corresponding immediate reply to the initial request: such a wicked (and rare) situation could be the result of a misbehaving platform server OR simply due to parallellization of activies on the A2P and P2A on some transports, i.e. platform sent reply and delayed_response in the correct order but the transport delivered those OoO, being transmitted on 2 discinct physical channels...hard but not impossible. Kernel side we address this OoO scenario by assuming that, if a valid Delayed Response to a in-flight Async-cmd is received on teh P2A chan, before the immediate A2P reply, the transaction itself is good and we can progress by just logging and ignoring/swallowing the missing immediate-reply, that will probably arrive later, and just carry on processing the Delayed Response. In order to do that, we maintain, in fact, a per-message state-machine, and inside scmi_msg_response_validate(), when we detect the OoO condition, we cause the wait-for-immediate-reply to terminate by issuing forcibly a complete(xfer->done).... ...this works straight away for the non-polled IRQ transactions since causes the wait_for_completion() to terminate cleanly, BUT in order to cut-short also the busy-wait in the polling case we need that additional try_wait_for_completion()....so as not to spin forever for a message that we dont care anymore... [ note that any late arriving immediate-reply will be in teh future discarded at this point] For this reason, I think that this patch, while correctly checking the poll_done() condition when the spinloop terminates for the reason explained in the commit-message, it should also check if the loop was not instead forcibly terminated by the OoO scenario...if not, we will end up considering the polling to have timeout, while instead it was forcibly terminated by the OoO state machine complete() and just want to ignore such missing message... So what about instead of checking (untested): + if (!completion_done(&xfer->done) && + !info->desc->ops->poll_done(cinfo, xfer)) ... so that we determine that the polling has timed out only when: - the immediate-reply has NOT been received [!poll_done()] AND - we were NOT forcibly make to ignore the missing reply in an OoO scenario [!completion_done] NOTE THAT xfer->done on the polling path is ONLY completed in case of OoO. > Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages") > > Btw, the scmi_xfer_done_no_timeout() has a confusing name, because it > does timeout. Was the "_no" an accident? > Agreed. Not sure of the history of the bad naming... Thans, Cristian
diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c index 1b5fb2c4ce86..10b049fe5fd0 100644 --- a/drivers/firmware/arm_scmi/driver.c +++ b/drivers/firmware/arm_scmi/driver.c @@ -1246,7 +1246,8 @@ static int scmi_wait_for_reply(struct device *dev, const struct scmi_desc *desc, spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop)); - if (ktime_after(ktime_get(), stop)) { + if (ktime_after(ktime_get(), stop) && + !info->desc->ops->poll_done(cinfo, xfer)) { dev_err(dev, "timed out in resp(caller: %pS) - polling\n", (void *)_RET_IP_);