Message ID | 20250321174005.4077683-1-jeff.hugo@oss.qualcomm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v2] bus: mhi: host: Address conflict between power_up and syserr | expand |
On Fri, Mar 21, 2025 at 11:40:05AM -0600, Jeff Hugo wrote: > From: Jeffrey Hugo <quic_jhugo@quicinc.com> > > mhi_async_power_up() enables IRQs, at which point we can receive a syserr > notification from the device. The syserr notification queues a work item > that cannot execute until the pm_mutex is released. > > If we receive a syserr notification at the right time during > mhi_async_power_up(), we will fail to initialize the device. > > The syserr work item will be pending. If mhi_async_power_up() detects the > syserr, it will handle it. If the device is in PBL, then the PBL state > transition event will be queued, resulting in a work item after the > pending syserr work item. Once mhi_async_power_up() releases the pm_mutex > the syserr work item can run. It will blindly attempt to reset the MHI > state machine, which is the recovery action for syserr. PBL/SBL are not > interrupt driven and will ignore the MHI Reset unless syserr is actively > advertised. This will cause the syserr work item to timeout waiting for > Reset to be cleared, and will leave the host state in syserr processing. > The PBL transition work item will then run, and immediately fail because > syserr processing is not a valid state for PBL transition. > > This leaves the device uninitialized. > > This issue has a fairly unique signature in the kernel log: > > [ 909.803598] mhi mhi3: Requested to power ON > [ 909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device. Attempting to recover > [ 909.803945] mhi mhi3: Power on setup success > [ 911.808444] mhi mhi3: Device failed to exit MHI Reset state > [ 911.808448] mhi mhi3: Device MHI is not in valid state > > We cannot remove the syserr handling from mhi_async_power_up() because the > device may be in the syserr state, but we missed the notification as the > irq was fired before irqs were enabled. We also can't queue the syserr > work item from mhi_async_power_up() if syserr is detected because that may > result in a duplicate work item, and cause the same issue since the > duplicate item will blindly issue MHI Reset even if syserr is no longer > active. > > Instead, add a check in the syserr work item to make sure that the device > is in the syserr state if the device is in the PBL or SBL EEs. > > It is unknown when this issue was introduced. It was first observed with > commit bce3f770684c ("bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state") but > that commit does not appear to introduce the issue per code inspection. > This issue is suspected to trace back to the introduction of MHI, but the > relevant code paths have drastically changed since then. Therefore, do > not identify a specific commit in a Fixes tag as confidence is low that > such a commit would be correctly identified. > > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> > Reviewed-by: Troy Hanson <quic_thanson@quicinc.com> > --- > > v2: > -Move comment to reset_device condition > -Rename cur_statemachine_state to cur_state > -Amend commit text to explain the lack of a Fixes: > > drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++- > 1 file changed, 17 insertions(+), 1 deletion(-) > > diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c > index 11c0e751f223..5e608436775f 100644 > --- a/drivers/bus/mhi/host/pm.c > +++ b/drivers/bus/mhi/host/pm.c > @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) > struct mhi_cmd *mhi_cmd; > struct mhi_event_ctxt *er_ctxt; > struct device *dev = &mhi_cntrl->mhi_dev->dev; > + bool reset_device = false; > int ret, i; > > dev_dbg(dev, "Transitioning from PM state: %s to: %s\n", > @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) > /* Wake up threads waiting for state transition */ > wake_up_all(&mhi_cntrl->state_event); > > - /* Trigger MHI RESET so that the device will not access host memory */ > if (MHI_REG_ACCESS_VALID(prev_state)) { > + enum mhi_state cur_state = mhi_get_mhi_state(mhi_cntrl); > + enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl); > + > + if (cur_state == MHI_STATE_SYS_ERR) > + reset_device = true; > + else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL) > + reset_device = true; > + } > + > + /* > + * Trigger MHI RESET so that the device will not access host memory. > + * If the device is in PBL or SBL, it will only respond to RESET if > + * the device is in SYSERR state. SYSERR might already be cleared > + * at this point. Oops... I asked you to move only the existing comment below and keep the rest as is. Like, if (MHI_REG_ACCESS_VALID(prev_state)) { /* * If the device is in PBL or SBL, it will only respond to RESET * if the device is in SYSERR state. SYSERR might already be * cleared at this point. */ ... /* Trigger MHI RESET so that the device will not access host memory */ if (reset_device) { ,,, Sorry if I was not clear. - Mani
diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c index 11c0e751f223..5e608436775f 100644 --- a/drivers/bus/mhi/host/pm.c +++ b/drivers/bus/mhi/host/pm.c @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) struct mhi_cmd *mhi_cmd; struct mhi_event_ctxt *er_ctxt; struct device *dev = &mhi_cntrl->mhi_dev->dev; + bool reset_device = false; int ret, i; dev_dbg(dev, "Transitioning from PM state: %s to: %s\n", @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) /* Wake up threads waiting for state transition */ wake_up_all(&mhi_cntrl->state_event); - /* Trigger MHI RESET so that the device will not access host memory */ if (MHI_REG_ACCESS_VALID(prev_state)) { + enum mhi_state cur_state = mhi_get_mhi_state(mhi_cntrl); + enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl); + + if (cur_state == MHI_STATE_SYS_ERR) + reset_device = true; + else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL) + reset_device = true; + } + + /* + * Trigger MHI RESET so that the device will not access host memory. + * If the device is in PBL or SBL, it will only respond to RESET if + * the device is in SYSERR state. SYSERR might already be cleared + * at this point. + */ + if (reset_device) { u32 in_reset = -1; unsigned long timeout = msecs_to_jiffies(mhi_cntrl->timeout_ms);