[v1,4/4] scsi: ufs: Fix up and simplify error recovery mechanism

Message ID: 1594616232-25080-5-git-send-email-cang@codeaurora.org (mailing list archive)
State: Superseded
Series: Fix up and simplify error recovery mechanism

Commit Message

Can Guo July 13, 2020, 4:57 a.m. UTC
Error recovery can be invoked from multiple paths, including hibern8
enter/exit, some vendor vops, ufshcd_eh_host_reset_handler(), resume and
the eh_work scheduled from IRQ context. Ultimately, all of these paths try
to invoke ufshcd_reset_and_restore(), in either a synchronous or an
asynchronous manner.

Supporting both synchronous and asynchronous invocation at the same time
leads to several problems:

- If link recovery happens during clock scaling work, acquiring scaling_lock
  in ufshcd_exec_dev_cmd() would cause a deadlock, because scaling_lock is
  already held by the scaling work before link recovery happens.

- If link recovery happens during ungate work, ufshcd_hold() would be
  called recursively. Although commit 53c12d0ef6fcb
  ("scsi: ufs: fix error recovery after the hibern8 exit failure") fixed
  one such deadlock by adding a check of eh_in_progress into ufshcd_hold(),
  that very check allows eh_work to run in parallel while link recovery is
  running.

- Similar concurrency can also happen to error recovery invoked from
  eh_host_reset_handler(): although it tries to prevent this by flushing
  eh_work before invoking reset and restore, eh_work can still be scheduled
  and run in parallel after flush_work() returns.

- Concurrency can even happen between eh_works. eh_work is currently
  queued on system_wq, which allows multiple instances of it to run in
  parallel, and there is no proper protection against that.

To fix up and simplify the error recovery mechanism, this change mainly
does the following:

o Queue eh_work on a single-threaded workqueue to avoid concurrency
  between eh_work instances, as sketched below.
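
  In essence (taken from ufshcd_init() and ufshcd_schedule_eh_work() in
  the patch below):

	/* one worker thread, so eh_work instances are serialized */
	snprintf(eh_wq_name, sizeof(eh_wq_name), "ufs_eh_wq_%d",
		 hba->host->host_no);
	hba->eh_wq = create_singlethread_workqueue(eh_wq_name);
	...
	/* in ufshcd_schedule_eh_work(), with the host lock held */
	queue_work(hba->eh_wq, &hba->eh_work);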

o According to the UFSHCI JEDEC spec, a hibern8 enter/exit error occurs
  when the link is broken. This actually applies to any power mode change
  operation. In this change, if a power mode change operation (including
  AH8 enter/exit) fails, mark the link state as UIC_LINK_BROKEN_STATE and
  schedule eh_work. eh_work needs to do a full reset and restore to bring
  the link back to the active state. Until eh_work recovers the link state
  back to active, any power mode change attempt simply returns -ENOLINK to
  avoid consecutive HW errors, as sketched below.
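
  Abridged from ufshcd_uic_pwr_ctrl() in the patch below:

	spin_lock_irqsave(hba->host->host_lock, flags);
	if (ufshcd_is_link_broken(hba)) {
		ret = -ENOLINK;
		goto out_unlock;
	}
	...
	if (ret) {	/* the power mode change failed */
		ufshcd_set_link_broken(hba);
		ufshcd_schedule_eh_work(hba);
	}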

o To avoid concurrency between eh_work and link recovery, remove link
  recovery from the hibern8 enter/exit functions. If hibern8 enter/exit
  fails, simply return the error code and leave the recovery to eh_work.

o Recover UFS hba runtime PM errors in eh_work. If ufshcd_suspend/resume
  fails due to a UFS error, e.g. a hibern8 enter/exit error or an SSU cmd
  error, the runtime PM framework saves the error to
  dev.power.runtime_error. After that, hba runtime suspend/resume is not
  invoked anymore until dev.power.runtime_error is cleared. The runtime PM
  error can be recovered in eh_work by calling pm_runtime_set_active()
  after reset and restore succeed. Meanwhile, if pm_runtime_set_active()
  returns no error, which means dev.power.runtime_error is cleared, we also
  need to explicitly resume the scsi devices under the hba, in case any of
  them failed to resume due to the hba runtime resume error, as sketched
  below.
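
  Abridged from eh_work in the patch below (note the kernel test robot
  report further down: q->dev and q->rpm_status only exist when CONFIG_PM
  is enabled):

	/* set hba RPM status to RPM_ACTIVE; also clears runtime_error */
	ret = pm_runtime_set_active(hba->dev);
	if (!ret) {
		/* wake up waiters stuck in blk_queue_enter() */
		list_for_each_entry(sdev, &shost->__devices, siblings) {
			q = sdev->request_queue;
			if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
				       q->rpm_status == RPM_SUSPENDING))
				pm_request_resume(q->dev);
		}
	}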

o Fix a race between eh_work and ufshcd_suspend/resume. The old code
  blocks scsi requests before scheduling eh_work, but when eh_work calls
  pm_runtime_get_sync(), if ufshcd_suspend/resume is sending a scsi cmd,
  most likely the SSU cmd, pm_runtime_get_sync() never returns because
  scsi requests are blocked. To fix this race,
  o Don't block scsi requests before scheduling eh_work; instead let
    eh_work block scsi requests when it is ready to start error recovery.
  o Meanwhile, if eh_work is scheduled due to a fatal error, don't requeue
    the scsi cmds sent from the ufshcd_suspend/resume path, but simply let
    them fail. If the scsi cmds fail, hba runtime suspend/resume fails too,
    but that does not hurt, since eh_work recovers the hba runtime PM
    error. The relevant part of ufshcd_queuecommand() is sketched below.
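
  Abridged from ufshcd_queuecommand() in the patch below:

	case UFSHCD_STATE_EH_SCHEDULED_FATAL:
		/*
		 * A scsi cmd sent by PM ops (e.g. the SSU cmd) cannot
		 * complete while eh_work waits in pm_runtime_get_sync(),
		 * so fail it fast instead of retrying it.
		 */
		if (hba->pm_op_in_progress) {
			hba->force_reset = true;
			set_host_byte(cmd, DID_BAD_TARGET);
			goto out_compl_cmd;
		}
		/* fallthrough */
	case UFSHCD_STATE_RESET:
		err = SCSI_MLQUEUE_HOST_BUSY;
		goto out_compl_cmd;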

o Move the host/regs dump in ufshcd_check_errors() to eh_work, because a
  heavy dump in IRQ context can lead to stability issues. In addition, do
  some cleanup in ufshcd_print_host_regs() and ufshcd_print_host_state().

Signed-off-by: Can Guo <cang@codeaurora.org>
---
 drivers/scsi/ufs/ufs-sysfs.c |   1 +
 drivers/scsi/ufs/ufshcd.c    | 441 ++++++++++++++++++++++++++-----------------
 drivers/scsi/ufs/ufshcd.h    |  15 ++
 3 files changed, 284 insertions(+), 173 deletions(-)

Comments

kernel test robot July 13, 2020, 2:11 p.m. UTC | #1
Hi Can,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on scsi/for-next]
[also build test ERROR on mkp-scsi/for-next v5.8-rc5 next-20200713]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Can-Guo/Fix-up-and-simplify-error-recovery-mechanism/20200713-130435
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
config: alpha-allyesconfig (attached as .config)
compiler: alpha-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=alpha 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/scsi/ufs/ufshcd.c: In function 'ufshcd_err_handler':
>> drivers/scsi/ufs/ufshcd.c:5755:10: error: 'struct request_queue' has no member named 'dev'
    5755 |     if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
         |          ^~
>> drivers/scsi/ufs/ufshcd.c:5755:21: error: 'struct request_queue' has no member named 'rpm_status'
    5755 |     if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
         |                     ^~
   drivers/scsi/ufs/ufshcd.c:5756:14: error: 'struct request_queue' has no member named 'rpm_status'
    5756 |             q->rpm_status == RPM_SUSPENDING))
         |              ^~
   drivers/scsi/ufs/ufshcd.c:5757:25: error: 'struct request_queue' has no member named 'dev'
    5757 |      pm_request_resume(q->dev);
         |                         ^~
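
These errors presumably occur because struct request_queue only has the
dev and rpm_status members when CONFIG_PM is enabled, and this alpha
config does not enable it. A sketch of one possible guard (assuming the
root cause above; not part of this patch):

	#ifdef CONFIG_PM
		if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
			       q->rpm_status == RPM_SUSPENDING))
			pm_request_resume(q->dev);
	#endif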

vim +5755 drivers/scsi/ufs/ufshcd.c

  5579	
  5580	/**
  5581	 * ufshcd_err_handler - handle UFS errors that require s/w attention
  5582	 * @work: pointer to work structure
  5583	 */
  5584	static void ufshcd_err_handler(struct work_struct *work)
  5585	{
  5586		struct ufs_hba *hba;
  5587		struct Scsi_Host *shost;
  5588		struct scsi_device *sdev;
  5589		unsigned long flags;
  5590		u32 err_xfer = 0;
  5591		u32 err_tm = 0;
  5592		int reset_err = -1;
  5593		int tag;
  5594		bool needs_reset = false;
  5595	
  5596		hba = container_of(work, struct ufs_hba, eh_work);
  5597		shost = hba->host;
  5598	
  5599		spin_lock_irqsave(hba->host->host_lock, flags);
  5600		if ((hba->ufshcd_state == UFSHCD_STATE_ERROR) ||
  5601		    (!(hba->saved_err || hba->saved_uic_err || hba->force_reset) &&
  5602		     !ufshcd_is_link_broken(hba))) {
  5603			if (hba->ufshcd_state != UFSHCD_STATE_ERROR)
  5604				hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
  5605			spin_unlock_irqrestore(hba->host->host_lock, flags);
  5606			return;
  5607		}
  5608		ufshcd_set_eh_in_progress(hba);
  5609		spin_unlock_irqrestore(hba->host->host_lock, flags);
  5610		pm_runtime_get_sync(hba->dev);
  5611		/*
  5612		 * Don't assume anything of pm_runtime_get_sync(), if resume fails,
  5613		 * irq and clocks can be OFF, and powers can be OFF or in LPM.
  5614		 */
  5615		ufshcd_setup_vreg(hba, true);
  5616		ufshcd_config_vreg_hpm(hba, hba->vreg_info.vccq);
  5617		ufshcd_config_vreg_hpm(hba, hba->vreg_info.vccq2);
  5618		ufshcd_setup_hba_vreg(hba, true);
  5619		ufshcd_enable_irq(hba);
  5620	
  5621		ufshcd_hold(hba, false);
  5622		if (!ufshcd_is_clkgating_allowed(hba))
  5623			ufshcd_setup_clocks(hba, true);
  5624	
  5625		if (ufshcd_is_clkscaling_supported(hba)) {
  5626			cancel_work_sync(&hba->clk_scaling.suspend_work);
  5627			cancel_work_sync(&hba->clk_scaling.resume_work);
  5628			ufshcd_suspend_clkscaling(hba);
  5629		}
  5630	
  5631		spin_lock_irqsave(hba->host->host_lock, flags);
  5632		ufshcd_scsi_block_requests(hba);
  5633		hba->ufshcd_state = UFSHCD_STATE_RESET;
  5634	
  5635		/* Complete requests that have door-bell cleared by h/w */
  5636		ufshcd_complete_requests(hba);
  5637	
  5638		if (hba->dev_quirks & UFS_DEVICE_QUIRK_RECOVERY_FROM_DL_NAC_ERRORS) {
  5639			bool ret;
  5640	
  5641			spin_unlock_irqrestore(hba->host->host_lock, flags);
  5642			/* release the lock as ufshcd_quirk_dl_nac_errors() may sleep */
  5643			ret = ufshcd_quirk_dl_nac_errors(hba);
  5644			spin_lock_irqsave(hba->host->host_lock, flags);
  5645			if (!ret && !hba->force_reset && ufshcd_is_link_active(hba))
  5646				goto skip_err_handling;
  5647		}
  5648	
  5649		if (hba->force_reset || ufshcd_is_link_broken(hba) ||
  5650		    ufshcd_is_saved_err_fatal(hba) ||
  5651		    ((hba->saved_err & UIC_ERROR) &&
  5652		     (hba->saved_uic_err & (UFSHCD_UIC_DL_NAC_RECEIVED_ERROR |
  5653					    UFSHCD_UIC_DL_TCx_REPLAY_ERROR))))
  5654			needs_reset = true;
  5655	
  5656		if (hba->saved_err & (INT_FATAL_ERRORS | UIC_ERROR |
  5657				      UFSHCD_UIC_HIBERN8_MASK)) {
  5658			dev_err(hba->dev, "%s: saved_err 0x%x saved_uic_err 0x%x\n",
  5659					__func__, hba->saved_err, hba->saved_uic_err);
  5660			spin_unlock_irqrestore(hba->host->host_lock, flags);
  5661			ufshcd_print_host_state(hba);
  5662			ufshcd_print_pwr_info(hba);
  5663			ufshcd_print_host_regs(hba);
  5664			ufshcd_print_tmrs(hba, hba->outstanding_tasks);
  5665			spin_lock_irqsave(hba->host->host_lock, flags);
  5666		}
  5667	
  5668		/*
  5669		 * if host reset is required then skip clearing the pending
  5670		 * transfers forcefully because they will get cleared during
  5671		 * host reset and restore
  5672		 */
  5673		if (needs_reset)
  5674			goto skip_pending_xfer_clear;
  5675	
  5676		/* release lock as clear command might sleep */
  5677		spin_unlock_irqrestore(hba->host->host_lock, flags);
  5678		/* Clear pending transfer requests */
  5679		for_each_set_bit(tag, &hba->outstanding_reqs, hba->nutrs) {
  5680			if (ufshcd_clear_cmd(hba, tag)) {
  5681				err_xfer = true;
  5682				goto lock_skip_pending_xfer_clear;
  5683			}
  5684		}
  5685	
  5686		/* Clear pending task management requests */
  5687		for_each_set_bit(tag, &hba->outstanding_tasks, hba->nutmrs) {
  5688			if (ufshcd_clear_tm_cmd(hba, tag)) {
  5689				err_tm = true;
  5690				goto lock_skip_pending_xfer_clear;
  5691			}
  5692		}
  5693	
  5694	lock_skip_pending_xfer_clear:
  5695		spin_lock_irqsave(hba->host->host_lock, flags);
  5696	
  5697		/* Complete the requests that are cleared by s/w */
  5698		ufshcd_complete_requests(hba);
  5699	
  5700		if (err_xfer || err_tm)
  5701			needs_reset = true;
  5702	
  5703	skip_pending_xfer_clear:
  5704		/* Fatal errors need reset */
  5705		if (needs_reset) {
  5706			unsigned long max_doorbells = (1UL << hba->nutrs) - 1;
  5707	
  5708			/*
  5709			 * ufshcd_reset_and_restore() does the link reinitialization
  5710			 * which will need atleast one empty doorbell slot to send the
  5711			 * device management commands (NOP and query commands).
  5712			 * If there is no slot empty at this moment then free up last
  5713			 * slot forcefully.
  5714			 */
  5715			if (hba->outstanding_reqs == max_doorbells)
  5716				__ufshcd_transfer_req_compl(hba,
  5717							    (1UL << (hba->nutrs - 1)));
  5718	
  5719			hba->force_reset = false;
  5720			spin_unlock_irqrestore(hba->host->host_lock, flags);
  5721			reset_err = ufshcd_reset_and_restore(hba);
  5722			spin_lock_irqsave(hba->host->host_lock, flags);
  5723			if (reset_err)
  5724				dev_err(hba->dev, "%s: reset and restore failed\n",
  5725						__func__);
  5726		}
  5727	
  5728	skip_err_handling:
  5729		if (!needs_reset) {
  5730			if (hba->ufshcd_state == UFSHCD_STATE_RESET)
  5731				hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
  5732			if (hba->saved_err || hba->saved_uic_err)
  5733				dev_err_ratelimited(hba->dev, "%s: exit: saved_err 0x%x saved_uic_err 0x%x",
  5734				    __func__, hba->saved_err, hba->saved_uic_err);
  5735		}
  5736	
  5737		if (!reset_err) {
  5738			int ret;
  5739			struct request_queue *q;
  5740	
  5741			spin_unlock_irqrestore(hba->host->host_lock, flags);
  5742			/*
  5743			 * Set RPM status of hba device to RPM_ACTIVE,
  5744			 * this also clears its runtime error.
  5745			 */
  5746			ret = pm_runtime_set_active(hba->dev);
  5747			/*
  5748			 * If hba device had runtime error, explicitly resume
  5749			 * its scsi devices so that block layer can wake up
  5750			 * those waiting in blk_queue_enter().
  5751			 */
  5752			if (!ret) {
  5753				list_for_each_entry(sdev, &shost->__devices, siblings) {
  5754					q = sdev->request_queue;
> 5755					if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
  5756						       q->rpm_status == RPM_SUSPENDING))
  5757						pm_request_resume(q->dev);
  5758				}
  5759			}
  5760			spin_lock_irqsave(hba->host->host_lock, flags);
  5761		}
  5762	
  5763		/* If clk_gating is held by pm ops, release it */
  5764		if (pm_runtime_active(hba->dev) && hba->clk_gating.held_by_pm) {
  5765			hba->clk_gating.held_by_pm = false;
  5766			__ufshcd_release(hba);
  5767		}
  5768	
  5769		ufshcd_clear_eh_in_progress(hba);
  5770		spin_unlock_irqrestore(hba->host->host_lock, flags);
  5771		ufshcd_scsi_unblock_requests(hba);
  5772		ufshcd_release(hba);
  5773		if (ufshcd_is_clkscaling_supported(hba))
  5774			ufshcd_resume_clkscaling(hba);
  5775		pm_runtime_put_noidle(hba->dev);
  5776	}
  5777	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

Patch

diff --git a/drivers/scsi/ufs/ufs-sysfs.c b/drivers/scsi/ufs/ufs-sysfs.c
index 2d71d23..02d379f00 100644
--- a/drivers/scsi/ufs/ufs-sysfs.c
+++ b/drivers/scsi/ufs/ufs-sysfs.c
@@ -16,6 +16,7 @@  static const char *ufschd_uic_link_state_to_string(
 	case UIC_LINK_OFF_STATE:	return "OFF";
 	case UIC_LINK_ACTIVE_STATE:	return "ACTIVE";
 	case UIC_LINK_HIBERN8_STATE:	return "HIBERN8";
+	case UIC_LINK_BROKEN_STATE:	return "BROKEN";
 	default:			return "UNKNOWN";
 	}
 }
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 33214bb..98bd28b 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -15,6 +15,8 @@ 
 #include <linux/of.h>
 #include <linux/bitfield.h>
 #include <linux/blk-pm.h>
+#include <linux/blkdev.h>
+#include <scsi/scsi_device.h>
 #include "ufshcd.h"
 #include "ufs_quirks.h"
 #include "unipro.h"
@@ -125,7 +127,8 @@  enum {
 	UFSHCD_STATE_RESET,
 	UFSHCD_STATE_ERROR,
 	UFSHCD_STATE_OPERATIONAL,
-	UFSHCD_STATE_EH_SCHEDULED,
+	UFSHCD_STATE_EH_SCHEDULED_FATAL,
+	UFSHCD_STATE_EH_SCHEDULED_NON_FATAL,
 };
 
 /* UFSHCD error handling flags */
@@ -228,6 +231,11 @@  static int ufshcd_scale_clks(struct ufs_hba *hba, bool scale_up);
 static irqreturn_t ufshcd_intr(int irq, void *__hba);
 static int ufshcd_change_power_mode(struct ufs_hba *hba,
 			     struct ufs_pa_layer_attr *pwr_mode);
+static void ufshcd_schedule_eh_work(struct ufs_hba *hba);
+static int ufshcd_setup_hba_vreg(struct ufs_hba *hba, bool on);
+static int ufshcd_setup_vreg(struct ufs_hba *hba, bool on);
+static inline int ufshcd_config_vreg_hpm(struct ufs_hba *hba,
+					 struct ufs_vreg *vreg);
 static int ufshcd_wb_buf_flush_enable(struct ufs_hba *hba);
 static int ufshcd_wb_buf_flush_disable(struct ufs_hba *hba);
 static int ufshcd_wb_ctrl(struct ufs_hba *hba, bool enable);
@@ -411,15 +419,6 @@  static void ufshcd_print_err_hist(struct ufs_hba *hba,
 static void ufshcd_print_host_regs(struct ufs_hba *hba)
 {
 	ufshcd_dump_regs(hba, 0, UFSHCI_REG_SPACE_SIZE, "host_regs: ");
-	dev_err(hba->dev, "hba->ufs_version = 0x%x, hba->capabilities = 0x%x\n",
-		hba->ufs_version, hba->capabilities);
-	dev_err(hba->dev,
-		"hba->outstanding_reqs = 0x%x, hba->outstanding_tasks = 0x%x\n",
-		(u32)hba->outstanding_reqs, (u32)hba->outstanding_tasks);
-	dev_err(hba->dev,
-		"last_hibern8_exit_tstamp at %lld us, hibern8_exit_cnt = %d\n",
-		ktime_to_us(hba->ufs_stats.last_hibern8_exit_tstamp),
-		hba->ufs_stats.hibern8_exit_cnt);
 
 	ufshcd_print_err_hist(hba, &hba->ufs_stats.pa_err, "pa_err");
 	ufshcd_print_err_hist(hba, &hba->ufs_stats.dl_err, "dl_err");
@@ -499,6 +498,8 @@  static void ufshcd_print_tmrs(struct ufs_hba *hba, unsigned long bitmap)
 
 static void ufshcd_print_host_state(struct ufs_hba *hba)
 {
+	struct scsi_device *sdev_ufs = hba->sdev_ufs_device;
+
 	dev_err(hba->dev, "UFS Host state=%d\n", hba->ufshcd_state);
 	dev_err(hba->dev, "outstanding reqs=0x%lx tasks=0x%lx\n",
 		hba->outstanding_reqs, hba->outstanding_tasks);
@@ -511,12 +512,22 @@  static void ufshcd_print_host_state(struct ufs_hba *hba)
 	dev_err(hba->dev, "Auto BKOPS=%d, Host self-block=%d\n",
 		hba->auto_bkops_enabled, hba->host->host_self_blocked);
 	dev_err(hba->dev, "Clk gate=%d\n", hba->clk_gating.state);
+	dev_err(hba->dev,
+		"last_hibern8_exit_tstamp at %lld us, hibern8_exit_cnt = %d\n",
+		ktime_to_us(hba->ufs_stats.last_hibern8_exit_tstamp),
+		hba->ufs_stats.hibern8_exit_cnt);
+	dev_err(hba->dev, "last intr ts=%lld, last intr status=0x%x\n",
+		ktime_to_us(hba->ufs_stats.last_intr_ts),
+		hba->ufs_stats.last_intr_status);
 	dev_err(hba->dev, "error handling flags=0x%x, req. abort count=%d\n",
 		hba->eh_flags, hba->req_abort_count);
-	dev_err(hba->dev, "Host capabilities=0x%x, caps=0x%x\n",
-		hba->capabilities, hba->caps);
+	dev_err(hba->dev, "hba->ufs_version=0x%x, Host capabilities=0x%x, caps=0x%x\n",
+		hba->ufs_version, hba->capabilities, hba->caps);
 	dev_err(hba->dev, "quirks=0x%x, dev. quirks=0x%x\n", hba->quirks,
 		hba->dev_quirks);
+	if (sdev_ufs)
+		dev_err(hba->dev, "UFS dev info: %.8s %.16s rev %.4s\n",
+			sdev_ufs->vendor, sdev_ufs->model, sdev_ufs->rev);
 }
 
 /**
@@ -1568,11 +1579,6 @@  int ufshcd_hold(struct ufs_hba *hba, bool async)
 	spin_lock_irqsave(hba->host->host_lock, flags);
 	hba->clk_gating.active_reqs++;
 
-	if (ufshcd_eh_in_progress(hba)) {
-		spin_unlock_irqrestore(hba->host->host_lock, flags);
-		return 0;
-	}
-
 start:
 	switch (hba->clk_gating.state) {
 	case CLKS_ON:
@@ -1647,6 +1653,7 @@  EXPORT_SYMBOL_GPL(ufshcd_hold);
 
 static void ufshcd_gate_work(struct work_struct *work)
 {
+	int ret;
 	struct ufs_hba *hba = container_of(work, struct ufs_hba,
 			clk_gating.gate_work.work);
 	unsigned long flags;
@@ -1676,8 +1683,11 @@  static void ufshcd_gate_work(struct work_struct *work)
 
 	/* put the link into hibern8 mode before turning off clocks */
 	if (ufshcd_can_hibern8_during_gating(hba)) {
-		if (ufshcd_uic_hibern8_enter(hba)) {
+		ret = ufshcd_uic_hibern8_enter(hba);
+		if (ret) {
 			hba->clk_gating.state = CLKS_ON;
+			dev_err(hba->dev, "%s: hibern8 enter failed %d\n",
+					__func__, ret);
 			trace_ufshcd_clk_gating(dev_name(hba->dev),
 						hba->clk_gating.state);
 			goto out;
@@ -1725,8 +1735,7 @@  static void __ufshcd_release(struct ufs_hba *hba)
 	if (hba->clk_gating.active_reqs || hba->clk_gating.is_suspended
 		|| hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL
 		|| ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks
-		|| hba->active_uic_cmd || hba->uic_async_done
-		|| ufshcd_eh_in_progress(hba))
+		|| hba->active_uic_cmd || hba->uic_async_done)
 		return;
 
 	hba->clk_gating.state = REQ_CLKS_OFF;
@@ -2505,34 +2514,6 @@  static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
 	if (!down_read_trylock(&hba->clk_scaling_lock))
 		return SCSI_MLQUEUE_HOST_BUSY;
 
-	spin_lock_irqsave(hba->host->host_lock, flags);
-	switch (hba->ufshcd_state) {
-	case UFSHCD_STATE_OPERATIONAL:
-		break;
-	case UFSHCD_STATE_EH_SCHEDULED:
-	case UFSHCD_STATE_RESET:
-		err = SCSI_MLQUEUE_HOST_BUSY;
-		goto out_unlock;
-	case UFSHCD_STATE_ERROR:
-		set_host_byte(cmd, DID_ERROR);
-		cmd->scsi_done(cmd);
-		goto out_unlock;
-	default:
-		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
-				__func__, hba->ufshcd_state);
-		set_host_byte(cmd, DID_BAD_TARGET);
-		cmd->scsi_done(cmd);
-		goto out_unlock;
-	}
-
-	/* if error handling is in progress, don't issue commands */
-	if (ufshcd_eh_in_progress(hba)) {
-		set_host_byte(cmd, DID_ERROR);
-		cmd->scsi_done(cmd);
-		goto out_unlock;
-	}
-	spin_unlock_irqrestore(hba->host->host_lock, flags);
-
 	hba->req_abort_count = 0;
 
 	err = ufshcd_hold(hba, true);
@@ -2568,12 +2549,52 @@  static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
 	/* Make sure descriptors are ready before ringing the doorbell */
 	wmb();
 
-	/* issue command to the controller */
 	spin_lock_irqsave(hba->host->host_lock, flags);
+	switch (hba->ufshcd_state) {
+	case UFSHCD_STATE_OPERATIONAL:
+	case UFSHCD_STATE_EH_SCHEDULED_NON_FATAL:
+		break;
+	case UFSHCD_STATE_EH_SCHEDULED_FATAL:
+		/*
+		 * If we are here, eh_work is either scheduled or running.
+		 * Before eh_work sets ufshcd_state to STATE_RESET, it flushes
+		 * runtime PM ops by calling pm_runtime_get_sync(). If a scsi
+		 * cmd, e.g. the SSU cmd, is sent by PM ops, it can never be
+		 * finished if we let SCSI layer keep retrying it, which gets
+		 * eh_work stuck forever. Neither can we let it pass, because
+		 * ufs now is not in good status, so the SSU cmd may eventually
+		 * time out, blocking eh_work for too long. So just let it fail.
+		 */
+		if (hba->pm_op_in_progress) {
+			hba->force_reset = true;
+			set_host_byte(cmd, DID_BAD_TARGET);
+			goto out_compl_cmd;
+		}
+		/* fallthrough */
+	case UFSHCD_STATE_RESET:
+		err = SCSI_MLQUEUE_HOST_BUSY;
+		goto out_compl_cmd;
+	case UFSHCD_STATE_ERROR:
+		set_host_byte(cmd, DID_ERROR);
+		goto out_compl_cmd;
+	default:
+		dev_WARN_ONCE(hba->dev, 1, "%s: invalid state %d\n",
+				__func__, hba->ufshcd_state);
+		set_host_byte(cmd, DID_BAD_TARGET);
+		goto out_compl_cmd;
+	}
 	ufshcd_vops_setup_xfer_req(hba, tag, true);
 	ufshcd_send_command(hba, tag);
-out_unlock:
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
+	goto out;
+
+out_compl_cmd:
+	scsi_dma_unmap(lrbp->cmd);
+	lrbp->cmd = NULL;
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+	ufshcd_release(hba);
+	if (!err)
+		cmd->scsi_done(cmd);
 out:
 	up_read(&hba->clk_scaling_lock);
 	return err;
@@ -3746,6 +3767,10 @@  static int ufshcd_uic_pwr_ctrl(struct ufs_hba *hba, struct uic_command *cmd)
 	ufshcd_add_delay_before_dme_cmd(hba);
 
 	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (ufshcd_is_link_broken(hba)) {
+		ret = -ENOLINK;
+		goto out_unlock;
+	}
 	hba->uic_async_done = &uic_async_done;
 	if (ufshcd_readl(hba, REG_INTERRUPT_ENABLE) & UIC_COMMAND_COMPL) {
 		ufshcd_disable_intr(hba, UIC_COMMAND_COMPL);
@@ -3793,6 +3818,12 @@  static int ufshcd_uic_pwr_ctrl(struct ufs_hba *hba, struct uic_command *cmd)
 	hba->uic_async_done = NULL;
 	if (reenable_intr)
 		ufshcd_enable_intr(hba, UIC_COMMAND_COMPL);
+	if (ret) {
+		ufshcd_set_link_broken(hba);
+		ufshcd_schedule_eh_work(hba);
+	}
+
+out_unlock:
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
 	mutex_unlock(&hba->uic_cmd_mutex);
 
@@ -3862,7 +3893,7 @@  int ufshcd_link_recovery(struct ufs_hba *hba)
 }
 EXPORT_SYMBOL_GPL(ufshcd_link_recovery);
 
-static int __ufshcd_uic_hibern8_enter(struct ufs_hba *hba)
+static int ufshcd_uic_hibern8_enter(struct ufs_hba *hba)
 {
 	int ret;
 	struct uic_command uic_cmd = {0};
@@ -3875,45 +3906,16 @@  static int __ufshcd_uic_hibern8_enter(struct ufs_hba *hba)
 	trace_ufshcd_profile_hibern8(dev_name(hba->dev), "enter",
 			     ktime_to_us(ktime_sub(ktime_get(), start)), ret);
 
-	if (ret) {
-		int err;
-
+	if (ret)
 		dev_err(hba->dev, "%s: hibern8 enter failed. ret = %d\n",
 			__func__, ret);
-
-		/*
-		 * If link recovery fails then return error code returned from
-		 * ufshcd_link_recovery().
-		 * If link recovery succeeds then return -EAGAIN to attempt
-		 * hibern8 enter retry again.
-		 */
-		err = ufshcd_link_recovery(hba);
-		if (err) {
-			dev_err(hba->dev, "%s: link recovery failed", __func__);
-			ret = err;
-		} else {
-			ret = -EAGAIN;
-		}
-	} else
+	else
 		ufshcd_vops_hibern8_notify(hba, UIC_CMD_DME_HIBER_ENTER,
 								POST_CHANGE);
 
 	return ret;
 }
 
-static int ufshcd_uic_hibern8_enter(struct ufs_hba *hba)
-{
-	int ret = 0, retries;
-
-	for (retries = UIC_HIBERN8_ENTER_RETRIES; retries > 0; retries--) {
-		ret = __ufshcd_uic_hibern8_enter(hba);
-		if (!ret)
-			goto out;
-	}
-out:
-	return ret;
-}
-
 int ufshcd_uic_hibern8_exit(struct ufs_hba *hba)
 {
 	struct uic_command uic_cmd = {0};
@@ -3930,7 +3932,6 @@  int ufshcd_uic_hibern8_exit(struct ufs_hba *hba)
 	if (ret) {
 		dev_err(hba->dev, "%s: hibern8 exit failed. ret = %d\n",
 			__func__, ret);
-		ret = ufshcd_link_recovery(hba);
 	} else {
 		ufshcd_vops_hibern8_notify(hba, UIC_CMD_DME_HIBER_EXIT,
 								POST_CHANGE);
@@ -5554,6 +5555,28 @@  static bool ufshcd_quirk_dl_nac_errors(struct ufs_hba *hba)
 	return err_handling;
 }
 
+/* host lock must be held before calling this func */
+static inline bool ufshcd_is_saved_err_fatal(struct ufs_hba *hba)
+{
+	return ((hba->saved_err & INT_FATAL_ERRORS) ||
+		((hba->saved_err & UIC_ERROR) &&
+		 (hba->saved_uic_err & UFSHCD_UIC_DL_PA_INIT_ERROR)));
+}
+
+/* host lock must be held before calling this func */
+static inline void ufshcd_schedule_eh_work(struct ufs_hba *hba)
+{
+	/* handle fatal errors only when link is not in error state */
+	if (hba->ufshcd_state != UFSHCD_STATE_ERROR) {
+		if (hba->force_reset || ufshcd_is_link_broken(hba) ||
+		    ufshcd_is_saved_err_fatal(hba))
+			hba->ufshcd_state = UFSHCD_STATE_EH_SCHEDULED_FATAL;
+		else
+			hba->ufshcd_state = UFSHCD_STATE_EH_SCHEDULED_NON_FATAL;
+		queue_work(hba->eh_wq, &hba->eh_work);
+	}
+}
+
 /**
  * ufshcd_err_handler - handle UFS errors that require s/w attention
  * @work: pointer to work structure
@@ -5561,24 +5584,53 @@  static bool ufshcd_quirk_dl_nac_errors(struct ufs_hba *hba)
 static void ufshcd_err_handler(struct work_struct *work)
 {
 	struct ufs_hba *hba;
+	struct Scsi_Host *shost;
+	struct scsi_device *sdev;
 	unsigned long flags;
 	u32 err_xfer = 0;
 	u32 err_tm = 0;
-	int err = 0;
+	int reset_err = -1;
 	int tag;
 	bool needs_reset = false;
 
 	hba = container_of(work, struct ufs_hba, eh_work);
+	shost = hba->host;
 
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if ((hba->ufshcd_state == UFSHCD_STATE_ERROR) ||
+	    (!(hba->saved_err || hba->saved_uic_err || hba->force_reset) &&
+	     !ufshcd_is_link_broken(hba))) {
+		if (hba->ufshcd_state != UFSHCD_STATE_ERROR)
+			hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		return;
+	}
+	ufshcd_set_eh_in_progress(hba);
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
 	pm_runtime_get_sync(hba->dev);
+	/*
+	 * Don't assume anything of pm_runtime_get_sync(), if resume fails,
+	 * irq and clocks can be OFF, and powers can be OFF or in LPM.
+	 */
+	ufshcd_setup_vreg(hba, true);
+	ufshcd_config_vreg_hpm(hba, hba->vreg_info.vccq);
+	ufshcd_config_vreg_hpm(hba, hba->vreg_info.vccq2);
+	ufshcd_setup_hba_vreg(hba, true);
+	ufshcd_enable_irq(hba);
+
 	ufshcd_hold(hba, false);
+	if (!ufshcd_is_clkgating_allowed(hba))
+		ufshcd_setup_clocks(hba, true);
 
-	spin_lock_irqsave(hba->host->host_lock, flags);
-	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
-		goto out;
+	if (ufshcd_is_clkscaling_supported(hba)) {
+		cancel_work_sync(&hba->clk_scaling.suspend_work);
+		cancel_work_sync(&hba->clk_scaling.resume_work);
+		ufshcd_suspend_clkscaling(hba);
+	}
 
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	ufshcd_scsi_block_requests(hba);
 	hba->ufshcd_state = UFSHCD_STATE_RESET;
-	ufshcd_set_eh_in_progress(hba);
 
 	/* Complete requests that have door-bell cleared by h/w */
 	ufshcd_complete_requests(hba);
@@ -5590,17 +5642,29 @@  static void ufshcd_err_handler(struct work_struct *work)
 		/* release the lock as ufshcd_quirk_dl_nac_errors() may sleep */
 		ret = ufshcd_quirk_dl_nac_errors(hba);
 		spin_lock_irqsave(hba->host->host_lock, flags);
-		if (!ret)
+		if (!ret && !hba->force_reset && ufshcd_is_link_active(hba))
 			goto skip_err_handling;
 	}
-	if ((hba->saved_err & INT_FATAL_ERRORS) ||
-	    (hba->saved_err & UFSHCD_UIC_HIBERN8_MASK) ||
+
+	if (hba->force_reset || ufshcd_is_link_broken(hba) ||
+	    ufshcd_is_saved_err_fatal(hba) ||
 	    ((hba->saved_err & UIC_ERROR) &&
-	    (hba->saved_uic_err & (UFSHCD_UIC_DL_PA_INIT_ERROR |
-				   UFSHCD_UIC_DL_NAC_RECEIVED_ERROR |
-				   UFSHCD_UIC_DL_TCx_REPLAY_ERROR))))
+	     (hba->saved_uic_err & (UFSHCD_UIC_DL_NAC_RECEIVED_ERROR |
+				    UFSHCD_UIC_DL_TCx_REPLAY_ERROR))))
 		needs_reset = true;
 
+	if (hba->saved_err & (INT_FATAL_ERRORS | UIC_ERROR |
+			      UFSHCD_UIC_HIBERN8_MASK)) {
+		dev_err(hba->dev, "%s: saved_err 0x%x saved_uic_err 0x%x\n",
+				__func__, hba->saved_err, hba->saved_uic_err);
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		ufshcd_print_host_state(hba);
+		ufshcd_print_pwr_info(hba);
+		ufshcd_print_host_regs(hba);
+		ufshcd_print_tmrs(hba, hba->outstanding_tasks);
+		spin_lock_irqsave(hba->host->host_lock, flags);
+	}
+
 	/*
 	 * if host reset is required then skip clearing the pending
 	 * transfers forcefully because they will get cleared during
@@ -5652,38 +5716,63 @@  static void ufshcd_err_handler(struct work_struct *work)
 			__ufshcd_transfer_req_compl(hba,
 						    (1UL << (hba->nutrs - 1)));
 
+		hba->force_reset = false;
 		spin_unlock_irqrestore(hba->host->host_lock, flags);
-		err = ufshcd_reset_and_restore(hba);
+		reset_err = ufshcd_reset_and_restore(hba);
 		spin_lock_irqsave(hba->host->host_lock, flags);
-		if (err) {
+		if (reset_err)
 			dev_err(hba->dev, "%s: reset and restore failed\n",
 					__func__);
-			hba->ufshcd_state = UFSHCD_STATE_ERROR;
-		}
-		/*
-		 * Inform scsi mid-layer that we did reset and allow to handle
-		 * Unit Attention properly.
-		 */
-		scsi_report_bus_reset(hba->host, 0);
-		hba->saved_err = 0;
-		hba->saved_uic_err = 0;
 	}
 
 skip_err_handling:
 	if (!needs_reset) {
-		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+		if (hba->ufshcd_state == UFSHCD_STATE_RESET)
+			hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
 		if (hba->saved_err || hba->saved_uic_err)
 			dev_err_ratelimited(hba->dev, "%s: exit: saved_err 0x%x saved_uic_err 0x%x",
 			    __func__, hba->saved_err, hba->saved_uic_err);
 	}
 
-	ufshcd_clear_eh_in_progress(hba);
+	if (!reset_err) {
+		int ret;
+		struct request_queue *q;
 
-out:
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		/*
+		 * Set RPM status of hba device to RPM_ACTIVE,
+		 * this also clears its runtime error.
+		 */
+		ret = pm_runtime_set_active(hba->dev);
+		/*
+		 * If hba device had runtime error, explicitly resume
+		 * its scsi devices so that block layer can wake up
+		 * those waiting in blk_queue_enter().
+		 */
+		if (!ret) {
+			list_for_each_entry(sdev, &shost->__devices, siblings) {
+				q = sdev->request_queue;
+				if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+					       q->rpm_status == RPM_SUSPENDING))
+					pm_request_resume(q->dev);
+			}
+		}
+		spin_lock_irqsave(hba->host->host_lock, flags);
+	}
+
+	/* If clk_gating is held by pm ops, release it */
+	if (pm_runtime_active(hba->dev) && hba->clk_gating.held_by_pm) {
+		hba->clk_gating.held_by_pm = false;
+		__ufshcd_release(hba);
+	}
+
+	ufshcd_clear_eh_in_progress(hba);
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
 	ufshcd_scsi_unblock_requests(hba);
 	ufshcd_release(hba);
-	pm_runtime_put_sync(hba->dev);
+	if (ufshcd_is_clkscaling_supported(hba))
+		ufshcd_resume_clkscaling(hba);
+	pm_runtime_put_noidle(hba->dev);
 }
 
 /**
@@ -5813,6 +5902,7 @@  static irqreturn_t ufshcd_check_errors(struct ufs_hba *hba)
 			hba->errors, ufshcd_get_upmcrs(hba));
 		ufshcd_update_reg_hist(&hba->ufs_stats.auto_hibern8_err,
 				       hba->errors);
+		ufshcd_set_link_broken(hba);
 		queue_eh_work = true;
 	}
 
@@ -5823,31 +5913,7 @@  static irqreturn_t ufshcd_check_errors(struct ufs_hba *hba)
 		 */
 		hba->saved_err |= hba->errors;
 		hba->saved_uic_err |= hba->uic_error;
-
-		/* handle fatal errors only when link is functional */
-		if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
-			/* block commands from scsi mid-layer */
-			ufshcd_scsi_block_requests(hba);
-
-			hba->ufshcd_state = UFSHCD_STATE_EH_SCHEDULED;
-
-			/* dump controller state before resetting */
-			if (hba->saved_err & (INT_FATAL_ERRORS | UIC_ERROR)) {
-				bool pr_prdt = !!(hba->saved_err &
-						SYSTEM_BUS_FATAL_ERROR);
-
-				dev_err(hba->dev, "%s: saved_err 0x%x saved_uic_err 0x%x\n",
-					__func__, hba->saved_err,
-					hba->saved_uic_err);
-
-				ufshcd_print_host_regs(hba);
-				ufshcd_print_pwr_info(hba);
-				ufshcd_print_tmrs(hba, hba->outstanding_tasks);
-				ufshcd_print_trs(hba, hba->outstanding_reqs,
-							pr_prdt);
-			}
-			schedule_work(&hba->eh_work);
-		}
+		ufshcd_schedule_eh_work(hba);
 		retval |= IRQ_HANDLED;
 	}
 	/*
@@ -5951,6 +6017,8 @@  static irqreturn_t ufshcd_intr(int irq, void *__hba)
 
 	spin_lock(hba->host->host_lock);
 	intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS);
+	hba->ufs_stats.last_intr_status = intr_status;
+	hba->ufs_stats.last_intr_ts = ktime_get();
 
 	/*
 	 * There could be max of hba->nutrs reqs in flight and in worst case
@@ -6589,9 +6657,6 @@  static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
 
 	/* Establish the link again and restore the device */
 	err = ufshcd_probe_hba(hba, false);
-
-	if (!err && (hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL))
-		err = -EIO;
 out:
 	if (err)
 		dev_err(hba->dev, "%s: Host init failed %d\n", __func__, err);
@@ -6610,9 +6675,23 @@  static int ufshcd_host_reset_and_restore(struct ufs_hba *hba)
  */
 static int ufshcd_reset_and_restore(struct ufs_hba *hba)
 {
+	u32 saved_err;
+	u32 saved_uic_err;
 	int err = 0;
+	unsigned long flags;
 	int retries = MAX_HOST_RESET_RETRIES;
 
+	/*
+	 * This is a fresh start, cache and clear saved error first,
+	 * in case new error generated during reset and restore.
+	 */
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	saved_err = hba->saved_err;
+	saved_uic_err = hba->saved_uic_err;
+	hba->saved_err = 0;
+	hba->saved_uic_err = 0;
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
 	do {
 		/* Reset the attached device */
 		ufshcd_vops_device_reset(hba);
@@ -6620,6 +6699,19 @@  static int ufshcd_reset_and_restore(struct ufs_hba *hba)
 		err = ufshcd_host_reset_and_restore(hba);
 	} while (err && --retries);
 
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	/*
+	 * Inform scsi mid-layer that we did reset and allow to handle
+	 * Unit Attention properly.
+	 */
+	scsi_report_bus_reset(hba->host, 0);
+	if (err) {
+		hba->ufshcd_state = UFSHCD_STATE_ERROR;
+		hba->saved_err |= saved_err;
+		hba->saved_uic_err |= saved_uic_err;
+	}
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
+
 	return err;
 }
 
@@ -6631,48 +6723,25 @@  static int ufshcd_reset_and_restore(struct ufs_hba *hba)
  */
 static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
 {
-	int err;
+	int err = SUCCESS;
 	unsigned long flags;
 	struct ufs_hba *hba;
 
 	hba = shost_priv(cmd->device->host);
 
-	ufshcd_hold(hba, false);
-	/*
-	 * Check if there is any race with fatal error handling.
-	 * If so, wait for it to complete. Even though fatal error
-	 * handling does reset and restore in some cases, don't assume
-	 * anything out of it. We are just avoiding race here.
-	 */
-	do {
-		spin_lock_irqsave(hba->host->host_lock, flags);
-		if (!(work_pending(&hba->eh_work) ||
-			    hba->ufshcd_state == UFSHCD_STATE_RESET ||
-			    hba->ufshcd_state == UFSHCD_STATE_EH_SCHEDULED))
-			break;
-		spin_unlock_irqrestore(hba->host->host_lock, flags);
-		dev_dbg(hba->dev, "%s: reset in progress\n", __func__);
-		flush_work(&hba->eh_work);
-	} while (1);
-
-	hba->ufshcd_state = UFSHCD_STATE_RESET;
-	ufshcd_set_eh_in_progress(hba);
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	hba->force_reset = true;
+	ufshcd_schedule_eh_work(hba);
+	dev_err(hba->dev, "%s: reset in progress - 1\n", __func__);
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
 
-	err = ufshcd_reset_and_restore(hba);
+	flush_work(&hba->eh_work);
 
 	spin_lock_irqsave(hba->host->host_lock, flags);
-	if (!err) {
-		err = SUCCESS;
-		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
-	} else {
+	if (hba->ufshcd_state == UFSHCD_STATE_ERROR)
 		err = FAILED;
-		hba->ufshcd_state = UFSHCD_STATE_ERROR;
-	}
-	ufshcd_clear_eh_in_progress(hba);
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
 
-	ufshcd_release(hba);
 	return err;
 }
 
@@ -7393,6 +7462,7 @@  static int ufshcd_add_lus(struct ufs_hba *hba)
 static int ufshcd_probe_hba(struct ufs_hba *hba, bool async)
 {
 	int ret;
+	unsigned long flags;
 	ktime_t start = ktime_get();
 
 	ret = ufshcd_link_startup(hba);
@@ -7458,7 +7528,10 @@  static int ufshcd_probe_hba(struct ufs_hba *hba, bool async)
 	ufshcd_set_active_icc_lvl(hba);
 
 	/* set the state as operational after switching to desired gear */
-	hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (hba->ufshcd_state == UFSHCD_STATE_RESET)
+		hba->ufshcd_state = UFSHCD_STATE_OPERATIONAL;
+	spin_unlock_irqrestore(hba->host->host_lock, flags);
 
 	ufshcd_wb_config(hba);
 	/* Enable Auto-Hibernate if configured */
@@ -8071,10 +8144,13 @@  static int ufshcd_link_state_transition(struct ufs_hba *hba,
 
 	if (req_link_state == UIC_LINK_HIBERN8_STATE) {
 		ret = ufshcd_uic_hibern8_enter(hba);
-		if (!ret)
+		if (!ret) {
 			ufshcd_set_link_hibern8(hba);
-		else
+		} else {
+			dev_err(hba->dev, "%s: hibern8 enter failed %d\n",
+					__func__, ret);
 			goto out;
+		}
 	}
 	/*
 	 * If autobkops is enabled, link can't be turned off because
@@ -8090,8 +8166,11 @@  static int ufshcd_link_state_transition(struct ufs_hba *hba,
 		 * unipro. But putting the link in hibern8 is much faster.
 		 */
 		ret = ufshcd_uic_hibern8_enter(hba);
-		if (ret)
+		if (ret) {
+			dev_err(hba->dev, "%s: hibern8 enter failed %d\n",
+					__func__, ret);
 			goto out;
+		}
 		/*
 		 * Change controller state to "reset state" which
 		 * should also put the link in off/reset state
@@ -8226,6 +8305,7 @@  static int ufshcd_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op)
 	 * just gate the clocks.
 	 */
 	ufshcd_hold(hba, false);
+	hba->clk_gating.held_by_pm = true;
 	hba->clk_gating.is_suspended = true;
 
 	if (hba->clk_scaling.is_allowed) {
@@ -8345,6 +8425,7 @@  static int ufshcd_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op)
 	hba->clk_gating.is_suspended = false;
 	hba->dev_info.b_rpm_dev_flush_capable = false;
 	ufshcd_release(hba);
+	hba->clk_gating.held_by_pm = false;
 out:
 	if (hba->dev_info.b_rpm_dev_flush_capable) {
 		schedule_delayed_work(&hba->rpm_dev_flush_recheck_work,
@@ -8400,10 +8481,13 @@  static int ufshcd_resume(struct ufs_hba *hba, enum ufs_pm_op pm_op)
 
 	if (ufshcd_is_link_hibern8(hba)) {
 		ret = ufshcd_uic_hibern8_exit(hba);
-		if (!ret)
+		if (!ret) {
 			ufshcd_set_link_active(hba);
-		else
+		} else {
+			dev_err(hba->dev, "%s: hibern8 exit failed %d\n",
+					__func__, ret);
 			goto vendor_suspend;
+		}
 	} else if (ufshcd_is_link_off(hba)) {
 		/*
 		 * A full initialization of the host and the device is
@@ -8448,6 +8532,7 @@  static int ufshcd_resume(struct ufs_hba *hba, enum ufs_pm_op pm_op)
 
 	/* Schedule clock gating in case of no access to UFS device yet */
 	ufshcd_release(hba);
+	hba->clk_gating.held_by_pm = false;
 
 	goto out;
 
@@ -8777,6 +8862,7 @@  int ufshcd_init(struct ufs_hba *hba, void __iomem *mmio_base, unsigned int irq)
 	int err;
 	struct Scsi_Host *host = hba->host;
 	struct device *dev = hba->dev;
+	char eh_wq_name[sizeof("ufs_eh_wq_00")];
 
 	if (!mmio_base) {
 		dev_err(hba->dev,
@@ -8838,6 +8924,15 @@  int ufshcd_init(struct ufs_hba *hba, void __iomem *mmio_base, unsigned int irq)
 	hba->max_pwr_info.is_valid = false;
 
 	/* Initialize work queues */
+	snprintf(eh_wq_name, sizeof(eh_wq_name), "ufs_eh_wq_%d",
+		 hba->host->host_no);
+	hba->eh_wq = create_singlethread_workqueue(eh_wq_name);
+	if (!hba->eh_wq) {
+		dev_err(hba->dev, "%s: failed to create eh workqueue\n",
+				__func__);
+		err = -ENOMEM;
+		goto out_disable;
+	}
 	INIT_WORK(&hba->eh_work, ufshcd_err_handler);
 	INIT_WORK(&hba->eeh_work, ufshcd_exception_event_handler);
 
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 656c069..585e58b 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -90,6 +90,7 @@  enum uic_link_state {
 	UIC_LINK_OFF_STATE	= 0, /* Link powered down or disabled */
 	UIC_LINK_ACTIVE_STATE	= 1, /* Link is in Fast/Slow/Sleep state */
 	UIC_LINK_HIBERN8_STATE	= 2, /* Link is in Hibernate state */
+	UIC_LINK_BROKEN_STATE	= 3, /* Link is in broken state */
 };
 
 #define ufshcd_is_link_off(hba) ((hba)->uic_link_state == UIC_LINK_OFF_STATE)
@@ -97,11 +98,15 @@  enum uic_link_state {
 				    UIC_LINK_ACTIVE_STATE)
 #define ufshcd_is_link_hibern8(hba) ((hba)->uic_link_state == \
 				    UIC_LINK_HIBERN8_STATE)
+#define ufshcd_is_link_broken(hba) ((hba)->uic_link_state == \
+				   UIC_LINK_BROKEN_STATE)
 #define ufshcd_set_link_off(hba) ((hba)->uic_link_state = UIC_LINK_OFF_STATE)
 #define ufshcd_set_link_active(hba) ((hba)->uic_link_state = \
 				    UIC_LINK_ACTIVE_STATE)
 #define ufshcd_set_link_hibern8(hba) ((hba)->uic_link_state = \
 				    UIC_LINK_HIBERN8_STATE)
+#define ufshcd_set_link_broken(hba) ((hba)->uic_link_state = \
+				    UIC_LINK_BROKEN_STATE)
 
 #define ufshcd_set_ufs_dev_active(h) \
 	((h)->curr_dev_pwr_mode = UFS_ACTIVE_PWR_MODE)
@@ -349,6 +354,7 @@  struct ufs_clk_gating {
 	struct device_attribute delay_attr;
 	struct device_attribute enable_attr;
 	bool is_enabled;
+	bool held_by_pm;
 	int active_reqs;
 	struct workqueue_struct *clk_gating_workq;
 };
@@ -406,6 +412,8 @@  struct ufs_err_reg_hist {
 
 /**
  * struct ufs_stats - keeps usage/err statistics
+ * @last_intr_status: record the last interrupt status.
+ * @last_intr_ts: record the last interrupt timestamp.
  * @hibern8_exit_cnt: Counter to keep track of number of exits,
  *		reset this after link-startup.
  * @last_hibern8_exit_tstamp: Set time after the hibern8 exit.
@@ -425,6 +433,9 @@  struct ufs_err_reg_hist {
  * @tsk_abort: tracks task abort events
  */
 struct ufs_stats {
+	u32 last_intr_status;
+	ktime_t last_intr_ts;
+
 	u32 hibern8_exit_cnt;
 	ktime_t last_hibern8_exit_tstamp;
 
@@ -608,12 +619,14 @@  struct ufs_hba_variant_params {
  * @intr_mask: Interrupt Mask Bits
  * @ee_ctrl_mask: Exception event control mask
  * @is_powered: flag to check if HBA is powered
+ * @eh_wq: Workqueue that eh_work works on
  * @eh_work: Worker to handle UFS errors that require s/w attention
  * @eeh_work: Worker to handle exception events
  * @errors: HBA errors
  * @uic_error: UFS interconnect layer error status
  * @saved_err: sticky error mask
  * @saved_uic_err: sticky UIC error mask
+ * @force_reset: flag to force eh_work perform a full reset
  * @silence_err_logs: flag to silence error logs
  * @dev_cmd: ufs device management command information
  * @last_dme_cmd_tstamp: time stamp of the last completed DME command
@@ -702,6 +715,7 @@  struct ufs_hba {
 	bool is_powered;
 
 	/* Work Queues */
+	struct workqueue_struct *eh_wq;
 	struct work_struct eh_work;
 	struct work_struct eeh_work;
 
@@ -711,6 +725,7 @@  struct ufs_hba {
 	u32 saved_err;
 	u32 saved_uic_err;
 	struct ufs_stats ufs_stats;
+	bool force_reset;
 	bool silence_err_logs;
 
 	/* Device management request data */