diff mbox

[v4,25/32] cxlflash: Fix to prevent EEH recovery failure

Message ID 1443223134-9886-1-git-send-email-mrochs@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Matthew R. Ochs Sept. 25, 2015, 11:18 p.m. UTC
The process_sense() routine can perform a read capacity which
can take some time to complete. If an EEH occurs while waiting
on the read capacity, the EEH handler is unable to obtain the
context's mutex in order to put the context in an error state.
The EEH handler will sit and wait until the context is free,
but this wait can last longer than the EEH handler tolerates,
leading to a failed recovery.

To address this issue, make the context unavailable to new,
non-system owned threads and release the context while calling
into process_sense(). After returning from process_sense() the
context mutex is reacquired and the context is made available
again. The context can be safely moved to the error state if
needed during the unavailable window as no other threads will
hold its reference.

Signed-off-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
Signed-off-by: Manoj N. Kumar <manoj@linux.vnet.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/scsi/cxlflash/superpipe.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Daniel Axtens Sept. 29, 2015, 1:25 a.m. UTC | #1
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

"Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes:


> The process_sense() routine can perform a read capacity which
> can take some time to complete. If an EEH occurs while waiting
> on the read capacity, the EEH handler is unable to obtain the
> context's mutex in order to put the context in an error state.
> The EEH handler will sit and wait until the context is free,
> but this wait can last longer than the EEH handler tolerates,
> leading to a failed recovery.

I'm not quite clear on what you mean by the EEH handler timing
out. AFAIK there's nothing in eehd and the EEH core that times out if a
driver doesn't respond - indeed, it's pretty easy to hang eehd with a
misbehaving driver.

Are you referring to your own internal timeouts?
cxlflash_wait_for_pci_err_recovery and anything else that uses
CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?

Regards,
Daniel

>
> To address this issue, make the context unavailable to new,
> non-system owned threads and release the context while calling
> into process_sense(). After returning from process_sense() the
> context mutex is reacquired and the context is made available
> again. The context can be safely moved to the error state if
> needed during the unavailable window as no other threads will
> hold its reference.
>
> Signed-off-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
> Signed-off-by: Manoj N. Kumar <manoj@linux.vnet.ibm.com>
> Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
> ---
>  drivers/scsi/cxlflash/superpipe.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c
> index a6316f5..7283e83 100644
> --- a/drivers/scsi/cxlflash/superpipe.c
> +++ b/drivers/scsi/cxlflash/superpipe.c
> @@ -1787,12 +1787,21 @@ static int cxlflash_disk_verify(struct scsi_device *sdev,
>  	 * inquiry (i.e. the Unit attention is due to the WWN changing).
>  	 */
>  	if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) {
> +		/* Can't hold mutex across process_sense/read_cap16,
> +		 * since we could have an intervening EEH event.
> +		 */
> +		ctxi->unavail = true;
> +		mutex_unlock(&ctxi->mutex);
>  		rc = process_sense(sdev, verify);
>  		if (unlikely(rc)) {
>  			dev_err(dev, "%s: Failed to validate sense data (%d)\n",
>  				__func__, rc);
> +			mutex_lock(&ctxi->mutex);
> +			ctxi->unavail = false;
>  			goto out;
>  		}
> +		mutex_lock(&ctxi->mutex);
> +		ctxi->unavail = false;
>  	}
>  
>  	switch (gli->mode) {
> -- 
> 2.1.0
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - https://gpgtools.org

iQIcBAEBCgAGBQJWCeieAAoJEPC3R3P2I92F+hMP/1OdLQCin+kKbOb9qxf952bH
DUAkmEhc0oD7xZFQI8HgDmHRxpes5HHxXtwXFsLgsr8QYG+aOIV568GXIZtTbrl0
aCFMqtKXZ6jVqv5L60r1tgzcWxmWdshMLd1op6t3BwA67nUc5Edcr94ePUyDDLj1
at335wCnxuGxn0kdB0Ud/lbPzTsgDPcuV6tCLy0o4J15KFOyFt9hCjO4nmL/wcIt
kmjyn5SHbdgje+73uaRQnXkli4wDA9x7x6/8wFgLspnOxgMEJgnHmm+HYbOXnHyX
nFFHw9+X2ETUcucVWuKNaFzW1vH+WJDteEZbjS7t7liJIkmIiZSFHyUTtVGdBkl1
FsWswA0pkzuGq94Wsb0nGtNHbsMw+WeWTcTlNN46DMG/wqz75iO3yMGK9MZuddSX
9jUokiM0kQvvfwAoujmvpMCVB4b2oseRRG4/yJ0lKSCcC8kETQTXgVHbT8oLmCdk
rUA0hxbbKzVQsDzw8s5HqYZjqHdLp3sDPeyukPeJl2CNhysrmnyHXpq8XgcLi3op
kbuuiR3z8UH3MW4BDpplnjhZ+5Wyw9cSI57vRF2Kr80NnU+5hBvftNh4rBneeny2
0gCDlPHDvB7Ks9HkcxkK9MW78FTgj50ePofS/dUUod4M9ohDd4MSwRKjpwQ+H3By
jmxnzfvWO/oTlL1D9+2W
=3BcU
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew R. Ochs Sept. 29, 2015, 8:11 p.m. UTC | #2
> On Sep 28, 2015, at 8:25 PM, Daniel Axtens <dja@axtens.net> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
> 
> "Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes:
> 
> 
>> The process_sense() routine can perform a read capacity which
>> can take some time to complete. If an EEH occurs while waiting
>> on the read capacity, the EEH handler is unable to obtain the
>> context's mutex in order to put the context in an error state.
>> The EEH handler will sit and wait until the context is free,
>> but this wait can last longer than the EEH handler tolerates,
>> leading to a failed recovery.
> 
> I'm not quite clear on what you mean by the EEH handler timing
> out. AFAIK there's nothing in eehd and the EEH core that times out if a
> driver doesn't respond - indeed, it's pretty easy to hang eehd with a
> misbehaving driver.
> 
> Are you referring to your own internal timeouts?
> cxlflash_wait_for_pci_err_recovery and anything else that uses
> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?

Reading through this again I can see how this is misleading. This is
actually similar and related to the deadlock scenario described in
"Fix to avoid potential deadlock on EEH". Without this fix, you'd end
up in a similar situation but deadlocked on the context mutex instead
of the ioctl semaphore.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Axtens Sept. 30, 2015, 11:53 p.m. UTC | #3
"Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes:

>>> The process_sense() routine can perform a read capacity which
>>> can take some time to complete. If an EEH occurs while waiting
>>> on the read capacity, the EEH handler is unable to obtain the
>>> context's mutex in order to put the context in an error state.
>>> The EEH handler will sit and wait until the context is free,
>>> but this wait can last longer than the EEH handler tolerates,
>>> leading to a failed recovery.
>> 
>> I'm not quite clear on what you mean by the EEH handler timing
>> out. AFAIK there's nothing in eehd and the EEH core that times out if a
>> driver doesn't respond - indeed, it's pretty easy to hang eehd with a
>> misbehaving driver.
>> 
>> Are you referring to your own internal timeouts?
>> cxlflash_wait_for_pci_err_recovery and anything else that uses
>> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?
>
> Reading through this again I can see how this is misleading. This is
> actually similar and related to the deadlock scenario described in
> "Fix to avoid potential deadlock on EEH". Without this fix, you'd end
> up in a similar situation but deadlocked on the context mutex instead
> of the ioctl semaphore.

That makes _much_ more sense. If you could please revise the commit
message to explain that, you can include this in the next version:
Reviewed-by: Daniel Axtens <dja@axtens.net>

Regards,
Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c
index a6316f5..7283e83 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -1787,12 +1787,21 @@  static int cxlflash_disk_verify(struct scsi_device *sdev,
 	 * inquiry (i.e. the Unit attention is due to the WWN changing).
 	 */
 	if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) {
+		/* Can't hold mutex across process_sense/read_cap16,
+		 * since we could have an intervening EEH event.
+		 */
+		ctxi->unavail = true;
+		mutex_unlock(&ctxi->mutex);
 		rc = process_sense(sdev, verify);
 		if (unlikely(rc)) {
 			dev_err(dev, "%s: Failed to validate sense data (%d)\n",
 				__func__, rc);
+			mutex_lock(&ctxi->mutex);
+			ctxi->unavail = false;
 			goto out;
 		}
+		mutex_lock(&ctxi->mutex);
+		ctxi->unavail = false;
 	}
 
 	switch (gli->mode) {