
zfcp: fix reaction on bit error threshold notification with adapter close

Message ID 20190924160616.15301-1-maier@linux.ibm.com (mailing list archive)
State Superseded
Series zfcp: fix reaction on bit error threshold notification with adapter close

Commit Message

Steffen Maier Sept. 24, 2019, 4:06 p.m. UTC
Kernel message explanation:

 * Description:
 * The FCP channel reported that its bit error threshold has been exceeded.
 * These errors might result from a problem with the physical components
 * of the local fibre link into the FCP channel.
 * The problem might be damage or malfunction of the cable or
 * cable connection between the FCP channel and
 * the adjacent fabric switch port or the point-to-point peer.
 * Find details about the errors in the HBA trace for the FCP device.
 * The zfcp device driver closed down the FCP device
 * to limit the performance impact from possible I/O command timeouts.
 * User action:
 * Check for problems on the local fibre link, ensure that fibre optics are
 * clean and functional, and all cables are properly plugged.
 * After the repair action, you can manually recover the FCP device by
 * writing "0" into its "failed" sysfs attribute.
 * If recovery through sysfs is not possible, set the CHPID of the device
 * offline and back online on the service element.
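
The manual recovery step described above (writing "0" into the "failed" sysfs attribute) could be sketched in userspace C roughly as follows. This is a minimal illustration, not part of the patch: the helper name `write_failed_attr` is hypothetical, and the attribute path is deliberately left to the caller because the message above does not spell it out.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "0" followed by a newline into the attribute
 * file at `path`, similar to what a shell `echo 0 > path` would do.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int write_failed_attr(const char *path)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs("0\n", f) >= 0;

	/* stdio buffers the data, so a write error may only surface on close. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would pass the full path of the FCP device's "failed" attribute and, if the helper returns -1, fall back to the CHPID offline/online procedure mentioned above.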

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org> #2.6.30+
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
---

Martin, James,

an important zfcp fix for v5.4-rc.
It applies to Martin's 5.4/scsi-fixes or to James' fixes branch.


 drivers/s390/scsi/zfcp_fsf.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Comments

Steffen Maier Sept. 26, 2019, 11 a.m. UTC | #1
On 9/26/19 12:43 AM, Sasha Levin wrote:
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 1da177e4c3f4 Linux-2.6.12-rc2.
> 
> The bot has tested the following trees: v5.3.1, v5.2.17, v4.19.75, v4.14.146, v4.9.194, v4.4.194.
> 
> v5.3.1: Build OK!
> v5.2.17: Build OK!
> v4.19.75: Build OK!
> v4.14.146: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
> 
> v4.9.194: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
>      bc46427e807e ("scsi: zfcp: use setup_timer instead of init_timer")
> 
> v4.4.194: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
>      bc46427e807e ("scsi: zfcp: use setup_timer instead of init_timer")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?

It's sufficient to have the fix in those more recent stable trees where it
applies (and builds). My Fixes tag formally indicates how far back the code
has been broken, but I don't expect all stable or longterm kernels to get the
fix. If I find out we need the fix in a kernel where it does not apply,
I'll send a backport to stable when the time is right.


Showing the possible dependencies is awesome!

Martin K. Petersen Oct. 1, 2019, 3:49 a.m. UTC | #2
Steffen,

> Kernel message explanation:
>
>  * Description:
>  * The FCP channel reported that its bit error threshold has been exceeded.
>  * These errors might result from a problem with the physical components
>  * of the local fibre link into the FCP channel.
>  * The problem might be damage or malfunction of the cable or
>  * cable connection between the FCP channel and
>  * the adjacent fabric switch port or the point-to-point peer.
>  * Find details about the errors in the HBA trace for the FCP device.
>  * The zfcp device driver closed down the FCP device
>  * to limit the performance impact from possible I/O command timeouts.
>  * User action:
>  * Check for problems on the local fibre link, ensure that fibre optics are
>  * clean and functional, and all cables are properly plugged.
>  * After the repair action, you can manually recover the FCP device by
>  * writing "0" into its "failed" sysfs attribute.
>  * If recovery through sysfs is not possible, set the CHPID of the device
>  * offline and back online on the service element.

This commentary does not read like a patch description. It makes no
mention of the actual kernel changes and the introduced module
parameter.

> +static bool ber_stop = true;
> +module_param(ber_stop, bool, 0600);
> +MODULE_PARM_DESC(ber_stop,
> +		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
> +

Patch

diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index 296bbc3c4606..cf63916814cc 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -27,6 +27,11 @@ 
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -236,10 +241,15 @@  static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
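
To make the behavioural change in the hunk above easier to follow, here is a small userspace model of the new branch. It is only a sketch under stated assumptions: `enum status_type`, `struct fake_adapter`, and `handle_status` are hypothetical stand-ins, not the real zfcp types or functions, and setting the `shut_down` flag merely marks the spot where the patch calls `zfcp_erp_adapter_shutdown()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's status codes (not the FSF values). */
enum status_type {
	STATUS_SENSE_DATA_AVAIL,
	STATUS_BIT_ERROR_THRESHOLD,
	STATUS_LINK_DOWN,
};

/* Hypothetical adapter model that records what the handler decided. */
struct fake_adapter {
	bool shut_down;           /* true where the patch would shut the adapter down */
	const char *last_warning; /* text the patch would pass to dev_warn() */
};

/* Mirrors the module parameter added by the patch; defaults to on. */
static bool ber_stop = true;

static void handle_status(struct fake_adapter *a, enum status_type t)
{
	switch (t) {
	case STATUS_BIT_ERROR_THRESHOLD:
		if (ber_stop) {
			a->last_warning = "All paths over this FCP device are disused because of excessive bit errors";
			a->shut_down = true;
		} else {
			/* Previous behaviour: warn only, keep the adapter open. */
			a->last_warning = "The error threshold for checksum statistics has been exceeded";
		}
		break;
	default:
		/* Other status types are outside the scope of this sketch. */
		break;
	}
}
```

With `ber_stop` left at its default, a bit error threshold notification marks the model adapter as shut down. With `ber_stop` cleared, only the old warning text is recorded.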