Message ID | 20191001104949.42810-1-maier@linux.ibm.com (mailing list archive) |
---|---|
State | Mainlined |
Commit | 2190168aaea42c31bff7b9a967e7b045f07df095 |
Series | [v2] zfcp: fix reaction on bit error theshold notification with adapter close |

On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
> On excessive bit errors for the FCP channel ingress fibre path, the channel
> notifies us. Previously, we only emitted a kernel message and a trace record.
> Since performance can become suboptimal with I/O timeouts due to
> bit errors, we now stop using an FCP device by default on channel
> notification so multipath on top can timely failover to other paths.
> A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.

Ugh, module parameters? This isn't the 1990's anymore :(

Why not just make this a dynamic sysfs variable, that way you properly
can set this on whatever device you want, not just "all or nothing"?

thanks,

greg k-h

On 10/1/19 4:14 PM, Greg KH wrote:
> On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
>> On excessive bit errors for the FCP channel ingress fibre path, the channel
>> notifies us. Previously, we only emitted a kernel message and a trace record.
>> Since performance can become suboptimal with I/O timeouts due to
>> bit errors, we now stop using an FCP device by default on channel
>> notification so multipath on top can timely failover to other paths.
>> A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
>
> Ugh, module parameters? This isn't the 1990's anymore :(
>
> Why not just make this a dynamic sysfs variable, that way you properly
> can set this on whatever device you want, not just "all or nothing"?

Since we can see many more (virtual) FCP devices than we want to actually
use, we defer probing. It means, we only start allocating structures and
sysfs entries on setting an FCP "online" for the first time. Setting online
works through another sysfs attribute owned by our ccw bus code component
called "cio". IIRC, setting online does not emit a uevent. On setting
online, the (add) uevent of hot-/coldplug of an FCP device had already
happened, so we could not easily have end users craft udev rules to
automatically/persistently configure a new sysfs attribute (which is
FCP-device-specific and appears late) to disable the new code behavior.

Not sure if that could ever become a problem for end users: Even if we were
to write into a new sysfs attribute, the attribute only appears during
setting online so this might race with starting to actually use the FCP
device with the new default behavior and could potentially disable I/O paths
before the sysfs attribute write could become effective to disable the new
behavior.

On Tue, Oct 01, 2019 at 05:07:50PM +0200, Steffen Maier wrote:
> On 10/1/19 4:14 PM, Greg KH wrote:
> > On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
> > > On excessive bit errors for the FCP channel ingress fibre path, the channel
> > > notifies us. Previously, we only emitted a kernel message and a trace record.
> > > Since performance can become suboptimal with I/O timeouts due to
> > > bit errors, we now stop using an FCP device by default on channel
> > > notification so multipath on top can timely failover to other paths.
> > > A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
> >
> > Ugh, module parameters? This isn't the 1990's anymore :(
> >
> > Why not just make this a dynamic sysfs variable, that way you properly
> > can set this on whatever device you want, not just "all or nothing"?
>
> Since we can see many more (virtual) FCP devices than we want to actually
> use, we defer probing. It means, we only start allocating structures and
> sysfs entries on setting an FCP "online" for the first time. Setting online
> works through another sysfs attribute owned by our ccw bus code component
> called "cio". IIRC, setting online does not emit a uevent. On setting
> online, the (add) uevent of hot-/coldplug of an FCP device had already
> happened, so we could not easily have end users craft udev rules to
> automatically/persistently configure a new sysfs attribute (which is
> FCP-device-specific and appears late) to disable the new code behavior.
>
> Not sure if that could ever become a problem for end users: Even if we were
> to write into a new sysfs attribute, the attribute only appears during
> setting online so this might race with starting to actually use the FCP
> device with the new default behavior and could potentially disable I/O paths
> before the sysfs attribute write could become effective to disable the new
> behavior.

Ok, then why make this a module option that you will have to support for
the next 20+ years anyway if you feel this fix is the correct way that
it should be done instead?

module options are tough to manage and support, only add them as a very
last thing, when all other options have been ruled out.

thanks,

greg k-h

Greg,

> Ok, then why make this a module option that you will have to support
> for the next 20+ years anyway if you feel this fix is the correct way
> that it should be done instead?

I agree.

Why not just shut FCP down unconditionally on excessive bit errors?
What's the benefit of allowing things to continue? Are you hoping things
will eventually recover in a single-path scenario?

On 10/1/19 8:26 PM, Martin K. Petersen wrote:
>> Ok, then why make this a module option that you will have to support
>> for the next 20+ years anyway if you feel this fix is the correct way
>> that it should be done instead?
>
> I agree.
>
> Why not just shut FCP down unconditionally on excessive bit errors?
> What's the benefit of allowing things to continue? Are you hoping things
> will eventually recover in a single-path scenario?

Experience told me that there will be an unforeseen end user scenario
where I need a quick switch to let even shaky paths survive.

Steffen,

>> Why not just shut FCP down unconditionally on excessive bit errors?
>> What's the benefit of allowing things to continue? Are you hoping things
>> will eventually recover in a single-path scenario?
>
> Experience told me that there will be an unforeseen end user scenario
> where I need a quick switch to let even shaky paths survive.

Can't say I like it. But it's your driver.

Applied to 5.4/scsi-fixes. Thanks!

diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index e31c6b47af97..1e279220f073 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -29,6 +29,11 @@
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -238,10 +243,15 @@ static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
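
For readers who want to trace the new branch without a kernel tree, the following is a minimal userspace sketch of the decision the patch adds to the FSF_STATUS_READ_BIT_ERROR_THRESHOLD case. It is not driver code: `struct fake_adapter`, `enum adapter_state`, and `handle_bit_error_threshold()` are hypothetical stand-ins invented for illustration, and setting a state field merely models the call to zfcp_erp_adapter_shutdown(). Only the `ber_stop` flag and the two warning strings are taken from the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for driver state; these are NOT the real zfcp types. */
enum adapter_state { ADAPTER_RUNNING, ADAPTER_SHUT_DOWN };

struct fake_adapter {
	enum adapter_state state;
	const char *last_warning;
};

/* Mirrors the ber_stop module parameter from the patch (default on). */
static bool ber_stop = true;

/*
 * Models the patched FSF_STATUS_READ_BIT_ERROR_THRESHOLD case.
 * In the real handler a trace record is written first in both branches;
 * that step is omitted here.
 */
static void handle_bit_error_threshold(struct fake_adapter *adapter)
{
	if (ber_stop) {
		adapter->last_warning =
			"All paths over this FCP device are disused because of excessive bit errors";
		/* Assumption: stands in for zfcp_erp_adapter_shutdown(). */
		adapter->state = ADAPTER_SHUT_DOWN;
	} else {
		/* Old behavior: warn only, keep using the adapter. */
		adapter->last_warning =
			"The error threshold for checksum statistics has been exceeded";
	}
	fprintf(stderr, "warn: %s\n", adapter->last_warning);
}
```

With `ber_stop` left at its default, one notification moves the modeled adapter to the shut-down state so that multipath can fail over; clearing the flag keeps the adapter running and emits only the original warning, which is the "quick switch" discussed in the thread.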