Message ID | 20190923060122.GA9603@machine1 (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | scsi: core: Log SCSI command age with errors | expand |
On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote: > Couple of users had requested to print the SCSI command age along > with command failure errors. This is a small change, but allows > users to get more important information about the command that was > failed, it would help the users in debugging the command failures: > > Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> > --- > diff --git a/drivers/scsi/scsi_logging.c > b/drivers/scsi/scsi_logging.c > index ecc5918e372a..ca2182bc53c6 100644 > --- a/drivers/scsi/scsi_logging.c > +++ b/drivers/scsi/scsi_logging.c > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd > *cmd, const char *msg, > const char *mlret_string = scsi_mlreturn_string(disposition); > const char *hb_string = scsi_hostbyte_string(cmd->result); > const char *db_string = scsi_driverbyte_string(cmd->result); > + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ; > > logbuf = scsi_log_reserve_buffer(&logbuf_len); > if (!logbuf) > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd > *cmd, const char *msg, > > if (db_string) > off += scnprintf(logbuf + off, logbuf_len - off, > - "driverbyte=%s", db_string); > + "driverbyte=%s ", db_string); > else > off += scnprintf(logbuf + off, logbuf_len - off, > - "driverbyte=0x%02x", driver_byte(cmd- > >result)); > + "driverbyte=0x%02x ", > + driver_byte(cmd->result)); > + > + off += scnprintf(logbuf + off, logbuf_len - off, > + "cmd-age=%lus", cmd_age); > + > out_printk: > dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf); > scsi_log_release_buffer(logbuf); > This looks to be a useful debug addition to me, and the code looks correct. I believe this has also been tested by Milan in our lab. Reviewed-by: Laurence Oberman <loberman@redhat.com>
On 9/23/19 6:32 PM, Laurence Oberman wrote: > On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote: >> Couple of users had requested to print the SCSI command age along >> with command failure errors. This is a small change, but allows >> users to get more important information about the command that was >> failed, it would help the users in debugging the command failures: >> >> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> >> --- >> diff --git a/drivers/scsi/scsi_logging.c >> b/drivers/scsi/scsi_logging.c >> index ecc5918e372a..ca2182bc53c6 100644 >> --- a/drivers/scsi/scsi_logging.c >> +++ b/drivers/scsi/scsi_logging.c >> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd >> *cmd, const char *msg, >> const char *mlret_string = scsi_mlreturn_string(disposition); >> const char *hb_string = scsi_hostbyte_string(cmd->result); >> const char *db_string = scsi_driverbyte_string(cmd->result); >> + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ; >> >> logbuf = scsi_log_reserve_buffer(&logbuf_len); >> if (!logbuf) >> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd >> *cmd, const char *msg, >> >> if (db_string) >> off += scnprintf(logbuf + off, logbuf_len - off, >> - "driverbyte=%s", db_string); >> + "driverbyte=%s ", db_string); >> else >> off += scnprintf(logbuf + off, logbuf_len - off, >> - "driverbyte=0x%02x", driver_byte(cmd- >>> result)); >> + "driverbyte=0x%02x ", >> + driver_byte(cmd->result)); >> + >> + off += scnprintf(logbuf + off, logbuf_len - off, >> + "cmd-age=%lus", cmd_age); >> + >> out_printk: >> dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf); >> scsi_log_release_buffer(logbuf); >> > > This looks to be a useful debug addition to me, and the code looks > correct. > I believe this has also been tested by Milan in our lab. > > Reviewed-by: Laurence Oberman <loberman@redhat.com> > Yes, the patch was tested locally using scsi_debug as well as in real storage issues caused by bad disks in customer environment. Thanks, Milan.
On 9/22/19 11:01 PM, Milan P. Gandhi wrote: > + off += scnprintf(logbuf + off, logbuf_len - off, > + "cmd-age=%lus", cmd_age); Have you considered to change cmd-age into cmd_age? I'm afraid otherwise someone might interpret the hyphen as a subtraction sign... Thanks, Bart.
On 9/25/19 9:15 PM, Bart Van Assche wrote: > On 9/22/19 11:01 PM, Milan P. Gandhi wrote: >> + off += scnprintf(logbuf + off, logbuf_len - off, >> + "cmd-age=%lus", cmd_age); > > Have you considered to change cmd-age into cmd_age? I'm afraid otherwise someone might interpret the hyphen as a subtraction sign... Thanks for your suggestion Bart. Yes, it would be better to use cmd_age to avoid any confusion. I will change it and send the updated patch. Thanks, Milan.
On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote: > Couple of users had requested to print the SCSI command age along > with command failure errors. This is a small change, but allows > users to get more important information about the command that was > failed, it would help the users in debugging the command failures: > > Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> > --- > diff --git a/drivers/scsi/scsi_logging.c > b/drivers/scsi/scsi_logging.c > index ecc5918e372a..ca2182bc53c6 100644 > --- a/drivers/scsi/scsi_logging.c > +++ b/drivers/scsi/scsi_logging.c > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd > *cmd, const char *msg, > const char *mlret_string = scsi_mlreturn_string(disposition); > const char *hb_string = scsi_hostbyte_string(cmd->result); > const char *db_string = scsi_driverbyte_string(cmd->result); > + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ; This comes down to pretty coarse time resolution, does it not? More often than not, the time difference shown will be 0. I'd recommend at least millisecond resolution. > > logbuf = scsi_log_reserve_buffer(&logbuf_len); > if (!logbuf) > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd > *cmd, const char *msg, > > if (db_string) > off += scnprintf(logbuf + off, logbuf_len - off, > - "driverbyte=%s", db_string); > + "driverbyte=%s ", db_string); > else > off += scnprintf(logbuf + off, logbuf_len - off, > - "driverbyte=0x%02x", driver_byte(cmd- > >result)); > + "driverbyte=0x%02x ", > + driver_byte(cmd->result)); > + > + off += scnprintf(logbuf + off, logbuf_len - off, > + "cmd-age=%lus", cmd_age); This is certainly helpful in some situations. Yet I am unsure if it should *always* be printed. I wouldn't say it's as important as the other stuff scsi_print_result() provides. After all, by activating MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better accuracy from currently available logs. So perhaps make this depend on a module parameter? Also, we should carefully ponder if we want to put this on the same line as the driver byte, as users may have created scripts for parsing this output. Regards, Martin
On Fri, 2019-09-27 at 17:26 +0200, Martin Wilck wrote: > On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote: > > Couple of users had requested to print the SCSI command age along > > with command failure errors. This is a small change, but allows > > users to get more important information about the command that was > > failed, it would help the users in debugging the command failures: > > > > Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> > > --- > > diff --git a/drivers/scsi/scsi_logging.c > > b/drivers/scsi/scsi_logging.c > > index ecc5918e372a..ca2182bc53c6 100644 > > --- a/drivers/scsi/scsi_logging.c > > +++ b/drivers/scsi/scsi_logging.c > > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd > > *cmd, const char *msg, > > const char *mlret_string = scsi_mlreturn_string(disposition); > > const char *hb_string = scsi_hostbyte_string(cmd->result); > > const char *db_string = scsi_driverbyte_string(cmd->result); > > + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ; > > This comes down to pretty coarse time resolution, does it not? More > often than not, the time difference shown will be 0. I'd recommend at > least millisecond resolution. > > > > > logbuf = scsi_log_reserve_buffer(&logbuf_len); > > if (!logbuf) > > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd > > *cmd, const char *msg, > > > > if (db_string) > > off += scnprintf(logbuf + off, logbuf_len - off, > > - "driverbyte=%s", db_string); > > + "driverbyte=%s ", db_string); > > else > > off += scnprintf(logbuf + off, logbuf_len - off, > > - "driverbyte=0x%02x", driver_byte(cmd- > > > result)); > > > > + "driverbyte=0x%02x ", > > + driver_byte(cmd->result)); > > + > > + off += scnprintf(logbuf + off, logbuf_len - off, > > + "cmd-age=%lus", cmd_age); > > This is certainly helpful in some situations. Yet I am unsure if it > should *always* be printed. I wouldn't say it's as important as the > other stuff scsi_print_result() provides. After all, by activating > MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better > accuracy from currently available logs. > > So perhaps make this depend on a module parameter? > > Also, we should carefully ponder if we want to put this on the same > line as the driver byte, as users may have created scripts for > parsing > this output. > > Regards, > Martin > > Hi Martin Agreed about log extraction, but turning that on with a busy workload in a production environment is not practical. We cant do it with systems with 1000's of luns and 1000's of IOPS/sec. Also second resolution is good enough for the debug we want to see. Regards Laurence
On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote: > > Hi Martin > > Agreed about log extraction, but turning that on with a busy workload > in a production environment is not practical. We cant do it with > systems with 1000's of luns and 1000's of IOPS/sec. > Also second resolution is good enough for the debug we want to see. I gather that you look at a specific problem where second resolution is sufficient. For upstream, the generic usefulness should be considered, and I don't think we can say today that better-than-second resolution will never be useful, so I still vote for milliseconds. Wrt the enablement of the option on highly loaded systems, I'm not sure I understand. You need to enable SCSI logging anyway, don't you? Is it an issue to have to set 2 sysfs values rather than just one? Regards, Martin > > Regards > Laurence >
On 9/30/19 2:12 PM, Martin Wilck wrote: > On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote: >> >> Hi Martin >> >> Agreed about log extraction, but turning that on with a busy workload >> in a production environment is not practical. We cant do it with >> systems with 1000's of luns and 1000's of IOPS/sec. >> Also second resolution is good enough for the debug we want to see. > > I gather that you look at a specific problem where second resolution is > sufficient. For upstream, the generic usefulness should be considered, > and I don't think we can say today that better-than-second resolution > will never be useful, so I still vote for milliseconds. Ok, I will change it to ms. > Wrt the enablement of the option on highly loaded systems, I'm not sure > I understand. You need to enable SCSI logging anyway, don't you? By default we keep the SCSI debug logging disabled or am I missing something? >Is it an issue to have to set 2 sysfs values rather than just one? The idea here is to capture the above debug data even without any user interventions to change any sysfs entries or to enable debug logging on busy, critical production systems. Also, we are not changing the existing text in SCSI command error log, but we are only adding one single word at the end of message. Ideally the user scripts are written to grep specific pattern from the logs. Since we are not replacing any existing text from the logs, the scripts should still work with this change as well. Thanks, Milan.
Hello Milan, On Mon, 2019-09-30 at 14:35 +0530, Milan P. Gandhi wrote: > On 9/30/19 2:12 PM, Martin Wilck wrote: > > Wrt the enablement of the option on highly loaded systems, I'm not > > sure > > I understand. You need to enable SCSI logging anyway, don't you? > > By default we keep the SCSI debug logging disabled or am I missing > something? > > > Is it an issue to have to set 2 sysfs values rather than just one? > > The idea here is to capture the above debug data even without > any user interventions to change any sysfs entries or to enable > debug logging on busy, critical production systems. So, you're looking at the scsi_io_completion() code path. In my experience that isn't reliable for bug hunting because of the the message rate limiting. Therefore I prefer using SCSI logging MLCOMPLETE=1, where no rate limiting applies. But that's just a side note, it depends on the case what's more useful. Back to the cmd age output, IMO we're are on a thin line between capturing useful information and keeping the logs neat. As I already said, I'm not convinced that this information, as important it may be for the case(s) you're currently investigating, has the same generic degree of importance or usefulness as what's currently printed (the CDB and the sense data). But OTOH, that's just a gut feeling, and I can't claim to have the experience to make general statement on it. If noone else has issues with this being printed by default, I'm not going oppose it. > Also, we are not changing the existing text in SCSI command error > log, > but we are only adding one single word at the end of message. Ideally > the user scripts are written to grep specific pattern from the logs. > Since we are not replacing any existing text from the logs, the > scripts should still work with this change as well. You are certainly aware that such scripts don't necessarily conform to what kernel developers would consider "ideal" :-) But again, I just wanted to raise the issue; if noone else thinks it matters, fine with me. Thanks Martin
diff --git a/drivers/scsi/scsi_logging.c b/drivers/scsi/scsi_logging.c index ecc5918e372a..ca2182bc53c6 100644 --- a/drivers/scsi/scsi_logging.c +++ b/drivers/scsi/scsi_logging.c @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg, const char *mlret_string = scsi_mlreturn_string(disposition); const char *hb_string = scsi_hostbyte_string(cmd->result); const char *db_string = scsi_driverbyte_string(cmd->result); + unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ; logbuf = scsi_log_reserve_buffer(&logbuf_len); if (!logbuf) @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg, if (db_string) off += scnprintf(logbuf + off, logbuf_len - off, - "driverbyte=%s", db_string); + "driverbyte=%s ", db_string); else off += scnprintf(logbuf + off, logbuf_len - off, - "driverbyte=0x%02x", driver_byte(cmd->result)); + "driverbyte=0x%02x ", + driver_byte(cmd->result)); + + off += scnprintf(logbuf + off, logbuf_len - off, + "cmd-age=%lus", cmd_age); + out_printk: dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf); scsi_log_release_buffer(logbuf);
Couple of users had requested to print the SCSI command age along with command failure errors. This is a small change, but allows users to get more important information about the command that was failed, it would help the users in debugging the command failures: Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> ---