diff mbox series

[05/13] lightnvm: pblk: Count all read errors in stats

Message ID 20190227171442.11853-6-igor.j.konopko@intel.com (mailing list archive)
State New, archived
Headers show
Series lightnvm: bugfixes and improvements | expand

Commit Message

Igor Konopko Feb. 27, 2019, 5:14 p.m. UTC
Currently when unknown error occurs on read path
there is only dmesg information about it, but it
is not counted in sysfs statistics. Since this is
still an error we should also count it there.

Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
---
 drivers/lightnvm/pblk-core.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Javier González March 4, 2019, 7:42 a.m. UTC | #1
> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> 
> Currently when unknown error occurs on read path
> there is only dmesg information about it, but it
> is not counted in sysfs statistics. Since this is
> still an error we should also count it there.
> 
> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> ---
> drivers/lightnvm/pblk-core.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> index eabcbc119681..a98b2255f963 100644
> --- a/drivers/lightnvm/pblk-core.c
> +++ b/drivers/lightnvm/pblk-core.c
> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> 		atomic_long_inc(&pblk->read_failed);
> 		break;
> 	default:
> +		atomic_long_inc(&pblk->read_failed);
> 		pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> 	}
> #ifdef CONFIG_NVM_PBLK_DEBUG
> --
> 2.17.1

I left this out intentionally  so that we could correlate the logs from
the controller and the errors in the read path. Since we do not have an
standard way to correlate this on SMART yet, let’s add this now (I
assume that you are using it for something?) and we can separate the
error stats in the future.

Reviewed-by: Javier González <javier@javigon.com>
Hans Holmberg March 4, 2019, 9:02 a.m. UTC | #2
Igor: Have you seen this happening in real life?

I think it would be better to count all expected errors and put them
in the right bucket (without spamming dmesg). If we need a new bucket
for i.e. vendor-specific-errors, let's do that instead.

Someone wiser than me told me that every error print in the log is a
potential customer call.

Javier: Yeah, I think S.M.A.R.T is the way to deliver this
information. Why can't we let the drives expose this info and remove
this from pblk? What's blocking that?

Thanks,
Hans

On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
>
> > On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> >
> > Currently when unknown error occurs on read path
> > there is only dmesg information about it, but it
> > is not counted in sysfs statistics. Since this is
> > still an error we should also count it there.
> >
> > Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> > ---
> > drivers/lightnvm/pblk-core.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> > index eabcbc119681..a98b2255f963 100644
> > --- a/drivers/lightnvm/pblk-core.c
> > +++ b/drivers/lightnvm/pblk-core.c
> > @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> >               atomic_long_inc(&pblk->read_failed);
> >               break;
> >       default:
> > +             atomic_long_inc(&pblk->read_failed);
> >               pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> >       }
> > #ifdef CONFIG_NVM_PBLK_DEBUG
> > --
> > 2.17.1
>
> I left this out intentionally  so that we could correlate the logs from
> the controller and the errors in the read path. Since we do not have an
> standard way to correlate this on SMART yet, let’s add this now (I
> assume that you are using it for something?) and we can separate the
> error stats in the future.
>
> Reviewed-by: Javier González <javier@javigon.com>
Javier González March 4, 2019, 9:23 a.m. UTC | #3
> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> 
> Igor: Have you seen this happening in real life?
> 
> I think it would be better to count all expected errors and put them
> in the right bucket (without spamming dmesg). If we need a new bucket
> for i.e. vendor-specific-errors, let's do that instead.
> 
> Someone wiser than me told me that every error print in the log is a
> potential customer call.
> 
> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> information. Why can't we let the drives expose this info and remove
> this from pblk? What's blocking that?

Until now the spec. We added some new log information in Denali exactly
for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
have it here, at least for debugging.

> 
> Thanks,
> Hans
> 
> On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
>>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
>>> 
>>> Currently when unknown error occurs on read path
>>> there is only dmesg information about it, but it
>>> is not counted in sysfs statistics. Since this is
>>> still an error we should also count it there.
>>> 
>>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
>>> ---
>>> drivers/lightnvm/pblk-core.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>> 
>>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
>>> index eabcbc119681..a98b2255f963 100644
>>> --- a/drivers/lightnvm/pblk-core.c
>>> +++ b/drivers/lightnvm/pblk-core.c
>>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
>>>              atomic_long_inc(&pblk->read_failed);
>>>              break;
>>>      default:
>>> +             atomic_long_inc(&pblk->read_failed);
>>>              pblk_err(pblk, "unknown read error:%d\n", rqd->error);
>>>      }
>>> #ifdef CONFIG_NVM_PBLK_DEBUG
>>> --
>>> 2.17.1
>> 
>> I left this out intentionally  so that we could correlate the logs from
>> the controller and the errors in the read path. Since we do not have an
>> standard way to correlate this on SMART yet, let’s add this now (I
>> assume that you are using it for something?) and we can separate the
>> error stats in the future.
>> 
>> Reviewed-by: Javier González <javier@javigon.com>
Hans Holmberg March 4, 2019, 11:41 a.m. UTC | #4
On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>
> > On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> >
> > Igor: Have you seen this happening in real life?
> >
> > I think it would be better to count all expected errors and put them
> > in the right bucket (without spamming dmesg). If we need a new bucket
> > for i.e. vendor-specific-errors, let's do that instead.
> >
> > Someone wiser than me told me that every error print in the log is a
> > potential customer call.
> >
> > Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> > information. Why can't we let the drives expose this info and remove
> > this from pblk? What's blocking that?
>
> Until now the spec. We added some new log information in Denali exactly
> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
> have it here, at least for debugging.

Why add it to the spec? Why not use whatever everyone else is using?

https://en.wikipedia.org/wiki/S.M.A.R.T. :
"S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
written as SMART) is a monitoring system included in computer hard
disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
primary function is to detect and report various indicators of drive
reliability with the intent of anticipating imminent hardware
failures."
Sounds like what we want here.

For debugging, a trace point or something(i.e. BPF) would be a better
solution that would not impact hot-path performance.

>
> >
> > Thanks,
> > Hans
> >
> > On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
> >>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> >>>
> >>> Currently when unknown error occurs on read path
> >>> there is only dmesg information about it, but it
> >>> is not counted in sysfs statistics. Since this is
> >>> still an error we should also count it there.
> >>>
> >>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> >>> ---
> >>> drivers/lightnvm/pblk-core.c | 1 +
> >>> 1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> >>> index eabcbc119681..a98b2255f963 100644
> >>> --- a/drivers/lightnvm/pblk-core.c
> >>> +++ b/drivers/lightnvm/pblk-core.c
> >>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> >>>              atomic_long_inc(&pblk->read_failed);
> >>>              break;
> >>>      default:
> >>> +             atomic_long_inc(&pblk->read_failed);
> >>>              pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> >>>      }
> >>> #ifdef CONFIG_NVM_PBLK_DEBUG
> >>> --
> >>> 2.17.1
> >>
> >> I left this out intentionally  so that we could correlate the logs from
> >> the controller and the errors in the read path. Since we do not have an
> >> standard way to correlate this on SMART yet, let’s add this now (I
> >> assume that you are using it for something?) and we can separate the
> >> error stats in the future.
> >>
> >> Reviewed-by: Javier González <javier@javigon.com>
Javier González March 4, 2019, 11:45 a.m. UTC | #5
> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
> 
> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
>>> 
>>> Igor: Have you seen this happening in real life?
>>> 
>>> I think it would be better to count all expected errors and put them
>>> in the right bucket (without spamming dmesg). If we need a new bucket
>>> for i.e. vendor-specific-errors, let's do that instead.
>>> 
>>> Someone wiser than me told me that every error print in the log is a
>>> potential customer call.
>>> 
>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
>>> information. Why can't we let the drives expose this info and remove
>>> this from pblk? What's blocking that?
>> 
>> Until now the spec. We added some new log information in Denali exactly
>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
>> have it here, at least for debugging.
> 
> Why add it to the spec? Why not use whatever everyone else is using?
> 
> https://en.wikipedia.org/wiki/S.M.A.R.T. :
> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
> written as SMART) is a monitoring system included in computer hard
> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
> primary function is to detect and report various indicators of drive
> reliability with the intent of anticipating imminent hardware
> failures."
> Sounds like what we want here.

I know what smart is… You need to define the fields. Maybe you want to
read Denali again - the extensions are couple with smart.

> For debugging, a trace point or something(i.e. BPF) would be a better
> solution that would not impact hot-path performance.

Cool. Look forward to the patches ;)

Javier
Igor Konopko March 4, 2019, 12:42 p.m. UTC | #6
On 04.03.2019 12:45, Javier González wrote:
> 
>> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
>>
>> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
>>>>
>>>> Igor: Have you seen this happening in real life?
>>>>
>>>> I think it would be better to count all expected errors and put them
>>>> in the right bucket (without spamming dmesg). If we need a new bucket
>>>> for i.e. vendor-specific-errors, let's do that instead.

Generally I'm seeing different types of errors (which are typically as 
Javier mention controller errors) in cases such as hot drive removal, etc.

We can skip that patch, since this are kind of corner cases. I can also
create new type of pblk stats, sth. like "controller errors", which 
would collect all the other unexpected errors in one place instead of 
mixing them with real read/write errors as I did.

>>>>
>>>> Someone wiser than me told me that every error print in the log is a
>>>> potential customer call.
>>>>
>>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
>>>> information. Why can't we let the drives expose this info and remove
>>>> this from pblk? What's blocking that?
>>>
>>> Until now the spec. We added some new log information in Denali exactly
>>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
>>> have it here, at least for debugging.
>>
>> Why add it to the spec? Why not use whatever everyone else is using?
>>
>> https://en.wikipedia.org/wiki/S.M.A.R.T. :
>> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
>> written as SMART) is a monitoring system included in computer hard
>> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
>> primary function is to detect and report various indicators of drive
>> reliability with the intent of anticipating imminent hardware
>> failures."
>> Sounds like what we want here.
> 
> I know what smart is… You need to define the fields. Maybe you want to
> read Denali again - the extensions are couple with smart.
> 
>> For debugging, a trace point or something(i.e. BPF) would be a better
>> solution that would not impact hot-path performance.
> 
> Cool. Look forward to the patches ;)
> 
> Javier
>
Hans Holmberg March 4, 2019, 12:48 p.m. UTC | #7
On Mon, Mar 4, 2019 at 1:42 PM Igor Konopko <igor.j.konopko@intel.com> wrote:
>
>
>
> On 04.03.2019 12:45, Javier González wrote:
> >
> >> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
> >>
> >> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
> >>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> >>>>
> >>>> Igor: Have you seen this happening in real life?
> >>>>
> >>>> I think it would be better to count all expected errors and put them
> >>>> in the right bucket (without spamming dmesg). If we need a new bucket
> >>>> for i.e. vendor-specific-errors, let's do that instead.
>
> Generally I'm seeing different types of errors (which are typically as
> Javier mention controller errors) in cases such as hot drive removal, etc.
>
> We can skip that patch, since this are kind of corner cases. I can also
> create new type of pblk stats, sth. like "controller errors", which
> would collect all the other unexpected errors in one place instead of
> mixing them with real read/write errors as I did.

Yes, that makes sense.

Thanks,
Hans

>
> >>>>
> >>>> Someone wiser than me told me that every error print in the log is a
> >>>> potential customer call.
> >>>>
> >>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> >>>> information. Why can't we let the drives expose this info and remove
> >>>> this from pblk? What's blocking that?
> >>>
> >>> Until now the spec. We added some new log information in Denali exactly
> >>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
> >>> have it here, at least for debugging.
> >>
> >> Why add it to the spec? Why not use whatever everyone else is using?
> >>
> >> https://en.wikipedia.org/wiki/S.M.A.R.T. :
> >> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
> >> written as SMART) is a monitoring system included in computer hard
> >> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
> >> primary function is to detect and report various indicators of drive
> >> reliability with the intent of anticipating imminent hardware
> >> failures."
> >> Sounds like what we want here.
> >
> > I know what smart is… You need to define the fields. Maybe you want to
> > read Denali again - the extensions are couple with smart.
> >
> >> For debugging, a trace point or something(i.e. BPF) would be a better
> >> solution that would not impact hot-path performance.
> >
> > Cool. Look forward to the patches ;)
> >
> > Javier
> >
diff mbox series

Patch

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index eabcbc119681..a98b2255f963 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -493,6 +493,7 @@  void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
 		atomic_long_inc(&pblk->read_failed);
 		break;
 	default:
+		atomic_long_inc(&pblk->read_failed);
 		pblk_err(pblk, "unknown read error:%d\n", rqd->error);
 	}
 #ifdef CONFIG_NVM_PBLK_DEBUG