diff mbox series

btrfs: device stat, log when zeroed assist audit

Message ID 20200110042634.4843-1-anand.jain@oracle.com (mailing list archive)
State New, archived
Series btrfs: device stat, log when zeroed assist audit

Commit Message

Anand Jain Jan. 10, 2020, 4:26 a.m. UTC
We had a report indicating that some read errors aren't reported by
the device stats in userland. It is important that errors are reported
in the device stats, as userland scripts might depend on them to take
reasonable corrective action. But to debug these issues we need to be
really sure that the request to reset the device stats did not come
from userland itself. So log an info message when a device stats reset
happens.

For example:
 BTRFS info (device sdc): device stats zeroed by btrfs (9223)

Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 BTRFS info (device sdc): device stats zeroed by btrfs (9223)
The last words are the name and pid of the process; unfortunately it came
out as 'by btrfs'. If at some point there is a python script or library
that resets the stats, the name would change; otherwise it's going to be
'by btrfs'. I am OK with that; if you are not, please suggest an
alternative.

 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)
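
As an illustration of the message format described above, here is a minimal userspace sketch. It assumes the log prefix looks like the example line in the commit message; format_zeroed_msg() is a hypothetical helper, not a btrfs function. In the kernel the name and pid come from current->comm and task_pid_nr(current), whereas here they are plain parameters so the formatting can be checked in isolation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace model of the message the patch emits.
 * format_zeroed_msg() is a hypothetical helper for illustration only.
 * Returns the snprintf() result: the length the full message needs.
 */
static int format_zeroed_msg(char *buf, size_t len, const char *dev,
			     const char *comm, int pid)
{
	assert(buf != NULL && len > 0);
	/* The "BTRFS info (device %s): " prefix mirrors the example log line. */
	return snprintf(buf, len,
			"BTRFS info (device %s): device stats zeroed by %s (%d)",
			dev, comm, pid);
}
```

Because the name is taken from the calling task, any reset issued through the btrfs command line tool is reported as 'by btrfs', which matches the note above.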

Comments

Josef Bacik Jan. 10, 2020, 3:07 p.m. UTC | #1
On 1/9/20 11:26 PM, Anand Jain wrote:
>   BTRFS info (device sdc): device stats zeroed by btrfs (9223)
> The last words are the name and pid of the process; unfortunately it came
> out as 'by btrfs'. If at some point there is a python script or library
> that resets the stats, the name would change; otherwise it's going to be
> 'by btrfs'. I am OK with that; if you are not, please suggest an
> alternative.

I think name(pid) makes sense, similar to what drop_caches does

pr_info("%s (%d): drop_caches: %d\n",
	current->comm, task_pid_nr(current),

Thanks,

Josef
Nikolay Borisov Jan. 10, 2020, 7:47 p.m. UTC | #2
On 10.01.20 6:26, Anand Jain wrote:
> We had a report indicating that some read errors aren't reported by
> the device stats in userland. It is important that errors are reported
> in the device stats, as userland scripts might depend on them to take
> reasonable corrective action. But to debug these issues we need to be
> really sure that the request to reset the device stats did not come
> from userland itself. So log an info message when a device stats reset
> happens.

This patch itself is OK, but it is not related to what Philip has
reported. The issue there is that we only record errors for two specific
return values from the block layer.
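
To make the point about return values concrete, here is a small userspace model. The thread does not name the two statuses, so the enum below is invented for illustration (IO_ERR_MEDIUM and IO_ERR_TARGET stand in for "the two that are counted"); end_io() and struct dev_stats are likewise hypothetical and not btrfs code. The sketch only shows how an I/O failure with any other status would leave the counters untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion statuses; the names are illustrative only. */
enum io_status {
	IO_OK = 0,
	IO_ERR_MEDIUM,		/* counted in this model */
	IO_ERR_TARGET,		/* counted in this model */
	IO_ERR_TIMEOUT,		/* not counted in this model */
	IO_ERR_TRANSPORT,	/* not counted in this model */
};

struct dev_stats {
	unsigned long read_errs;
	unsigned long write_errs;
};

/* Only two specific statuses feed the device stats. */
static bool status_is_counted(enum io_status st)
{
	return st == IO_ERR_MEDIUM || st == IO_ERR_TARGET;
}

/* Model of an I/O completion handler updating the counters. */
static void end_io(struct dev_stats *s, enum io_status st, bool is_write)
{
	assert(s != NULL);
	if (st == IO_OK || !status_is_counted(st))
		return;		/* other failures never reach the counters */
	if (is_write)
		s->write_errs++;
	else
		s->read_errs++;
}
```

In this model a read that fails with IO_ERR_TIMEOUT is a real error from the caller's point of view, yet read_errs stays unchanged, which is the kind of gap the report describes.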

Anand Jain Jan. 11, 2020, 8:50 a.m. UTC | #3
On 1/10/20 11:07 PM, Josef Bacik wrote:
> 
> I think name(pid) makes sense, similar to what drop_caches does
> 
> pr_info("%s (%d): drop_caches: %d\n",
>      current->comm, task_pid_nr(current),

There is a small deviation from what we already have in
device_list_add(): there, name (pid) is at the end of the log message.

------
	pr_info(
"BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d)\n",
		disk_super->label, devid, found_transid, path,
		current->comm, task_pid_nr(current));
--------

I am not sure. Can David tweak this during merge?

Thanks, Anand

> Thanks,
> 
> Josef
David Sterba Jan. 13, 2020, 4:59 p.m. UTC | #4
On Sat, Jan 11, 2020 at 04:50:18PM +0800, Anand Jain wrote:
> There is a small deviation from what we already have in
> device_list_add(): there, name (pid) is at the end of the log message.
>
> ------
> 	pr_info(
> "BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d)\n",
> 		disk_super->label, devid, found_transid, path,
> 		current->comm, task_pid_nr(current));
> --------
>
> I am not sure. Can David tweak this during merge?

Yes, no problem.

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb55df0d4038..6fd90270e2c7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7324,6 +7324,8 @@  int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
 			else
 				btrfs_dev_stat_set(dev, i, 0);
 		}
+		btrfs_info(fs_info, "device stats zeroed by %s (%d)",
+			   current->comm, task_pid_nr(current));
 	} else {
 		for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
 			if (stats->nr_items > i)
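
The control flow around the hunk can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the kernel function: STATS_RESET is a hypothetical flag bit, STAT_VALUES_MAX stands in for BTRFS_DEV_STAT_VALUES_MAX (assumed to be 5 here), and the read-then-clear step in the reset branch is inferred from the surrounding context rather than shown in the diff. The point is that the new message is emitted once per reset request, after all counters have been cleared, and never on a plain read.

```c
#include <assert.h>
#include <stdio.h>

#define STAT_VALUES_MAX 5	/* stands in for BTRFS_DEV_STAT_VALUES_MAX */
#define STATS_RESET 0x1UL	/* hypothetical "reset requested" flag bit */

struct device {
	unsigned long stat[STAT_VALUES_MAX];	/* per-device error counters */
};

struct stats_req {
	unsigned long nr_items;	/* how many values the caller has room for */
	unsigned long flags;
	unsigned long values[STAT_VALUES_MAX];
};

/* Returns 1 if the counters were zeroed (and a message logged), else 0. */
static int get_dev_stats(struct device *dev, struct stats_req *req,
			 const char *comm, int pid)
{
	unsigned long i;

	assert(dev != NULL && req != NULL);

	if (req->flags & STATS_RESET) {
		for (i = 0; i < STAT_VALUES_MAX; i++) {
			/* Hand back the old value when the caller has room for it. */
			if (req->nr_items > i)
				req->values[i] = dev->stat[i];
			/* Every counter is cleared either way. */
			dev->stat[i] = 0;
		}
		/* One audit line per reset request, naming the caller. */
		printf("device stats zeroed by %s (%d)\n", comm, pid);
		return 1;
	}

	/* Plain read: copy out the counters, leave the device untouched. */
	for (i = 0; i < STAT_VALUES_MAX; i++)
		if (req->nr_items > i)
			req->values[i] = dev->stat[i];
	return 0;
}
```

Note that in this model a caller passing a small nr_items still clears every counter, so the log line is the only record that the remaining values existed before the reset.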