Message ID | 20200110042634.4843-1-anand.jain@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: device stat, log when zeroed assist audit |

On 1/9/20 11:26 PM, Anand Jain wrote:
> We had a report indicating that some read errors aren't reported by
> the device stats in the userland. It is important to have the errors
> reported in the device stat as user land scripts might depend on it to
> take the reasonable corrective actions. But to debug these issue we need
> to be really sure that request to reset the device stat did not come
> from the userland itself. So log an info message when device error reset
> happens.
>
> For example:
>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
>
> Reported-by: philip@philip-seeger.de
> Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
> The last words are name and pid of the process, unfortunately it came out
> as 'by btrfs'. At some point if there is a python and lib to reset it
> would change, otherwise its going to be 'by btrfs', I am ok with it,
> if otherwise please suggest the alternative.

I think name(pid) makes sense, similar to what drop_caches does

pr_info("%s (%d): drop_caches: %d\n",
	current->comm, task_pid_nr(current),

Thanks,

Josef

On 10.01.20 at 6:26, Anand Jain wrote:
> We had a report indicating that some read errors aren't reported by
> the device stats in the userland. It is important to have the errors
> reported in the device stat as user land scripts might depend on it to
> take the reasonable corrective actions. But to debug these issue we need
> to be really sure that request to reset the device stat did not come
> from the userland itself. So log an info message when device error reset
> happens.
>
> For example:
>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
>
> Reported-by: philip@philip-seeger.de
> Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
> The last words are name and pid of the process, unfortunately it came out
> as 'by btrfs'. At some point if there is a python and lib to reset it
> would change, otherwise its going to be 'by btrfs', I am ok with it,
> if otherwise please suggest the alternative.

This patch itself is OK but is not related to what Philip has reported.
The issue there is the fact that we only record errors for 2 specific
retvals from the block layer.

>
>  fs/btrfs/volumes.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index eb55df0d4038..6fd90270e2c7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7324,6 +7324,8 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
>  			else
>  				btrfs_dev_stat_set(dev, i, 0);
>  		}
> +		btrfs_info(fs_info, "device stats zeroed by %s (%d)",
> +			   current->comm, task_pid_nr(current));
>  	} else {
>  		for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
>  			if (stats->nr_items > i)
>

On 1/10/20 11:07 PM, Josef Bacik wrote:
> On 1/9/20 11:26 PM, Anand Jain wrote:
>> We had a report indicating that some read errors aren't reported by
>> the device stats in the userland. It is important to have the errors
>> reported in the device stat as user land scripts might depend on it to
>> take the reasonable corrective actions. But to debug these issue we need
>> to be really sure that request to reset the device stat did not come
>> from the userland itself. So log an info message when device error reset
>> happens.
>>
>> For example:
>>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
>>
>> Reported-by: philip@philip-seeger.de
>> Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
>> The last words are name and pid of the process, unfortunately it came out
>> as 'by btrfs'. At some point if there is a python and lib to reset it
>> would change, otherwise its going to be 'by btrfs', I am ok with it,
>> if otherwise please suggest the alternative.
>
> I think name(pid) makes sense, similar to what drop_caches does
>
> pr_info("%s (%d): drop_caches: %d\n",
>	current->comm, task_pid_nr(current),

There is a small deviation from what we already have in
device_list_add(): there, name (pid) is at the end of the log message.

------
pr_info(
	"BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d)\n",
	disk_super->label, devid,
	found_transid, path,
	current->comm, task_pid_nr(current));
--------

I am not sure. Can David tweak this during merge?

Thanks, Anand

> Thanks,
>
> Josef

On Sat, Jan 11, 2020 at 04:50:18PM +0800, Anand Jain wrote:
> On 1/10/20 11:07 PM, Josef Bacik wrote:
> > On 1/9/20 11:26 PM, Anand Jain wrote:
> >> We had a report indicating that some read errors aren't reported by
> >> the device stats in the userland. It is important to have the errors
> >> reported in the device stat as user land scripts might depend on it to
> >> take the reasonable corrective actions. But to debug these issue we need
> >> to be really sure that request to reset the device stat did not come
> >> from the userland itself. So log an info message when device error reset
> >> happens.
> >>
> >> For example:
> >>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
> >>
> >> Reported-by: philip@philip-seeger.de
> >> Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
> >> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> >> ---
> >>  BTRFS info (device sdc): device stats zeroed by btrfs (9223)
> >> The last words are name and pid of the process, unfortunately it came out
> >> as 'by btrfs'. At some point if there is a python and lib to reset it
> >> would change, otherwise its going to be 'by btrfs', I am ok with it,
> >> if otherwise please suggest the alternative.
> >
> > I think name(pid) makes sense, similar to what drop_caches does
> >
> > pr_info("%s (%d): drop_caches: %d\n",
> >	current->comm, task_pid_nr(current),
>
> There is a small deviation to what we already have in
> device_list_add(), name (pid) is at the end the log message..
>
> ------
> pr_info(
>	"BTRFS: device label %s devid %llu transid %llu %s scanned by
> %s (%d)\n",
>	disk_super->label, devid,
>	found_transid, path,
>	current->comm, task_pid_nr(current));
> --------
>
> I am not sure. Can David can tweak during merge ?

Yes, no problem.

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb55df0d4038..6fd90270e2c7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7324,6 +7324,8 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
 			else
 				btrfs_dev_stat_set(dev, i, 0);
 		}
+		btrfs_info(fs_info, "device stats zeroed by %s (%d)",
+			   current->comm, task_pid_nr(current));
 	} else {
 		for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
 			if (stats->nr_items > i)

We had a report indicating that some read errors aren't reported by
the device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.

For example:
 BTRFS info (device sdc): device stats zeroed by btrfs (9223)

Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 BTRFS info (device sdc): device stats zeroed by btrfs (9223)
The last words are name and pid of the process, unfortunately it came out
as 'by btrfs'. At some point if there is a python and lib to reset it
would change, otherwise its going to be 'by btrfs', I am ok with it,
if otherwise please suggest the alternative.

 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)
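
For readers who want to see the two message layouts discussed in this thread side by side, here is a minimal userspace sketch. It is not kernel code and not part of the patch: the function names (format_suffix_style, format_prefix_style, log_zeroed_by_self) are hypothetical, prctl(PR_GET_NAME) stands in for current->comm, and getpid() stands in for task_pid_nr(current). The "BTRFS info (device ...): " prefix is written out by hand here, on the assumption that btrfs_info() adds it in the kernel, as the example log line suggests.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#define COMM_LEN 16 /* PR_GET_NAME writes at most 16 bytes, including the NUL */

/* Suffix style, as in the patch: "... zeroed by name (pid)". */
int format_suffix_style(char *buf, size_t len, const char *dev,
			const char *comm, pid_t pid)
{
	return snprintf(buf, len,
			"BTRFS info (device %s): device stats zeroed by %s (%d)",
			dev, comm, (int)pid);
}

/* Prefix style, as in the drop_caches message quoted in the thread. */
int format_prefix_style(char *buf, size_t len, const char *comm,
			pid_t pid, int value)
{
	return snprintf(buf, len, "%s (%d): drop_caches: %d",
			comm, (int)pid, value);
}

/* Print the suffix-style line for the calling process (hypothetical helper). */
void log_zeroed_by_self(const char *dev)
{
	char comm[COMM_LEN];
	char line[128];
	int n;

	/* Userspace stand-in for current->comm; fall back if the call fails. */
	if (prctl(PR_GET_NAME, (unsigned long)comm) != 0)
		strcpy(comm, "unknown");

	n = format_suffix_style(line, sizeof(line), dev, comm, getpid());
	assert(n > 0 && (size_t)n < sizeof(line)); /* sanity check: line fit */
	puts(line);
}
```

Calling log_zeroed_by_self("sdc") from a program prints a line in the same shape as the example in the commit message, with that program's own name and pid at the end. That placement matches the existing "scanned by" message in device_list_add().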