[RFC,00/15] File system wide monitoring

Message ID	20210426184201.4177978-1-krisman@collabora.com (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> sender: krisman) with ESMTPSA id B0C611F41E07 From: Gabriel Krisman Bertazi <krisman@collabora.com> To: amir73il@gmail.com, tytso@mit.edu, djwong@kernel.org Cc: david@fromorbit.com, jack@suse.com, dhowells@redhat.com, khazhy@google.com, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi <krisman@collabora.com>, kernel@collabora.com Subject: [PATCH RFC 00/15] File system wide monitoring Date: Mon, 26 Apr 2021 14:41:46 -0400 Message-Id: <20210426184201.4177978-1-krisman@collabora.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	File system wide monitoring \| expand [RFC,00/15] File system wide monitoring [RFC,01/15] fanotify: Fold event size calculation to its own function [RFC,02/15] fanotify: Split fsid check from other fid mode checks [RFC,03/15] fsnotify: Wire flags field on group allocation [RFC,04/15] fsnotify: Wire up group information on event initialization [RFC,05/15] fsnotify: Support event submission through ring buffer [RFC,06/15] fanotify: Support submission through ring buffer [RFC,07/15] fsnotify: Support FS_ERROR event type [RFC,08/15] fsnotify: Introduce helpers to send error_events [RFC,09/15] fanotify: Introduce generic error record [RFC,10/15] fanotify: Introduce code location record [RFC,11/15] fanotify: Introduce filesystem specific data record [RFC,12/15] fanotify: Introduce the FAN_ERROR mark [RFC,13/15] ext4: Send notifications on error [RFC,14/15] samples: Add fs error monitoring example [RFC,15/15] Documentation: Document the FAN_ERROR framework

Message ID

20210426184201.4177978-1-krisman@collabora.com (mailing list archive)

Headers

From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: amir73il@gmail.com, tytso@mit.edu, djwong@kernel.org
Cc: david@fromorbit.com, jack@suse.com, dhowells@redhat.com,
        khazhy@google.com, linux-fsdevel@vger.kernel.org,
        linux-ext4@vger.kernel.org,
        Gabriel Krisman Bertazi <krisman@collabora.com>,
        kernel@collabora.com
Subject: [PATCH RFC 00/15] File system wide monitoring
Date: Mon, 26 Apr 2021 14:41:46 -0400
Message-Id: <20210426184201.4177978-1-krisman@collabora.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

File system wide monitoring | expand

Message

Gabriel Krisman Bertazi April 26, 2021, 6:41 p.m. UTC

Hi,

In an attempt to consolidate some of the feedback from the previous
proposals, I wrote a new attempt to solve the file system error reporting
problem.  Before I spend more time polishing it, I'd like to hear your
feedback if I'm going in the wrong direction, in particular with the
modifications to fsnotify.

This RFC follows up on my previous proposals which attempted to leverage
watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
to push error notifications to user space.  This proposal starts by, as
suggested by Darrick, limiting the scope of what I'm trying to do to an
interface for administrators to monitor the health of a file system,
instead of a generic inteface for file errors.  Therefore, this doesn't
solve the problem of writeback errors or the need to watch a specific
subsystem.

* Format

The feature is implemented on top of fanotify, as a new type of fanotify
mark, FAN_ERROR, which a file system monitoring tool can register to
receive notifications.  A notification is split in three parts, and only
the first is guaranteed to exist for any given error event:

 - FS generic data: A file system agnostic structure that has a generic
 error code and identifies the filesystem.  Basically, it let's
 userspace know something happen on a monitored filesystem.

 - FS location data: Identifies where in the code the problem
 happened. (This is important for the use case of analysing frequent
 error points that we discussed earlier).

 - FS specific data: A detailed error report in a filesystem specific
 format that details what the error is.  Ideally, a capable monitoring
 tool can use the information here for error recovery.  For instance,
 xfs can put the xfs_scrub structures here, ext4 can send its error
 reports, etc.  An example of usage is done in the ext4 patch of this
 series.

More details on the information in each record can be found on the
documentation introduced in patch 15.

* Using fanotify

Using fanotify for this kind of thing is slightly tricky because we want
to guarantee delivery in some complicated conditions, for instance, the
file system might want to send an error while holding several locks.

Instead of working around file system constraints at the file system
level, this proposal tries to make the FAN_ERROR submission safe in
those contexts.  This is done with a new mode in fsnotify that
preallocates the memory at group creation to be used for the
notification submission.

This new mode in fsnotify introduces a ring buffer to queue
notifications, which eliminates the allocation path in fsnotify.  From
what I saw, the allocation is the only problem in fsnotify for
filesystems to submit errors in constrained situations.

* Visibility

Since the usecase is limited to a tool for whole file system monitoring,
errors are associated with the superblock and visible filesystem-wide.
It is assumed and required that userspace has CAP_SYS_ADMIN.

* Testing

This was tested with corrupted ext4 images in a few scenarios, which
caused errors to be triggered and monitored with the sample tool
provided in the next to final patch.

* patches

Patches 1-4 massage fanotify attempt to refactor fanotify a bit for
the patches to come.  Patch 5 introduce the ring buffer interface to
fsnotify, while patch 6 enable this support in fanotify.  Patch 7, 8 wire
the FS_ERROR event type, which will be used by filesystems.  In
sequennce, patches 9-12 implement the FAN_ERROR record types and create
the new event.  Patch 13 is an ext4 example implementation supporting
this feature.  Finally, patches 14 and 15 document and provide examples
of a userspace tool that uses this feature.

I also pushed the full series to:

  https://gitlab.collabora.com/krisman/linux -b fanotify-notifications

[1] https://lwn.net/Articles/839310/
[2] https://www.spinics.net/lists/linux-fsdevel/msg187075.html

Gabriel Krisman Bertazi (15):
  fanotify: Fold event size calculation to its own function
  fanotify: Split fsid check from other fid mode checks
  fsnotify: Wire flags field on group allocation
  fsnotify: Wire up group information on event initialization
  fsnotify: Support event submission through ring buffer
  fanotify: Support submission through ring buffer
  fsnotify: Support FS_ERROR event type
  fsnotify: Introduce helpers to send error_events
  fanotify: Introduce generic error record
  fanotify: Introduce code location record
  fanotify: Introduce filesystem specific data record
  fanotify: Introduce the FAN_ERROR mark
  ext4: Send notifications on error
  samples: Add fs error monitoring example
  Documentation: Document the FAN_ERROR framework

 .../admin-guide/filesystem-monitoring.rst     | 103 ++++++
 Documentation/admin-guide/index.rst           |   1 +
 fs/ext4/super.c                               |  60 +++-
 fs/notify/Makefile                            |   2 +-
 fs/notify/dnotify/dnotify.c                   |   2 +-
 fs/notify/fanotify/fanotify.c                 | 127 +++++--
 fs/notify/fanotify/fanotify.h                 |  35 +-
 fs/notify/fanotify/fanotify_user.c            | 319 ++++++++++++++----
 fs/notify/fsnotify.c                          |   2 +-
 fs/notify/group.c                             |  25 +-
 fs/notify/inotify/inotify_fsnotify.c          |   2 +-
 fs/notify/inotify/inotify_user.c              |   4 +-
 fs/notify/notification.c                      |  10 +
 fs/notify/ring.c                              | 199 +++++++++++
 include/linux/fanotify.h                      |  12 +-
 include/linux/fsnotify.h                      |  15 +
 include/linux/fsnotify_backend.h              |  63 +++-
 include/uapi/linux/ext4-notify.h              |  17 +
 include/uapi/linux/fanotify.h                 |  26 ++
 kernel/audit_fsnotify.c                       |   2 +-
 kernel/audit_tree.c                           |   2 +-
 kernel/audit_watch.c                          |   2 +-
 samples/Kconfig                               |   7 +
 samples/Makefile                              |   1 +
 samples/fanotify/Makefile                     |   3 +
 samples/fanotify/fs-monitor.c                 | 135 ++++++++
 26 files changed, 1034 insertions(+), 142 deletions(-)
 create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
 create mode 100644 fs/notify/ring.c
 create mode 100644 include/uapi/linux/ext4-notify.h
 create mode 100644 samples/fanotify/Makefile
 create mode 100644 samples/fanotify/fs-monitor.c

Comments

Amir Goldstein April 27, 2021, 4:11 a.m. UTC | #1

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Hi,
>
> In an attempt to consolidate some of the feedback from the previous
> proposals, I wrote a new attempt to solve the file system error reporting
> problem.  Before I spend more time polishing it, I'd like to hear your
> feedback if I'm going in the wrong direction, in particular with the
> modifications to fsnotify.
>

IMO you are going in the right direction, but you have gone a bit too far ;-)

My understanding of the requirements and my interpretation of the feedback
from filesystem maintainers is that the missing piece in the ecosystem is a
user notification that "something went wrong". The "what went wrong" part
is something that users and admins have long been able to gather from the
kernel log and from filesystem tools (e.g. last error recorded).

I do not see the need to duplicate existing functionality in fsmonitor.
Don't get me wrong, I understand why it would have been nice for fsmonitor
to be able to get all the errors nicely without looking anywhere else, but I
don't think it justifies the extra complexity.

> This RFC follows up on my previous proposals which attempted to leverage
> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
> to push error notifications to user space.  This proposal starts by, as
> suggested by Darrick, limiting the scope of what I'm trying to do to an
> interface for administrators to monitor the health of a file system,
> instead of a generic inteface for file errors.  Therefore, this doesn't
> solve the problem of writeback errors or the need to watch a specific
> subsystem.
>
> * Format
>
> The feature is implemented on top of fanotify, as a new type of fanotify
> mark, FAN_ERROR, which a file system monitoring tool can register to

You have a terminology mistake throughout your series.
FAN_ERROR is not a type of a mark, it is a type of an event.
A mark describes the watched object (i.e. a filesystem, mount, inode).

> receive notifications.  A notification is split in three parts, and only
> the first is guaranteed to exist for any given error event:
>
>  - FS generic data: A file system agnostic structure that has a generic
>  error code and identifies the filesystem.  Basically, it let's
>  userspace know something happen on a monitored filesystem.

I think an error seq counter per fs would be a nice addition to generic data.
It does not need to be persistent (it could be if filesystem supports it).

>
>  - FS location data: Identifies where in the code the problem
>  happened. (This is important for the use case of analysing frequent
>  error points that we discussed earlier).
>
>  - FS specific data: A detailed error report in a filesystem specific
>  format that details what the error is.  Ideally, a capable monitoring
>  tool can use the information here for error recovery.  For instance,
>  xfs can put the xfs_scrub structures here, ext4 can send its error
>  reports, etc.  An example of usage is done in the ext4 patch of this
>  series.
>
> More details on the information in each record can be found on the
> documentation introduced in patch 15.
>
> * Using fanotify
>
> Using fanotify for this kind of thing is slightly tricky because we want
> to guarantee delivery in some complicated conditions, for instance, the
> file system might want to send an error while holding several locks.
>
> Instead of working around file system constraints at the file system
> level, this proposal tries to make the FAN_ERROR submission safe in
> those contexts.  This is done with a new mode in fsnotify that
> preallocates the memory at group creation to be used for the
> notification submission.
>
> This new mode in fsnotify introduces a ring buffer to queue
> notifications, which eliminates the allocation path in fsnotify.  From
> what I saw, the allocation is the only problem in fsnotify for
> filesystems to submit errors in constrained situations.
>

The ring buffer functionality for fsnotify is interesting and it may be
useful on its own, but IMO, its too big of a hammer for the problem
at hand.

The question that you should be asking yourself is what is the
expected behavior in case of a flood of filesystem corruption errors.
I think it has already been expressed by filesystem maintainers on
one your previous postings, that a flood of filesystem corruption
errors is often noise and the only interesting information is the first error.

For this reason, I think that FS_ERROR could be implemented
by attaching an fsnotify_error_info object to an fsnotify_sb_mark:

struct fsnotify_sb_mark {
        struct fsnotify_mark fsn_mark;
        struct fsnotify_error_info info;
}

Similar to fd sampled errseq, there can be only one error report
per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
the error report can be allocated at the time of setting the filesystem mark.

With this, you will not need the added complexity of the ring buffer
and you will not need to limit FAN_ERROR reporting to a group that
is only listening for FAN_ERROR, which is an unneeded limitation IMO.

Anyway, in case, others do like the ring buffer approach, I do have
some technical comments on the implementation.
I will comment on individual patches.

Thanks,
Amir.

> * Visibility
>
> Since the usecase is limited to a tool for whole file system monitoring,
> errors are associated with the superblock and visible filesystem-wide.
> It is assumed and required that userspace has CAP_SYS_ADMIN.
>
> * Testing
>
> This was tested with corrupted ext4 images in a few scenarios, which
> caused errors to be triggered and monitored with the sample tool
> provided in the next to final patch.
>
> * patches
>
> Patches 1-4 massage fanotify attempt to refactor fanotify a bit for
> the patches to come.  Patch 5 introduce the ring buffer interface to
> fsnotify, while patch 6 enable this support in fanotify.  Patch 7, 8 wire
> the FS_ERROR event type, which will be used by filesystems.  In
> sequennce, patches 9-12 implement the FAN_ERROR record types and create
> the new event.  Patch 13 is an ext4 example implementation supporting
> this feature.  Finally, patches 14 and 15 document and provide examples
> of a userspace tool that uses this feature.
>
> I also pushed the full series to:
>
>   https://gitlab.collabora.com/krisman/linux -b fanotify-notifications
>
> [1] https://lwn.net/Articles/839310/
> [2] https://www.spinics.net/lists/linux-fsdevel/msg187075.html
>
> Gabriel Krisman Bertazi (15):
>   fanotify: Fold event size calculation to its own function
>   fanotify: Split fsid check from other fid mode checks
>   fsnotify: Wire flags field on group allocation
>   fsnotify: Wire up group information on event initialization
>   fsnotify: Support event submission through ring buffer
>   fanotify: Support submission through ring buffer
>   fsnotify: Support FS_ERROR event type
>   fsnotify: Introduce helpers to send error_events
>   fanotify: Introduce generic error record
>   fanotify: Introduce code location record
>   fanotify: Introduce filesystem specific data record
>   fanotify: Introduce the FAN_ERROR mark
>   ext4: Send notifications on error
>   samples: Add fs error monitoring example
>   Documentation: Document the FAN_ERROR framework
>
>  .../admin-guide/filesystem-monitoring.rst     | 103 ++++++
>  Documentation/admin-guide/index.rst           |   1 +
>  fs/ext4/super.c                               |  60 +++-
>  fs/notify/Makefile                            |   2 +-
>  fs/notify/dnotify/dnotify.c                   |   2 +-
>  fs/notify/fanotify/fanotify.c                 | 127 +++++--
>  fs/notify/fanotify/fanotify.h                 |  35 +-
>  fs/notify/fanotify/fanotify_user.c            | 319 ++++++++++++++----
>  fs/notify/fsnotify.c                          |   2 +-
>  fs/notify/group.c                             |  25 +-
>  fs/notify/inotify/inotify_fsnotify.c          |   2 +-
>  fs/notify/inotify/inotify_user.c              |   4 +-
>  fs/notify/notification.c                      |  10 +
>  fs/notify/ring.c                              | 199 +++++++++++
>  include/linux/fanotify.h                      |  12 +-
>  include/linux/fsnotify.h                      |  15 +
>  include/linux/fsnotify_backend.h              |  63 +++-
>  include/uapi/linux/ext4-notify.h              |  17 +
>  include/uapi/linux/fanotify.h                 |  26 ++
>  kernel/audit_fsnotify.c                       |   2 +-
>  kernel/audit_tree.c                           |   2 +-
>  kernel/audit_watch.c                          |   2 +-
>  samples/Kconfig                               |   7 +
>  samples/Makefile                              |   1 +
>  samples/fanotify/Makefile                     |   3 +
>  samples/fanotify/fs-monitor.c                 | 135 ++++++++
>  26 files changed, 1034 insertions(+), 142 deletions(-)
>  create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
>  create mode 100644 fs/notify/ring.c
>  create mode 100644 include/uapi/linux/ext4-notify.h
>  create mode 100644 samples/fanotify/Makefile
>  create mode 100644 samples/fanotify/fs-monitor.c
>
> --
> 2.31.0
>

Gabriel Krisman Bertazi April 27, 2021, 3:44 p.m. UTC | #2

Amir Goldstein <amir73il@gmail.com> writes:

> On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>>
>> Hi,
>>
>> In an attempt to consolidate some of the feedback from the previous
>> proposals, I wrote a new attempt to solve the file system error reporting
>> problem.  Before I spend more time polishing it, I'd like to hear your
>> feedback if I'm going in the wrong direction, in particular with the
>> modifications to fsnotify.
>>
>
> IMO you are going in the right direction, but you have gone a bit too far ;-)
>
> My understanding of the requirements and my interpretation of the feedback
> from filesystem maintainers is that the missing piece in the ecosystem is a
> user notification that "something went wrong". The "what went wrong" part
> is something that users and admins have long been able to gather from the
> kernel log and from filesystem tools (e.g. last error recorded).
>
> I do not see the need to duplicate existing functionality in fsmonitor.
> Don't get me wrong, I understand why it would have been nice for fsmonitor
> to be able to get all the errors nicely without looking anywhere else, but I
> don't think it justifies the extra complexity.

Hi Amir,

Thanks for the detailed review.

The reasons for the location record and the ring buffer is the use case
from Google to do analysis on a series of errors.  I understand this is
important to them, which is why I expanded a bit on the 'what went
wrong' and multiple errors.  In addition, The file system specific blob
attempts to assist online recovery tools with more information, but it
might make sense to do it in the future, when it is needed.

>> This RFC follows up on my previous proposals which attempted to leverage
>> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
>> to push error notifications to user space.  This proposal starts by, as
>> suggested by Darrick, limiting the scope of what I'm trying to do to an
>> interface for administrators to monitor the health of a file system,
>> instead of a generic inteface for file errors.  Therefore, this doesn't
>> solve the problem of writeback errors or the need to watch a specific
>> subsystem.
>>
>> * Format
>>
>> The feature is implemented on top of fanotify, as a new type of fanotify
>> mark, FAN_ERROR, which a file system monitoring tool can register to
>
> You have a terminology mistake throughout your series.
> FAN_ERROR is not a type of a mark, it is a type of an event.
> A mark describes the watched object (i.e. a filesystem, mount, inode).

Right.  I understand the mistake and will fix it around the series.
>
>> receive notifications.  A notification is split in three parts, and only
>> the first is guaranteed to exist for any given error event:
>>
>>  - FS generic data: A file system agnostic structure that has a generic
>>  error code and identifies the filesystem.  Basically, it let's
>>  userspace know something happen on a monitored filesystem.
>
> I think an error seq counter per fs would be a nice addition to generic data.
> It does not need to be persistent (it could be if filesystem supports it).

Makes sense to me.

>>
>>  - FS location data: Identifies where in the code the problem
>>  happened. (This is important for the use case of analysing frequent
>>  error points that we discussed earlier).
>>
>>  - FS specific data: A detailed error report in a filesystem specific
>>  format that details what the error is.  Ideally, a capable monitoring
>>  tool can use the information here for error recovery.  For instance,
>>  xfs can put the xfs_scrub structures here, ext4 can send its error
>>  reports, etc.  An example of usage is done in the ext4 patch of this
>>  series.
>>
>> More details on the information in each record can be found on the
>> documentation introduced in patch 15.
>>
>> * Using fanotify
>>
>> Using fanotify for this kind of thing is slightly tricky because we want
>> to guarantee delivery in some complicated conditions, for instance, the
>> file system might want to send an error while holding several locks.
>>
>> Instead of working around file system constraints at the file system
>> level, this proposal tries to make the FAN_ERROR submission safe in
>> those contexts.  This is done with a new mode in fsnotify that
>> preallocates the memory at group creation to be used for the
>> notification submission.
>>
>> This new mode in fsnotify introduces a ring buffer to queue
>> notifications, which eliminates the allocation path in fsnotify.  From
>> what I saw, the allocation is the only problem in fsnotify for
>> filesystems to submit errors in constrained situations.
>>
>
> The ring buffer functionality for fsnotify is interesting and it may be
> useful on its own, but IMO, its too big of a hammer for the problem
> at hand.
>
> The question that you should be asking yourself is what is the
> expected behavior in case of a flood of filesystem corruption errors.
> I think it has already been expressed by filesystem maintainers on
> one your previous postings, that a flood of filesystem corruption
> errors is often noise and the only interesting information is the
> first error.

My idea was be to provide an ioctl for the user to resize the ring
buffer when needed, to make the flood manageable. But I understand your
main point about the ring buffer.  i'm not sure saving only the first
notification solves Google's use case of error monitoring and analysis,
though.  Khazhy, Ted, can you weight in?

> For this reason, I think that FS_ERROR could be implemented
> by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
>
> struct fsnotify_sb_mark {
>         struct fsnotify_mark fsn_mark;
>         struct fsnotify_error_info info;
> }
>
> Similar to fd sampled errseq, there can be only one error report
> per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> the error report can be allocated at the time of setting the filesystem mark.
>
> With this, you will not need the added complexity of the ring buffer
> and you will not need to limit FAN_ERROR reporting to a group that
> is only listening for FAN_ERROR, which is an unneeded limitation IMO.

The limitation exists because I was concerned about not breaking the
semantics of FAN_ACCESS and others, with regards to merged
notifications.  I believe there should be no other reason why
notifications of FAN_CLASS_NOTIF can't be sent to the ring buffer too.
That limitation could be lifted for everything but permission events, I
think.

Khazhy Kumykov May 11, 2021, 4:45 a.m. UTC | #3

On Tue, Apr 27, 2021 at 8:44 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> >>
> >> Hi,
> >>
> >> In an attempt to consolidate some of the feedback from the previous
> >> proposals, I wrote a new attempt to solve the file system error reporting
> >> problem.  Before I spend more time polishing it, I'd like to hear your
> >> feedback if I'm going in the wrong direction, in particular with the
> >> modifications to fsnotify.
> >>
> >
> > IMO you are going in the right direction, but you have gone a bit too far ;-)
> >
> > My understanding of the requirements and my interpretation of the feedback
> > from filesystem maintainers is that the missing piece in the ecosystem is a
> > user notification that "something went wrong". The "what went wrong" part
> > is something that users and admins have long been able to gather from the
> > kernel log and from filesystem tools (e.g. last error recorded).
> >
> > I do not see the need to duplicate existing functionality in fsmonitor.
> > Don't get me wrong, I understand why it would have been nice for fsmonitor
> > to be able to get all the errors nicely without looking anywhere else, but I
> > don't think it justifies the extra complexity.
>
> Hi Amir,
>
> Thanks for the detailed review.
>
> The reasons for the location record and the ring buffer is the use case
> from Google to do analysis on a series of errors.  I understand this is
> important to them, which is why I expanded a bit on the 'what went
> wrong' and multiple errors.  In addition, The file system specific blob
> attempts to assist online recovery tools with more information, but it
> might make sense to do it in the future, when it is needed.
>
> >> This RFC follows up on my previous proposals which attempted to leverage
> >> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
> >> to push error notifications to user space.  This proposal starts by, as
> >> suggested by Darrick, limiting the scope of what I'm trying to do to an
> >> interface for administrators to monitor the health of a file system,
> >> instead of a generic inteface for file errors.  Therefore, this doesn't
> >> solve the problem of writeback errors or the need to watch a specific
> >> subsystem.
> >>
> >> * Format
> >>
> >> The feature is implemented on top of fanotify, as a new type of fanotify
> >> mark, FAN_ERROR, which a file system monitoring tool can register to
> >
> > You have a terminology mistake throughout your series.
> > FAN_ERROR is not a type of a mark, it is a type of an event.
> > A mark describes the watched object (i.e. a filesystem, mount, inode).
>
> Right.  I understand the mistake and will fix it around the series.
> >
> >> receive notifications.  A notification is split in three parts, and only
> >> the first is guaranteed to exist for any given error event:
> >>
> >>  - FS generic data: A file system agnostic structure that has a generic
> >>  error code and identifies the filesystem.  Basically, it let's
> >>  userspace know something happen on a monitored filesystem.
> >
> > I think an error seq counter per fs would be a nice addition to generic data.
> > It does not need to be persistent (it could be if filesystem supports it).
>
> Makes sense to me.
>
> >>
> >>  - FS location data: Identifies where in the code the problem
> >>  happened. (This is important for the use case of analysing frequent
> >>  error points that we discussed earlier).
> >>
> >>  - FS specific data: A detailed error report in a filesystem specific
> >>  format that details what the error is.  Ideally, a capable monitoring
> >>  tool can use the information here for error recovery.  For instance,
> >>  xfs can put the xfs_scrub structures here, ext4 can send its error
> >>  reports, etc.  An example of usage is done in the ext4 patch of this
> >>  series.
> >>
> >> More details on the information in each record can be found on the
> >> documentation introduced in patch 15.
> >>
> >> * Using fanotify
> >>
> >> Using fanotify for this kind of thing is slightly tricky because we want
> >> to guarantee delivery in some complicated conditions, for instance, the
> >> file system might want to send an error while holding several locks.
> >>
> >> Instead of working around file system constraints at the file system
> >> level, this proposal tries to make the FAN_ERROR submission safe in
> >> those contexts.  This is done with a new mode in fsnotify that
> >> preallocates the memory at group creation to be used for the
> >> notification submission.
> >>
> >> This new mode in fsnotify introduces a ring buffer to queue
> >> notifications, which eliminates the allocation path in fsnotify.  From
> >> what I saw, the allocation is the only problem in fsnotify for
> >> filesystems to submit errors in constrained situations.
> >>
> >
> > The ring buffer functionality for fsnotify is interesting and it may be
> > useful on its own, but IMO, its too big of a hammer for the problem
> > at hand.
> >
> > The question that you should be asking yourself is what is the
> > expected behavior in case of a flood of filesystem corruption errors.
> > I think it has already been expressed by filesystem maintainers on
> > one your previous postings, that a flood of filesystem corruption
> > errors is often noise and the only interesting information is the
> > first error.
>
> My idea was be to provide an ioctl for the user to resize the ring
> buffer when needed, to make the flood manageable. But I understand your
> main point about the ring buffer.  i'm not sure saving only the first
> notification solves Google's use case of error monitoring and analysis,
> though.  Khazhy, Ted, can you weight in?

I think this is a good point to bring up - a flood of errors shouldn't
drown out other filesystems, and from the perspective of error
reporting, it's better to drop all but one notification from one FS
than to drop the only notification from another. In the cases I can
think of, the first error is probably enough and does simplify things
quite a bit. There is the option of setting up a ring buffer per fs,
which does seem excessive in light of the previous statement.

>
> > For this reason, I think that FS_ERROR could be implemented
> > by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
> >
> > struct fsnotify_sb_mark {
> >         struct fsnotify_mark fsn_mark;
> >         struct fsnotify_error_info info;
> > }
> >
> > Similar to fd sampled errseq, there can be only one error report
> > per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> > the error report can be allocated at the time of setting the filesystem mark.
> >
> > With this, you will not need the added complexity of the ring buffer
> > and you will not need to limit FAN_ERROR reporting to a group that
> > is only listening for FAN_ERROR, which is an unneeded limitation IMO.
>
> The limitation exists because I was concerned about not breaking the
> semantics of FAN_ACCESS and others, with regards to merged
> notifications.  I believe there should be no other reason why
> notifications of FAN_CLASS_NOTIF can't be sent to the ring buffer too.
> That limitation could be lifted for everything but permission events, I
> think.
>
> --
> Gabriel Krisman Bertazi

Jan Kara May 11, 2021, 10:43 a.m. UTC | #4

On Tue 27-04-21 07:11:49, Amir Goldstein wrote:
> The ring buffer functionality for fsnotify is interesting and it may be
> useful on its own, but IMO, its too big of a hammer for the problem
> at hand.
> 
> The question that you should be asking yourself is what is the
> expected behavior in case of a flood of filesystem corruption errors.
> I think it has already been expressed by filesystem maintainers on
> one your previous postings, that a flood of filesystem corruption
> errors is often noise and the only interesting information is the first error.
> 
> For this reason, I think that FS_ERROR could be implemented
> by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
> 
> struct fsnotify_sb_mark {
>         struct fsnotify_mark fsn_mark;
>         struct fsnotify_error_info info;
> }
> 
> Similar to fd sampled errseq, there can be only one error report
> per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> the error report can be allocated at the time of setting the filesystem mark.
> 
> With this, you will not need the added complexity of the ring buffer
> and you will not need to limit FAN_ERROR reporting to a group that
> is only listening for FAN_ERROR, which is an unneeded limitation IMO.

Seeing that this 'single error per mark' idea is gathering some support I'd
like to add my 2c: Probably we don't want fsnotify_error_info attached to
every fsnotify_mark, I guess we can have:

struct fanotify_mark {
	struct fsnotify_mark fsn_mark;
	struct fanotify_error_event *event;
};

and 'event' will be normally NULL and if we add FAN_ERROR to mark's mask,
we will allocate event (also containing error info) to use when generating
error event. And then the handling will be somewhat similar to how we
handle overflow events.

								Honza