mbox series

[v3,00/15] File system wide monitoring

Message ID 20210629191035.681913-1-krisman@collabora.com (mailing list archive)
Headers show
Series File system wide monitoring | expand

Message

Gabriel Krisman Bertazi June 29, 2021, 7:10 p.m. UTC
Hi,

This is the third version of the FAN_FS_ERROR patches.  The main change
in this version is the inode information being reported through an FID
record, which means it requires the group to be created with
FAN_REPORT_FID.  It indeed simplifies a lot the FAN_FS_ERROR patch
itself.

This change raises the question of how we report non-inode errors.  On
one hand, we could omit the FID report, but then fsid would also be
ommited.  I chose to report these kind of errors against the root
inode.

The other changes in this iteration were made to attend to Amir
feedback.  Thank you again for your very detailed input.  It is really
appreciated.

This was tested with LTP for regressions, and also using the sample on
the last patch, with a corrupted image.  I can publish the bad image
upon request.

I also pushed the full series to:

  https://gitlab.collabora.com/krisman/linux -b fanotify-notifications-single-slot

Thank you

Original cover letter
---------------------
Hi,

This series follow up on my previous proposal [1] to support file system
wide monitoring.  As suggested by Amir, this proposal drops the ring
buffer in favor of a single slot associated with each mark.  This
simplifies a bit the implementation, as you can see in the code.

As a reminder, This proposal is limited to an interface for
administrators to monitor the health of a file system, instead of a
generic inteface for file errors.  Therefore, this doesn't solve the
problem of writeback errors or the need to watch a specific subtree.

In comparison to the previous RFC, this implementation also drops the
per-fs data and location, and leave those as future extensions.

* Implementation

The feature is implemented on top of fanotify, as a new type of fanotify
mark, FAN_ERROR, which a file system monitoring tool can register to
receive error notifications.  When an error occurs a new notification is
generated, in addition followed by this info field:

 - FS generic data: A file system agnostic structure that has a generic
 error code and identifies the filesystem.  Basically, it let's
 userspace know something happened on a monitored filesystem.  Since
 only the first error is recorded since the last read, this also
 includes a counter of errors that happened since the last read.

* Testing

This was tested by watching notifications flowing from an intentionally
corrupted filesystem in different places.  In addition, other events
were watched in an attempt to detect regressions.

Is there a specific testsuite for fanotify I should be running?

* Patches

This patchset is divided as follows: Patch 1 through 5 are refactoring
to fsnotify/fanotify in preparation for FS_ERROR/FAN_ERROR; patch 6 and
7 implement the FS_ERROR API for filesystems to report error; patch 8
add support for FAN_ERROR in fanotify; Patch 9 is an example
implementation for ext4; patch 10 and 11 provide a sample userspace code
and documentation.

I also pushed the full series to:

  https://gitlab.collabora.com/krisman/linux -b fanotify-notifications-single-slot

[1] https://lwn.net/Articles/854545/
[2] https://lwn.net/Articles/856916/

Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Cc: jack@suse.com
To: amir73il@gmail.com
Cc: dhowells@redhat.com
Cc: khazhy@google.com
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-ext4@vger.kernel.org

Amir Goldstein (1):
  fsnotify: pass arguments of fsnotify() in struct fsnotify_event_info

Gabriel Krisman Bertazi (14):
  fsnotify: Don't insert unmergeable events in hashtable
  fanotify: Fold event size calculation to its own function
  fanotify: Split fsid check from other fid mode checks
  fanotify: Split superblock marks out to a new cache
  inotify: Don't force FS_IN_IGNORED
  fsnotify: Add helper to detect overflow_event
  fsnotify: Support passing argument to insert callback on add_event
  fsnotify: Always run the merge hook
  fsnotify: Support FS_ERROR event type
  fsnotify: Introduce helpers to send error_events
  fanotify: Introduce FAN_FS_ERROR event
  ext4: Send notifications on error
  samples: Add fs error monitoring example
  docs: Document the FAN_FS_ERROR event

 .../admin-guide/filesystem-monitoring.rst     |  70 +++++++
 Documentation/admin-guide/index.rst           |   1 +
 fs/ext4/super.c                               |   8 +
 fs/notify/fanotify/fanotify.c                 | 189 ++++++++++++-----
 fs/notify/fanotify/fanotify.h                 |  69 ++++++-
 fs/notify/fanotify/fanotify_user.c            | 195 +++++++++++++++---
 fs/notify/fsnotify.c                          |  85 ++++----
 fs/notify/inotify/inotify_fsnotify.c          |   5 +-
 fs/notify/inotify/inotify_user.c              |   6 +-
 fs/notify/notification.c                      |   8 +-
 include/linux/fanotify.h                      |  11 +-
 include/linux/fsnotify.h                      |  28 ++-
 include/linux/fsnotify_backend.h              | 104 ++++++++--
 include/uapi/linux/fanotify.h                 |   8 +
 samples/Kconfig                               |   9 +
 samples/Makefile                              |   1 +
 samples/fanotify/Makefile                     |   3 +
 samples/fanotify/fs-monitor.c                 | 134 ++++++++++++
 18 files changed, 777 insertions(+), 157 deletions(-)
 create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
 create mode 100644 samples/fanotify/Makefile
 create mode 100644 samples/fanotify/fs-monitor.c

Comments

Amir Goldstein June 30, 2021, 5:10 a.m. UTC | #1
+CC linux-api

On Tue, Jun 29, 2021 at 10:10 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Hi,
>
> This is the third version of the FAN_FS_ERROR patches.  The main change
> in this version is the inode information being reported through an FID
> record, which means it requires the group to be created with
> FAN_REPORT_FID.  It indeed simplifies a lot the FAN_FS_ERROR patch
> itself.

I am glad that you took this path.
Uniformity across the UAPI is important.

>
> This change raises the question of how we report non-inode errors.  On
> one hand, we could omit the FID report, but then fsid would also be
> ommited.  I chose to report these kind of errors against the root
> inode.
>

There are other option to consider.

To avoid special casing error events in fanotify event read code,
it would is convenient to use a non-zero length FID, but you can
use a 8 bytes zero buffer as NULL-FID

If I am not mistaken, that amounts to 64 bytes of event_len
including the event_metadata and both records which is pretty
nicely aligned.

All 3 handle_type options below are valid options:
1. handle_type FILEID_ROOT
2. handle_type FILEID_INVALID
3. handle_type FILEID_INO32_GEN (i.e. ino=0;gen=0)

The advantage of option #3 is that the monitoring program
does not need to special case the NULL_FID case when
parsing the FID to informative user message.

> The other changes in this iteration were made to attend to Amir
> feedback.  Thank you again for your very detailed input.  It is really
> appreciated.
>
> This was tested with LTP for regressions, and also using the sample on
> the last patch, with a corrupted image.  I can publish the bad image
> upon request.

Just to set expectations, we now have an official standard for fanotify [1]
where we require an LTP test and man page update patch before merge
of UAPI changes.

That should not stop us from continuing the review process - it's just
a heads up, but I think that we are down to implementation details in
the review anyway and that the UAPI (give or take root inode) is
pretty much clear at this point, so spreading the review of UAPI to
wider audience is not a bad idea.

w.r.t man page update, I know you have created the admin-guide book,
but it's not the same. For linux-api reviewers, reviewing the changed to
fanotify man pages is good way to make sure we did not miss any corners.

w.r.t LTP test, I don't think that using a corrupt image will be a good way
for an LTP test. LTP tests can prepare and mount an ext4 loop image.
Does ext4 have some debugging method to inject an error?
Because that would be the best way IMO.
If it doesn't, you can implement this in ext4 and use it in the test if that
debug file exists - skip the test otherwise - it's common practice.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/YMKv1U7tNPK955ho@google.com/
Jan Kara July 8, 2021, 11:32 a.m. UTC | #2
On Wed 30-06-21 08:10:39, Amir Goldstein wrote:
> > This change raises the question of how we report non-inode errors.  On
> > one hand, we could omit the FID report, but then fsid would also be
> > ommited.  I chose to report these kind of errors against the root
> > inode.
> 
> There are other option to consider.

Yeah, so reporting against root inode has the disadvantage that in
principle you don't know whether the error really happened on the root
inode or whether the event is in fact without an inode. So some information
is lost here. Maybe the set of errors that can happen without an inode and
the set of errors that can happen with an inode are disjoint, so no
information is actually lost but then does reporting root inode actually
bring any benefit? So I agree reporting root inode is not ideal.

> To avoid special casing error events in fanotify event read code,
> it would is convenient to use a non-zero length FID, but you can
> use a 8 bytes zero buffer as NULL-FID
> 
> If I am not mistaken, that amounts to 64 bytes of event_len
> including the event_metadata and both records which is pretty
> nicely aligned.
> 
> All 3 handle_type options below are valid options:
> 1. handle_type FILEID_ROOT
> 2. handle_type FILEID_INVALID
> 3. handle_type FILEID_INO32_GEN (i.e. ino=0;gen=0)
> 
> The advantage of option #3 is that the monitoring program
> does not need to special case the NULL_FID case when
> parsing the FID to informative user message.

I actually like #2 more. #1 has similar problems as I outlined above for
reporting root dir. The advantage that userspace won't have to special case
FILEID_INO32_GEN FID in #3 is IMHO a dream - if you want a good message,
you should report the problem was on a superblock, not some just some
zeroes instead of proper inode info. Even more if it was on a real inode, 
good reporter will e.g. try to resolve it to a path.

Also because we will presumably have more filesystems supporting this in
the future, normal inodes can be reported with other handle types anyway.
So IMO #2 is the most sensible option.

> w.r.t LTP test, I don't think that using a corrupt image will be a good way
> for an LTP test. LTP tests can prepare and mount an ext4 loop image.
> Does ext4 have some debugging method to inject an error?
> Because that would be the best way IMO.
> If it doesn't, you can implement this in ext4 and use it in the test if that
> debug file exists - skip the test otherwise - it's common practice.

Ext4 does not have an error injection facility. Not sure if we want to
force Gabriel into creating one just for these LTP tests. Actually creating
ext4 images with known problems (through mke2fs and debugfs) should be
rather easy and we could then test we get expected error notifications
back...

								Honza
Amir Goldstein July 8, 2021, 12:25 p.m. UTC | #3
On Thu, Jul 8, 2021 at 2:32 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 30-06-21 08:10:39, Amir Goldstein wrote:
> > > This change raises the question of how we report non-inode errors.  On
> > > one hand, we could omit the FID report, but then fsid would also be
> > > ommited.  I chose to report these kind of errors against the root
> > > inode.
> >
> > There are other option to consider.
>
> Yeah, so reporting against root inode has the disadvantage that in
> principle you don't know whether the error really happened on the root
> inode or whether the event is in fact without an inode. So some information
> is lost here. Maybe the set of errors that can happen without an inode and
> the set of errors that can happen with an inode are disjoint, so no
> information is actually lost but then does reporting root inode actually
> bring any benefit? So I agree reporting root inode is not ideal.
>
> > To avoid special casing error events in fanotify event read code,
> > it would is convenient to use a non-zero length FID, but you can
> > use a 8 bytes zero buffer as NULL-FID
> >
> > If I am not mistaken, that amounts to 64 bytes of event_len
> > including the event_metadata and both records which is pretty
> > nicely aligned.
> >
> > All 3 handle_type options below are valid options:
> > 1. handle_type FILEID_ROOT
> > 2. handle_type FILEID_INVALID
> > 3. handle_type FILEID_INO32_GEN (i.e. ino=0;gen=0)
> >
> > The advantage of option #3 is that the monitoring program
> > does not need to special case the NULL_FID case when
> > parsing the FID to informative user message.
>
> I actually like #2 more. #1 has similar problems as I outlined above for
> reporting root dir. The advantage that userspace won't have to special case
> FILEID_INO32_GEN FID in #3 is IMHO a dream - if you want a good message,
> you should report the problem was on a superblock, not some just some
> zeroes instead of proper inode info. Even more if it was on a real inode,
> good reporter will e.g. try to resolve it to a path.
>
> Also because we will presumably have more filesystems supporting this in
> the future, normal inodes can be reported with other handle types anyway.
> So IMO #2 is the most sensible option.
>

I am perfectly fine with #2, but just FYI, there is no ambiguity around using
FILEID_ROOT - it is a special "application level" type used by nfsd to describe
a handle to the root of an export (I'm not in which protocol versions).
The inode itself does not even need to be the root of the filesystem
(it seldom is)
nfsd keeps a dentry with elevated refcount per export in order to "decode"
those export root handles.

The filesystem itself will never return this value, so when reporting
errors on a
specific inode and the specific inode is the filesystem root directory then
the type will be the same as the native filesystem type and not FILEID_ROOT.

For this reason I thought it would be safe to use FILEID_ROOT for reporting
events on sb without inode and FILEID_INVALID for reporting failure to encode
fh, for example when not guessing max_fh_len correctly.

But it's just nice to have.

> > w.r.t LTP test, I don't think that using a corrupt image will be a good way
> > for an LTP test. LTP tests can prepare and mount an ext4 loop image.
> > Does ext4 have some debugging method to inject an error?
> > Because that would be the best way IMO.
> > If it doesn't, you can implement this in ext4 and use it in the test if that
> > debug file exists - skip the test otherwise - it's common practice.
>
> Ext4 does not have an error injection facility. Not sure if we want to
> force Gabriel into creating one just for these LTP tests. Actually creating
> ext4 images with known problems (through mke2fs and debugfs) should be
> rather easy and we could then test we get expected error notifications
> back...

Ok.

Thanks,
Amir.