diff mbox series

[v7,05/18] fsnotify: introduce pre-content permission events

Message ID 141e2cc2dfac8b2f49c1c8d219dd7c20925b2cef.1731433903.git.josef@toxicpanda.com (mailing list archive)
State New
Headers show
Series fanotify: add pre-content hooks | expand

Commit Message

Josef Bacik Nov. 12, 2024, 5:55 p.m. UTC
From: Amir Goldstein <amir73il@gmail.com>

The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.

Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, so content
scanners could inspect the content filled by pre-content event handler.

Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is
modified by syscalls as write() and fallocate().

FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.

The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.

There are some specific requirements from filesystems that could
be used with pre-content events, so add a flag for fs to opt-in
for pre-content events explicitly before they can be used.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fsnotify.c             |  2 +-
 include/linux/fs.h               |  1 +
 include/linux/fsnotify.h         | 37 ++++++++++++++++++++++++++++----
 include/linux/fsnotify_backend.h | 12 +++++++++--
 security/selinux/hooks.c         |  3 ++-
 5 files changed, 47 insertions(+), 8 deletions(-)

Comments

Linus Torvalds Nov. 12, 2024, 8:12 p.m. UTC | #1
On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
>
>  #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +static inline int fsnotify_pre_content(struct file *file)
> +{
> +       struct inode *inode = file_inode(file);
> +
> +       /*
> +        * Pre-content events are only reported for regular files and dirs
> +        * if there are any pre-content event watchers on this sb.
> +        */
> +       if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> +           !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> +           !fsnotify_sb_has_priority_watchers(inode->i_sb,
> +                                              FSNOTIFY_PRIO_PRE_CONTENT))
> +               return 0;
> +
> +       return fsnotify_file(file, FS_PRE_ACCESS);
> +}

Yeah, no.

None of this should check inode->i_sb->s_iflags at any point.

The "is there a pre-content" thing should check one thing, and one
thing only: that "is this file watched" flag.

The whole indecipherable mess of inline functions that do random
things in <linux/fsnotify.h> needs to be cleaned up, not made even
more indecipherable.

I'm NAKing this whole series until this is all sane and cleaned up,
and I don't want to see a new hacky version being sent out tomorrow
with just another layer of new hacks, with random new inline functions
that call other inline functions and have complex odd conditionals
that make no sense.

Really. If the new hooks don't have that *SINGLE* bit test, they will
not get merged.

And that *SINGLE* bit test had better not be hidden under multiple
layers of odd inline functions.

You DO NOT get to use the same old broken complex function for the new
hooks that then mix these odd helpers.

This whole "add another crazy inline function using another crazy
helper needs to STOP. Later on in the patch series you do

+/*
+ * fsnotify_truncate_perm - permission hook before file truncate
+ */
+static inline int fsnotify_truncate_perm(const struct path *path,
loff_t length)
+{
+       return fsnotify_pre_content(path, &length, 0);
+}

or things like this:

+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+       if (!(file->f_mode & FMODE_NOTIFY_PERM))
+               return false;
+
+       if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
+               return false;
+
+       return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
+}

and no, NONE of that should be tested at runtime.

I repeat: you should have *ONE* inline function that basically does

 static inline bool fsnotify_file_watched(struct file *file)
 {
        return file && unlikely(file->f_mode & FMODE_NOTIFY_PERM);
 }

and absolutely nothing else. If that file is set, the file has
notification events, and you go to an out-of-line slow case. You don't
inline the unlikely cases after that.

And you make sure that you only set that special bit on files and
filesystems that support it. You most definitely don't check for
SB_I_ALLOW_HSM kind of flags at runtime in critical code.

               Linus
Amir Goldstein Nov. 12, 2024, 11:06 p.m. UTC | #2
On Tue, Nov 12, 2024 at 9:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> >  #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> > +static inline int fsnotify_pre_content(struct file *file)
> > +{
> > +       struct inode *inode = file_inode(file);
> > +
> > +       /*
> > +        * Pre-content events are only reported for regular files and dirs
> > +        * if there are any pre-content event watchers on this sb.
> > +        */
> > +       if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> > +           !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> > +           !fsnotify_sb_has_priority_watchers(inode->i_sb,
> > +                                              FSNOTIFY_PRIO_PRE_CONTENT))
> > +               return 0;
> > +
> > +       return fsnotify_file(file, FS_PRE_ACCESS);
> > +}
>
> Yeah, no.
>
> None of this should check inode->i_sb->s_iflags at any point.
>
> The "is there a pre-content" thing should check one thing, and one
> thing only: that "is this file watched" flag.
> The whole indecipherable mess of inline functions that do random
> things in <linux/fsnotify.h> needs to be cleaned up, not made even
> more indecipherable.
>
> I'm NAKing this whole series until this is all sane and cleaned up,
> and I don't want to see a new hacky version being sent out tomorrow
> with just another layer of new hacks, with random new inline functions
> that call other inline functions and have complex odd conditionals
> that make no sense.
>
> Really. If the new hooks don't have that *SINGLE* bit test, they will
> not get merged.
>
> And that *SINGLE* bit test had better not be hidden under multiple
> layers of odd inline functions.
>
> You DO NOT get to use the same old broken complex function for the new
> hooks that then mix these odd helpers.
>
> This whole "add another crazy inline function using another crazy
> helper needs to STOP. Later on in the patch series you do
>
> +/*
> + * fsnotify_truncate_perm - permission hook before file truncate
> + */
> +static inline int fsnotify_truncate_perm(const struct path *path,
> loff_t length)
> +{
> +       return fsnotify_pre_content(path, &length, 0);
> +}
>
> or things like this:
>
> +static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
> +{
> +       if (!(file->f_mode & FMODE_NOTIFY_PERM))
> +               return false;
> +
> +       if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
> +               return false;
> +
> +       return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
> +}
>
> and no, NONE of that should be tested at runtime.
>
> I repeat: you should have *ONE* inline function that basically does
>
>  static inline bool fsnotify_file_watched(struct file *file)
>  {
>         return file && unlikely(file->f_mode & FMODE_NOTIFY_PERM);
>  }
>
> and absolutely nothing else. If that file is set, the file has
> notification events, and you go to an out-of-line slow case. You don't
> inline the unlikely cases after that.
>
> And you make sure that you only set that special bit on files and
> filesystems that support it. You most definitely don't check for
> SB_I_ALLOW_HSM kind of flags at runtime in critical code.

I understand your point. It makes sense.
But it requires using another FMODE_HSM flag,
because FMODE_NOTIFY_PERM covers also the legacy
FS_ACCESS_PERM event, which has different semantics
that I consider broken, but it is what it is.

I am fine not optimizing out the legacy FS_ACCESS_PERM event
and just making sure not to add new bad code, if that is what you prefer
and I also am fine with using two FMODE_ flags if that is prefered.

Thanks,
Amir.
Linus Torvalds Nov. 12, 2024, 11:48 p.m. UTC | #3
On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
>
> I am fine not optimizing out the legacy FS_ACCESS_PERM event
> and just making sure not to add new bad code, if that is what you prefer
> and I also am fine with using two FMODE_ flags if that is prefered.

So iirc we do have a handful of FMODE flags left. Not many, but I do
think a new one would be fine.

And if we were to run out (and I'm *not* suggesting we do that now!)
we actually have more free bits in "f_flags".

That f_flags set of flags is a mess for other reasons: we expose them
to user space, and we define the bits using octal numbers for random
bad historical reasons, and some architectures specify their own set
or bits, etc etc - nasty.

But if anybody is really worried about running out of f_mode bits, we
could almost certainly turn the existing

        unsigned int f_flags;

into a bitfield, and make it be something like

        unsigned int f_flags:26, f_special:6;

instead, with the rule being that "f_special" only gets set at open
time and never any other time (to avoid any data races with fcntl()
touching the other 24 bits in the word).

[ Bah. I thought we had 8 unused bits in f_flags, but I went and
looked. sparc uses 0x2000000 for __O_TMPFILE, so we actually only have
6 bits unused in f_flags. No actual good reason for the sparc choice I
think, but it is what it is ]

Anyway, I wouldn't begrudge you a bit if that cleans this fsnotify
mess up and makes it much simpler and clearer. I really think that if
we can do this cleanly, using a bit in f_mode is a good cause.

                Linus
Amir Goldstein Nov. 13, 2024, 12:05 a.m. UTC | #4
On Wed, Nov 13, 2024 at 12:48 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
>
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.
>

Maybe I could use just this one bit, but together with the existing
FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
highest watching priority:

FMODE_NOTIFY_HSM (pre-content and all the rest)
FMODE_NOTIFY_PERM (permission and async)
FMODE_NOTIFY_NORMAL (only async events)
FMODE_NOTIFY_NONE (no events)

Thanks,
Amir.
Al Viro Nov. 13, 2024, 12:12 a.m. UTC | #5
On Tue, Nov 12, 2024 at 03:48:10PM -0800, Linus Torvalds wrote:
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
> 
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.

8, 13, 24, 30 and 31.

> But if anybody is really worried about running out of f_mode bits, we
> could almost certainly turn the existing
> 
>         unsigned int f_flags;
> 
> into a bitfield, and make it be something like
> 
>         unsigned int f_flags:26, f_special:6;
> 
> instead, with the rule being that "f_special" only gets set at open
> time and never any other time (to avoid any data races with fcntl()
> touching the other 24 bits in the word).

Ugh...  Actually, I would rather mask that on fcntl side (and possibly
moved FMODE_RANDOM/FMODE_NOREUSE over there as well).

Would make for simpler rules for locking - ->f_mode would be never
changed past open, ->f_flags would have all changes under ->f_lock.
Linus Torvalds Nov. 13, 2024, 12:23 a.m. UTC | #6
On Tue, 12 Nov 2024 at 16:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Ugh...  Actually, I would rather mask that on fcntl side (and possibly
> moved FMODE_RANDOM/FMODE_NOREUSE over there as well).

Yeah, that's probably cleaner. I was thinking the bitfield would be a
simpler solution, but we already mask writes to specific bits on the
fcntl side for other reasons *anyway*, so we might as well mask reads
too, and just not expose any kernel-internal bits to user space.

> Would make for simpler rules for locking - ->f_mode would be never
> changed past open, ->f_flags would have all changes under ->f_lock.

Yeah, sounds sane.

That said, just looking at which bits are used in f_flags is a major
PITA. About half the definitions use octal, with the other half using
hex. Lovely.

So I'd rather not touch that mess until we have to.

                Linus
Linus Torvalds Nov. 13, 2024, 12:38 a.m. UTC | #7
On Tue, 12 Nov 2024 at 16:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 16:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Ugh...  Actually, I would rather mask that on fcntl side (and possibly
> > moved FMODE_RANDOM/FMODE_NOREUSE over there as well).
> >
> > Would make for simpler rules for locking - ->f_mode would be never
> > changed past open, ->f_flags would have all changes under ->f_lock.
>
> Yeah, sounds sane.
>
> That said, just looking at which bits are used in f_flags is a major
> PITA. About half the definitions use octal, with the other half using
> hex. Lovely.
>
> So I'd rather not touch that mess until we have to.

Actually, maybe the locking and the octal/hex mess should be
considered a reason to clean this all up early rather than ignore it.

Looking at that locking code in fadvise() just for the f_mode use does
make me think this would be a really good cleanup.

I note that our fcntl code seems buggy as-is, because while it does
use f_lock for assignments (good), it clearly does *not* use them for
reading.

So it looks like you can actually read inconsistent values.

I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
_addition_ to the f_lock use it has.

The f_mode thing with fadvise() smells like the same bug. Just because
the modifications are serialized wrt each other doesn't mean that
readers are then automatically ok.

           Linus
Al Viro Nov. 13, 2024, 1:19 a.m. UTC | #8
On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> Looking at that locking code in fadvise() just for the f_mode use does
> make me think this would be a really good cleanup.
> 
> I note that our fcntl code seems buggy as-is, because while it does
> use f_lock for assignments (good), it clearly does *not* use them for
> reading.
> 
> So it looks like you can actually read inconsistent values.
> 
> I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> _addition_ to the f_lock use it has.

AFAICS, fasync logics is the fishy part - the rest should be sane.

> The f_mode thing with fadvise() smells like the same bug. Just because
> the modifications are serialized wrt each other doesn't mean that
> readers are then automatically ok.

Reads are also under ->f_lock in there, AFAICS...

Another thing in the vicinity is ->f_mode modifications after the calls
of anon_inode_getfile() in several callers - probably ought to switch
those to anon_inode_getfile_fmode().  That had been discussed back in
April when the function got merged, but "convert to using it" followup
series hadn't materialized...
Al Viro Nov. 13, 2024, 4:30 a.m. UTC | #9
On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > Looking at that locking code in fadvise() just for the f_mode use does
> > make me think this would be a really good cleanup.
> > 
> > I note that our fcntl code seems buggy as-is, because while it does
> > use f_lock for assignments (good), it clearly does *not* use them for
> > reading.
> > 
> > So it looks like you can actually read inconsistent values.
> > 
> > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > _addition_ to the f_lock use it has.
> 
> AFAICS, fasync logics is the fishy part - the rest should be sane.
> 
> > The f_mode thing with fadvise() smells like the same bug. Just because
> > the modifications are serialized wrt each other doesn't mean that
> > readers are then automatically ok.
> 
> Reads are also under ->f_lock in there, AFAICS...
> 
> Another thing in the vicinity is ->f_mode modifications after the calls
> of anon_inode_getfile() in several callers - probably ought to switch
> those to anon_inode_getfile_fmode().  That had been discussed back in
> April when the function got merged, but "convert to using it" followup
> series hadn't materialized...

While we are at it, there's is a couple of kludges I really hate -
mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.

E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
anon_inode_getfd() to anon_inode_getfile_fmode() and adding
a dentry_open_nonotify() to be used by fanotify on the other path.
That's it - no more weird shit in OPEN_FMODE(), etc.

For __FMODE_EXEC it might get trickier (nfs is the main consumer),
but I seriously suspect that something like "have path_openat()
check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
right after struct file allocation" would make a good starting
point; yes, it would affect uselib(2), but... I've no idea whether
it wouldn't be the right thing to do; would be hard to test.

Anyway, untested __FMODE_NONOTIFY side of it:

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..ebd1c82bfb6b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
-			__FMODE_EXEC | __FMODE_NONOTIFY));
+			__FMODE_EXEC));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
 					 sizeof(struct fasync_struct), 0,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9644bc72e457..43fbf29ef03a 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
  *
  * Internal and external open flags are stored together in field f_flags of
  * struct file. Only external open flags shall be allowed in event_f_flags.
- * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
- * excluded.
+ * Internal flags like FMODE_EXEC shall be excluded.
  */
 #define	FANOTIFY_INIT_ALL_EVENT_F_BITS				( \
 		O_ACCMODE	| O_APPEND	| O_NONBLOCK	| \
@@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
 	 * we need a new file handle for the userspace program so it can read even if it was
 	 * originally opened O_WRONLY.
 	 */
-	new_file = dentry_open(path,
-			       group->fanotify_data.f_flags | __FMODE_NONOTIFY,
+	new_file = dentry_open_nonotify(path,
+			       group->fanotify_data.f_flags,
 			       current_cred());
 	if (IS_ERR(new_file)) {
 		/*
@@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
 	unsigned int class = flags & FANOTIFY_CLASS_BITS;
 	unsigned int internal_flags = 0;
+	struct file *file;
 
 	pr_debug("%s: flags=%x event_f_flags=%x\n",
 		 __func__, flags, event_f_flags);
@@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	    (!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
 		return -EINVAL;
 
-	f_flags = O_RDWR | __FMODE_NONOTIFY;
+	f_flags = O_RDWR;
 	if (flags & FAN_CLOEXEC)
 		f_flags |= O_CLOEXEC;
 	if (flags & FAN_NONBLOCK)
@@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 			goto out_destroy_group;
 	}
 
-	fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
+	fd = get_unused_fd_flags(flags);
 	if (fd < 0)
 		goto out_destroy_group;
 
+	file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
+					f_flags, FMODE_NONOTIFY);
+	if (IS_ERR(file)) {
+		fd = PTR_ERR(file);
+		put_unused_fd(fd);
+		goto out_destroy_group;
+	}
+	fd_install(fd, file);
 	return fd;
 
 out_destroy_group:
diff --git a/fs/open.c b/fs/open.c
index acaeb3e25c88..04cb581528ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
 }
 EXPORT_SYMBOL(dentry_open);
 
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+			 const struct cred *cred)
+{
+	struct file *f = alloc_empty_file(flags, cred);
+	if (!IS_ERR(f)) {
+		int error;
+
+		f->f_mode |= FMODE_NONOTIFY;
+		error = vfs_open(path, f);
+		if (error) {
+			fput(f);
+			f = ERR_PTR(error);
+		}
+	}
+	return f;
+}
+
 /**
  * dentry_create - Create and open a file
  * @path: path to create
@@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
 inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 {
 	u64 flags = how->flags;
-	u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
+	u64 strip = O_CLOEXEC;
 	int lookup_flags = 0;
 	int acc_mode = ACC_MODE(flags);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..18888d601550 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2731,6 +2731,8 @@ struct file *dentry_open(const struct path *path, int flags,
 struct file *dentry_create(const struct path *path, int flags, umode_t mode,
 			   const struct cred *cred);
 struct path *backing_file_user_path(struct file *f);
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+			 const struct cred *creds);
 
 /*
  * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
@@ -3620,11 +3622,9 @@ struct ctl_table;
 int __init list_bdev_fs_names(char *buf, size_t size);
 
 #define __FMODE_EXEC		((__force int) FMODE_EXEC)
-#define __FMODE_NONOTIFY	((__force int) FMODE_NONOTIFY)
 
 #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
-#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
-					    (flag & __FMODE_NONOTIFY)))
+#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
 
 static inline bool is_sxid(umode_t mode)
 {
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 80f37a0d40d7..613475285643 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -6,7 +6,6 @@
 
 /*
  * FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x4000000
  * These cannot be used by userspace O_* until internal and external open
  * flags are split.
  * -Eric Paris
Amir Goldstein Nov. 13, 2024, 8:50 a.m. UTC | #10
On Wed, Nov 13, 2024 at 5:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> > On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > > Looking at that locking code in fadvise() just for the f_mode use does
> > > make me think this would be a really good cleanup.
> > >
> > > I note that our fcntl code seems buggy as-is, because while it does
> > > use f_lock for assignments (good), it clearly does *not* use them for
> > > reading.
> > >
> > > So it looks like you can actually read inconsistent values.
> > >
> > > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > > _addition_ to the f_lock use it has.
> >
> > AFAICS, fasync logics is the fishy part - the rest should be sane.
> >
> > > The f_mode thing with fadvise() smells like the same bug. Just because
> > > the modifications are serialized wrt each other doesn't mean that
> > > readers are then automatically ok.
> >
> > Reads are also under ->f_lock in there, AFAICS...
> >
> > Another thing in the vicinity is ->f_mode modifications after the calls
> > of anon_inode_getfile() in several callers - probably ought to switch
> > those to anon_inode_getfile_fmode().  That had been discussed back in
> > April when the function got merged, but "convert to using it" followup
> > series hadn't materialized...
>
> While we are at it, there's is a couple of kludges I really hate -
> mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.
>
> E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
> anon_inode_getfd() to anon_inode_getfile_fmode() and adding
> a dentry_open_nonotify() to be used by fanotify on the other path.
> That's it - no more weird shit in OPEN_FMODE(), etc.
>
> For __FMODE_EXEC it might get trickier (nfs is the main consumer),
> but I seriously suspect that something like "have path_openat()
> check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
> right after struct file allocation" would make a good starting
> point; yes, it would affect uselib(2), but... I've no idea whether
> it wouldn't be the right thing to do; would be hard to test.
>
> Anyway, untested __FMODE_NONOTIFY side of it:
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 22dd9dcce7ec..ebd1c82bfb6b 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
>          * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>          * is defined as O_NONBLOCK on some platforms and not on others.
>          */
> -       BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +       BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
>                 HWEIGHT32(
>                         (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> -                       __FMODE_EXEC | __FMODE_NONOTIFY));
> +                       __FMODE_EXEC));
>
>         fasync_cache = kmem_cache_create("fasync_cache",
>                                          sizeof(struct fasync_struct), 0,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9644bc72e457..43fbf29ef03a 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
>   *
>   * Internal and external open flags are stored together in field f_flags of
>   * struct file. Only external open flags shall be allowed in event_f_flags.
> - * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
> - * excluded.
> + * Internal flags like FMODE_EXEC shall be excluded.
>   */
>  #define        FANOTIFY_INIT_ALL_EVENT_F_BITS                          ( \
>                 O_ACCMODE       | O_APPEND      | O_NONBLOCK    | \
> @@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
>          * we need a new file handle for the userspace program so it can read even if it was
>          * originally opened O_WRONLY.
>          */
> -       new_file = dentry_open(path,
> -                              group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> +       new_file = dentry_open_nonotify(path,
> +                              group->fanotify_data.f_flags,
>                                current_cred());
>         if (IS_ERR(new_file)) {
>                 /*
> @@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>         unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
>         unsigned int class = flags & FANOTIFY_CLASS_BITS;
>         unsigned int internal_flags = 0;
> +       struct file *file;
>
>         pr_debug("%s: flags=%x event_f_flags=%x\n",
>                  __func__, flags, event_f_flags);
> @@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>             (!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
>                 return -EINVAL;
>
> -       f_flags = O_RDWR | __FMODE_NONOTIFY;
> +       f_flags = O_RDWR;
>         if (flags & FAN_CLOEXEC)
>                 f_flags |= O_CLOEXEC;
>         if (flags & FAN_NONBLOCK)
> @@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>                         goto out_destroy_group;
>         }
>
> -       fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> +       fd = get_unused_fd_flags(flags);
>         if (fd < 0)
>                 goto out_destroy_group;
>
> +       file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
> +                                       f_flags, FMODE_NONOTIFY);
> +       if (IS_ERR(file)) {
> +               fd = PTR_ERR(file);
> +               put_unused_fd(fd);
> +               goto out_destroy_group;
> +       }
> +       fd_install(fd, file);
>         return fd;
>
>  out_destroy_group:
> diff --git a/fs/open.c b/fs/open.c
> index acaeb3e25c88..04cb581528ff 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
>  }
>  EXPORT_SYMBOL(dentry_open);
>
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> +                        const struct cred *cred)
> +{
> +       struct file *f = alloc_empty_file(flags, cred);
> +       if (!IS_ERR(f)) {
> +               int error;
> +
> +               f->f_mode |= FMODE_NONOTIFY;
> +               error = vfs_open(path, f);
> +               if (error) {
> +                       fput(f);
> +                       f = ERR_PTR(error);
> +               }
> +       }
> +       return f;
> +}
> +
>  /**
>   * dentry_create - Create and open a file
>   * @path: path to create
> @@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
>  inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>  {
>         u64 flags = how->flags;
> -       u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
> +       u64 strip = O_CLOEXEC;
>         int lookup_flags = 0;
>         int acc_mode = ACC_MODE(flags);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..18888d601550 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2731,6 +2731,8 @@ struct file *dentry_open(const struct path *path, int flags,
>  struct file *dentry_create(const struct path *path, int flags, umode_t mode,
>                            const struct cred *cred);
>  struct path *backing_file_user_path(struct file *f);
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> +                        const struct cred *creds);
>
>  /*
>   * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
> @@ -3620,11 +3622,9 @@ struct ctl_table;
>  int __init list_bdev_fs_names(char *buf, size_t size);
>
>  #define __FMODE_EXEC           ((__force int) FMODE_EXEC)
> -#define __FMODE_NONOTIFY       ((__force int) FMODE_NONOTIFY)
>
>  #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
> -#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
> -                                           (flag & __FMODE_NONOTIFY)))
> +#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
>
>  static inline bool is_sxid(umode_t mode)
>  {
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 80f37a0d40d7..613475285643 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -6,7 +6,6 @@
>
>  /*
>   * FMODE_EXEC is 0x20
> - * FMODE_NONOTIFY is 0x4000000
>   * These cannot be used by userspace O_* until internal and external open
>   * flags are split.
>   * -Eric Paris

Nice. I will take it for a test drive.

Thanks,
Amir.
Christian Brauner Nov. 13, 2024, 10:10 a.m. UTC | #11
On Tue, Nov 12, 2024 at 03:48:10PM -0800, Linus Torvalds wrote:
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
> 
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.

I freed up five bits and commented which bits are available in
include/linux/fs.h.
Amir Goldstein Nov. 13, 2024, 2:36 p.m. UTC | #12
On Wed, Nov 13, 2024 at 5:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> > On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > > Looking at that locking code in fadvise() just for the f_mode use does
> > > make me think this would be a really good cleanup.
> > >
> > > I note that our fcntl code seems buggy as-is, because while it does
> > > use f_lock for assignments (good), it clearly does *not* use them for
> > > reading.
> > >
> > > So it looks like you can actually read inconsistent values.
> > >
> > > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > > _addition_ to the f_lock use it has.
> >
> > AFAICS, fasync logics is the fishy part - the rest should be sane.
> >
> > > The f_mode thing with fadvise() smells like the same bug. Just because
> > > the modifications are serialized wrt each other doesn't mean that
> > > readers are then automatically ok.
> >
> > Reads are also under ->f_lock in there, AFAICS...
> >
> > Another thing in the vicinity is ->f_mode modifications after the calls
> > of anon_inode_getfile() in several callers - probably ought to switch
> > those to anon_inode_getfile_fmode().  That had been discussed back in
> > April when the function got merged, but "convert to using it" followup
> > series hadn't materialized...
>
> While we are at it, there's is a couple of kludges I really hate -
> mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.
>
> E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
> anon_inode_getfd() to anon_inode_getfile_fmode() and adding
> a dentry_open_nonotify() to be used by fanotify on the other path.
> That's it - no more weird shit in OPEN_FMODE(), etc.
>
> For __FMODE_EXEC it might get trickier (nfs is the main consumer),
> but I seriously suspect that something like "have path_openat()
> check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
> right after struct file allocation" would make a good starting
> point; yes, it would affect uselib(2), but... I've no idea whether
> it wouldn't be the right thing to do; would be hard to test.
>
> Anyway, untested __FMODE_NONOTIFY side of it:
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 22dd9dcce7ec..ebd1c82bfb6b 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
>          * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>          * is defined as O_NONBLOCK on some platforms and not on others.
>          */
> -       BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +       BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
>                 HWEIGHT32(
>                         (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> -                       __FMODE_EXEC | __FMODE_NONOTIFY));
> +                       __FMODE_EXEC));
>
>         fasync_cache = kmem_cache_create("fasync_cache",
>                                          sizeof(struct fasync_struct), 0,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9644bc72e457..43fbf29ef03a 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
>   *
>   * Internal and external open flags are stored together in field f_flags of
>   * struct file. Only external open flags shall be allowed in event_f_flags.
> - * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
> - * excluded.
> + * Internal flags like FMODE_EXEC shall be excluded.
>   */
>  #define        FANOTIFY_INIT_ALL_EVENT_F_BITS                          ( \
>                 O_ACCMODE       | O_APPEND      | O_NONBLOCK    | \
> @@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
>          * we need a new file handle for the userspace program so it can read even if it was
>          * originally opened O_WRONLY.
>          */
> -       new_file = dentry_open(path,
> -                              group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> +       new_file = dentry_open_nonotify(path,
> +                              group->fanotify_data.f_flags,

I would make this a bit more generic helper and the comment above
is truly clueless:

        /*
-        * we need a new file handle for the userspace program so it
can read even if it was
-        * originally opened O_WRONLY.
+        * We provide an fd for the userspace program, so it could access the
+        * file without generating fanotify events itself.
         */
-       new_file = dentry_open(path,
-                              group->fanotify_data.f_flags | __FMODE_NONOTIFY,
-                              current_cred());
+       new_file = dentry_open_fmode(path, group->fanotify_data.f_flags,
+                                    FMODE_NONOTIFY, current_cred());



>                                current_cred());
>         if (IS_ERR(new_file)) {
>                 /*
> @@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>         unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
>         unsigned int class = flags & FANOTIFY_CLASS_BITS;
>         unsigned int internal_flags = 0;
> +       struct file *file;
>
>         pr_debug("%s: flags=%x event_f_flags=%x\n",
>                  __func__, flags, event_f_flags);
> @@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>             (!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
>                 return -EINVAL;
>
> -       f_flags = O_RDWR | __FMODE_NONOTIFY;
> +       f_flags = O_RDWR;
>         if (flags & FAN_CLOEXEC)
>                 f_flags |= O_CLOEXEC;
>         if (flags & FAN_NONBLOCK)
> @@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>                         goto out_destroy_group;
>         }
>
> -       fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> +       fd = get_unused_fd_flags(flags);

s/flags/f_flags

>         if (fd < 0)
>                 goto out_destroy_group;
>
> +       file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
> +                                       f_flags, FMODE_NONOTIFY);
> +       if (IS_ERR(file)) {
> +               fd = PTR_ERR(file);
> +               put_unused_fd(fd);
> +               goto out_destroy_group;
> +       }
> +       fd_install(fd, file);
>         return fd;
>
>  out_destroy_group:
> diff --git a/fs/open.c b/fs/open.c
> index acaeb3e25c88..04cb581528ff 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
>  }
>  EXPORT_SYMBOL(dentry_open);
>
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> +                        const struct cred *cred)
> +{
> +       struct file *f = alloc_empty_file(flags, cred);
> +       if (!IS_ERR(f)) {
> +               int error;
> +
> +               f->f_mode |= FMODE_NONOTIFY;
> +               error = vfs_open(path, f);
> +               if (error) {
> +                       fput(f);
> +                       f = ERR_PTR(error);
> +               }
> +       }
> +       return f;
> +}
> +
>  /**
>   * dentry_create - Create and open a file
>   * @path: path to create
> @@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
>  inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>  {
>         u64 flags = how->flags;
> -       u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
> +       u64 strip = O_CLOEXEC;
>         int lookup_flags = 0;
>         int acc_mode = ACC_MODE(flags);
>

Get rid of another stale comment:

        /*
-        * Strip flags that either shouldn't be set by userspace like
-        * FMODE_NONOTIFY or that aren't relevant in determining struct
-        * open_flags like O_CLOEXEC.
+        * Strip flags that aren't relevant in determining struct open_flags.
         */

With these changed, you can add:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

With the f_flags typo fixed, this passed LTP sanity tests, but I am
going to test the NONOTIFY functionally some more.

Thanks,
Amir.
Linus Torvalds Nov. 13, 2024, 4:57 p.m. UTC | #13
On Tue, 12 Nov 2024 at 16:06, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Maybe I could use just this one bit, but together with the existing
> FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
> highest watching priority:

So you'd use two bits, but one of those would re-use the existing
FMODE_NONOTIFY? That sounds perfectly fine to me.

             Linus
Amir Goldstein Nov. 13, 2024, 6:49 p.m. UTC | #14
On Wed, Nov 13, 2024 at 5:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 16:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Maybe I could use just this one bit, but together with the existing
> > FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
> > highest watching priority:
>
> So you'd use two bits, but one of those would re-use the existing
> FMODE_NONOTIFY? That sounds perfectly fine to me.
>

Yes, exactly, like this:

/*
 * The two FMODE_NONOTIFY_ bits used together have a special meaning of
 * not reporting any events at all including non-permission events.
 * These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
 *
 * FMODE_NONOTIFY_HSM - suppress only pre-content events.
 * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
 * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
 */
#define FMODE_NONOTIFY_MASK \
        (FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
#define FMODE_NOTIFY(mode) \
        ((mode) & FMODE_NONOTIFY_MASK)

Please see attached patch (build and sanity tested) to make sure that
we are on the same page.

Going forward in the patch series, the choice of the NONOTIFY lingo
creates some double negatives, like:

        /*
         * read()/write() and other types of access generate pre-content events.
         */
        if (!likely(file->f_mode & FMODE_NONOTIFY_HSM)) {
                int ret = fsnotify_path(&file->f_path);

But it was easier for me to work with NONOTIFY flags.

Thanks,
Amir.
Amir Goldstein Nov. 13, 2024, 7:11 p.m. UTC | #15
On Tue, Nov 12, 2024 at 9:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> >  #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> > +static inline int fsnotify_pre_content(struct file *file)
> > +{
> > +       struct inode *inode = file_inode(file);
> > +
> > +       /*
> > +        * Pre-content events are only reported for regular files and dirs
> > +        * if there are any pre-content event watchers on this sb.
> > +        */
> > +       if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> > +           !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> > +           !fsnotify_sb_has_priority_watchers(inode->i_sb,
> > +                                              FSNOTIFY_PRIO_PRE_CONTENT))
> > +               return 0;
> > +
> > +       return fsnotify_file(file, FS_PRE_ACCESS);
> > +}
>
> Yeah, no.
>
> None of this should check inode->i_sb->s_iflags at any point.
>
> The "is there a pre-content" thing should check one thing, and one
> thing only: that "is this file watched" flag.
>
> The whole indecipherable mess of inline functions that do random
> things in <linux/fsnotify.h> needs to be cleaned up, not made even
> more indecipherable.
>
> I'm NAKing this whole series until this is all sane and cleaned up,
> and I don't want to see a new hacky version being sent out tomorrow
> with just another layer of new hacks, with random new inline functions
> that call other inline functions and have complex odd conditionals
> that make no sense.
>
> Really. If the new hooks don't have that *SINGLE* bit test, they will
> not get merged.
>
> And that *SINGLE* bit test had better not be hidden under multiple
> layers of odd inline functions.
>
> You DO NOT get to use the same old broken complex function for the new
> hooks that then mix these odd helpers.

Up to here I understand.

>
> This whole "add another crazy inline function using another crazy
> helper needs to STOP. Later on in the patch series you do
>

The patch that I sent did add another convenience helper
fsnotify_path(), but as long as it is not hiding crazy tests,
and does not expand to huge inlined code, I don't see the problem.

Those convenience helpers help me to maintain readability and code
reuse. I do agree that the new hooks that can use the new open-time
check semantics should not expand to huge inlined code.

> +/*
> + * fsnotify_truncate_perm - permission hook before file truncate
> + */
> +static inline int fsnotify_truncate_perm(const struct path *path,
> loff_t length)
> +{
> +       return fsnotify_pre_content(path, &length, 0);
> +}
>

This example that you pointed at, I do not understand.
truncate() does not happen on an open file, so I cannot use the
FMODE_NONOTIFY_ test.

This is what I have in my WIP branch:

static inline int fsnotify_file_range(const struct path *path,
                                      const loff_t *ppos, size_t count)
{
        struct file_range range;
        const void *data;
        int data_type;

        /* Report page aligned range only when pos is known */
        if (ppos) {
                range.path = path;
                range.pos = PAGE_ALIGN_DOWN(*ppos);
                range.count = PAGE_ALIGN(*ppos + count) - range.pos;
                data = &range;
                data_type = FSNOTIFY_EVENT_FILE_RANGE;
        } else {
                data = path;
                data_type = FSNOTIFY_EVENT_PATH;
        }

        return fsnotify_parent(path->dentry, FS_PRE_ACCESS, data, data_type);
}

/*
 * fsnotify_truncate_perm - permission hook before file truncate
 */
static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
{
        struct inode *inode = d_inode(path->dentry);

        /*
         * Pre-content events are only reported for regular files and dirs
         * if there are any pre-content event watchers on this sb.
         */
        if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
            !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
            !unlikely(fsnotify_sb_has_priority_watchers(inode->i_sb,
                                               FSNOTIFY_PRIO_PRE_CONTENT)))
                return 0;

        return fsnotify_file_range(path, &length, 0);
}

fsnotify_file_range() does not need to be inlined, but I do want
to reuse the code of fsnotify_file_range() which is also called for
the common file pre-access hook.

So did you mean that the unlikely stuff (i.e. fsnotify_file_range())
should be an indirect call? or something else?

Thanks,
Amir.
Al Viro Nov. 13, 2024, 8:31 p.m. UTC | #16
On Wed, Nov 13, 2024 at 03:36:33PM +0100, Amir Goldstein wrote:

> I would make this a bit more generic helper and the comment above
> is truly clueless:
> 
>         /*
> -        * we need a new file handle for the userspace program so it
> can read even if it was
> -        * originally opened O_WRONLY.
> +        * We provide an fd for the userspace program, so it could access the
> +        * file without generating fanotify events itself.
>          */
> -       new_file = dentry_open(path,
> -                              group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> -                              current_cred());
> +       new_file = dentry_open_fmode(path, group->fanotify_data.f_flags,
> +                                    FMODE_NONOTIFY, current_cred());

Hmm...  Not sure I like that, TBH, since that'll lead to temptation to
turn dentry_open() into a wrapper for that thing and I would rather
keep them separate.

> > -       fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> > +       fd = get_unused_fd_flags(flags);
> 
> s/flags/f_flags

ACK - thanks for catching that one.
Linus Torvalds Nov. 13, 2024, 9:22 p.m. UTC | #17
On Wed, 13 Nov 2024 at 11:11, Amir Goldstein <amir73il@gmail.com> wrote:
>
> >
> > This whole "add another crazy inline function using another crazy
> > helper needs to STOP. Later on in the patch series you do
> >
>
> The patch that I sent did add another convenience helper
> fsnotify_path(), but as long as it is not hiding crazy tests,
> and does not expand to huge inlined code, I don't see the problem.

So I don't mind adding a new inline function for convenience.

But I do mind the whole "multiple levels of inline functions" model,
and the thing I _particularly_ hate is the "mask is usually constant
so that the effect of the inline function is practically two different
things" as exemplified by "fsnotify_file()" and friends.

At that point, the inline function isn't a helper any more, it's a
hindrance to understanding what the heck is going on.

Basically, as an example: fsnotify_file() is actually two very
different things depending on the "mask" argument, an that argument is
*typically* a constant.

In fact, in fsnotify_file_area_perm() is very much is a constant, but
to make it extra non-obvious, it's a *hidden* constant, using

        __u32 fsnotify_mask = FS_ACCESS_PERM;

to hide the fact that it's actually calling fsnotify_file() with that
constant argument.

And in fsnotify_open() it's not exactly a constant, but it's kind of
one: when you actually look at fsnotify_file(), it has that "I do a
different filtering event based on mask", and the two different
constants fsnotify_open() uses are actually the same for that mask.

In other words, that whole "mask" test part of fsnotify_file()

        /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
        if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
            !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
                                               FSNOTIFY_PRIO_CONTENT))
                return 0;

mess is actually STATICALLY TRUE OR FALSE, but it's made out to be
somehow an "arghumenty" to the function, and it's really obfuscated.

That is the kind of "helper inline" that I don't want to see in the
new paths. Making that conditional more complicated was part of what I
objected to in one of the patches.

> Those convenience helpers help me to maintain readability and code
> reuse.

ABSOLUTELY NOT.

That "convenience helkper" does exactly the opposite. It explicitly
and actively obfuscates when the actual
fsnotify_sb_has_priority_watchers() filtering is done.

That helper is evil.

Just go and look at the actual uses, let's take
fsnotify_file_area_perm() as an example. As mentioned, as an extra
level of obfuscation, that horrid "helper" function tries to hide how
"mask" is constant by doing

        __u32 fsnotify_mask = FS_ACCESS_PERM;

and then never modifying it, and then doing

        return fsnotify_file(file, fsnotify_mask);

but if you walk through the logic, you now see that ok, that means
that the "mask" conditional fsnotify_file() is actually just

    FS_ACCESS_PERM & ALL_FSNOTIFY_PERM_EVENTS

which is always true, so it means that fsnotify_file_area_perm()
unconditionally does that

    fsnotify_sb_has_priority_watchers(..)

filitering.

And dammit, you shouldn't have to walk through that pointless "helper"
variable, and that pointless "helper" inline function to see that. It
shouldn't be the case that fsnotify_file() does two completely
different things based on a constant argument.

It would have literally been much clearer to just have two explicitly
different versions of that function, *WITHOUT* some kind of
pseudo-conditional that isn't actually a conditional, and just have
fsnotify_file_area_perm() be very explicit about the fact that it uses
the fsnotify_sb_has_priority_watchers() logic.

IOW, that conditional only makes it harder to see what the actual
rules are. For no good reason.

Look, magically for some reason fsnotify_name() could do the same
thing without this kind of silly obfuscation. It just unconditonally
calls fsnotify_sb_has_watchers() to filter the events. No silly games
with doing two entirely different things based on a random constant
argument.

So this is why I say that any new fsnotify events will be NAK'ed and
not merged by me unless it's all obvious, and unless it all obviously
DOES NOT USE these inline garbage "helper" functions.

The new logic had better be very obviously *only* using the
file->f_mode bits, and just calling out-of-line to do the work. If I
have to walk through several layers of inline functions, and look at
what magic arguments those inline functions get just to see what the
hell they actually do, I'm not going to merge it.

Because I'm really tired of actively obfuscated VFS hooks that use
inline functions to hide what the hell they are doing and whether they
are expensive or not.

Your fsnotify_file_range() uses fsnotify_parent(), which is another of
those "it does two different things" functions that either call
fsnotify() on the dentry, or call __fsnotify_parent() on it if it's an
inode, which means that it's another case of "what does this actually
do" which is pointlessly hard to follow, since clearly for a truncate
event it can't be a directory.

And to make matters worse, fsnotify_truncate_perm() actually checks
truncate events for directories and regular files, when truncates
can't actually happen for anything but regular files in the first
place. So  your helper function does a nonsensical cray-cray test that
shouldn't exist.

That makes it another "this is not a helper function, this is just obfuscation".

                 Linus
Amir Goldstein Nov. 13, 2024, 10:35 p.m. UTC | #18
On Wed, Nov 13, 2024 at 10:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 13 Nov 2024 at 11:11, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > >
> > > This whole "add another crazy inline function using another crazy
> > > helper needs to STOP. Later on in the patch series you do
> > >
> >
> > The patch that I sent did add another convenience helper
> > fsnotify_path(), but as long as it is not hiding crazy tests,
> > and does not expand to huge inlined code, I don't see the problem.
>
> So I don't mind adding a new inline function for convenience.
>
> But I do mind the whole "multiple levels of inline functions" model,
> and the thing I _particularly_ hate is the "mask is usually constant
> so that the effect of the inline function is practically two different
> things" as exemplified by "fsnotify_file()" and friends.
>
> At that point, the inline function isn't a helper any more, it's a
> hindrance to understanding what the heck is going on.
>
> Basically, as an example: fsnotify_file() is actually two very
> different things depending on the "mask" argument, an that argument is
> *typically* a constant.
>
> In fact, in fsnotify_file_area_perm() is very much is a constant, but
> to make it extra non-obvious, it's a *hidden* constant, using
>
>         __u32 fsnotify_mask = FS_ACCESS_PERM;
>
> to hide the fact that it's actually calling fsnotify_file() with that
> constant argument.

Yeh, that specific "obfuscation" is a leftover from history.
It is already gone in the patches that we sent.

>
> And in fsnotify_open() it's not exactly a constant, but it's kind of
> one: when you actually look at fsnotify_file(), it has that "I do a
> different filtering event based on mask", and the two different
> constants fsnotify_open() uses are actually the same for that mask.
>
> In other words, that whole "mask" test part of fsnotify_file()
>
>         /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
>         if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
>             !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
>                                                FSNOTIFY_PRIO_CONTENT))
>                 return 0;
>
> mess is actually STATICALLY TRUE OR FALSE, but it's made out to be
> somehow an "arghumenty" to the function, and it's really obfuscated.
>

Yeh. I see that problem absolutely.
This is already gone in the patch that I send you today:
- All the old hooks call fsnotify_file() that only checks FMODE_NONOTIFY
  and calls fsnotify_path()
- The permission hooks now check FMODE_NONOTIFY_PERM
  and call fsnotify_path()

> That is the kind of "helper inline" that I don't want to see in the
> new paths. Making that conditional more complicated was part of what I
> objected to in one of the patches.
>
> > Those convenience helpers help me to maintain readability and code
> > reuse.
>
> ABSOLUTELY NOT.
>
> That "convenience helkper" does exactly the opposite. It explicitly
> and actively obfuscates when the actual
> fsnotify_sb_has_priority_watchers() filtering is done.
>
> That helper is evil.
>
> Just go and look at the actual uses, let's take
> fsnotify_file_area_perm() as an example. As mentioned, as an extra
> level of obfuscation, that horrid "helper" function tries to hide how
> "mask" is constant by doing
>
>         __u32 fsnotify_mask = FS_ACCESS_PERM;
>
> and then never modifying it, and then doing
>
>         return fsnotify_file(file, fsnotify_mask);
>
> but if you walk through the logic, you now see that ok, that means
> that the "mask" conditional fsnotify_file() is actually just
>
>     FS_ACCESS_PERM & ALL_FSNOTIFY_PERM_EVENTS
>
> which is always true, so it means that fsnotify_file_area_perm()
> unconditionally does that
>
>     fsnotify_sb_has_priority_watchers(..)
>
> filitering.
>
> And dammit, you shouldn't have to walk through that pointless "helper"
> variable, and that pointless "helper" inline function to see that. It
> shouldn't be the case that fsnotify_file() does two completely
> different things based on a constant argument.
>

ok. that's going to be history soon.
I will send this cleanup patch regardless of the pre-content series.

> It would have literally been much clearer to just have two explicitly
> different versions of that function, *WITHOUT* some kind of
> pseudo-conditional that isn't actually a conditional, and just have
> fsnotify_file_area_perm() be very explicit about the fact that it uses
> the fsnotify_sb_has_priority_watchers() logic.
>
> IOW, that conditional only makes it harder to see what the actual
> rules are. For no good reason.
>
> Look, magically for some reason fsnotify_name() could do the same
> thing without this kind of silly obfuscation. It just unconditonally
> calls fsnotify_sb_has_watchers() to filter the events. No silly games
> with doing two entirely different things based on a random constant
> argument.
>
> So this is why I say that any new fsnotify events will be NAK'ed and
> not merged by me unless it's all obvious, and unless it all obviously
> DOES NOT USE these inline garbage "helper" functions.
>
> The new logic had better be very obviously *only* using the
> file->f_mode bits, and just calling out-of-line to do the work. If I
> have to walk through several layers of inline functions, and look at
> what magic arguments those inline functions get just to see what the
> hell they actually do, I'm not going to merge it.

Sure for new hooks with new check-on-open semantics that is
going to be easy to do. The historic reason for the heavy inlining
is trying to optimize out indirect calls when we do not have the
luxury of using the check-on-open semantics.

>
> Because I'm really tired of actively obfuscated VFS hooks that use
> inline functions to hide what the hell they are doing and whether they
> are expensive or not.
>
> Your fsnotify_file_range() uses fsnotify_parent(), which is another of
> those "it does two different things" functions that either call
> fsnotify() on the dentry, or call __fsnotify_parent() on it if it's an
> inode, which means that it's another case of "what does this actually
> do" which is pointlessly hard to follow, since clearly for a truncate
> event it can't be a directory.
>
> And to make matters worse, fsnotify_truncate_perm() actually checks
> truncate events for directories and regular files, when truncates
> can't actually happen for anything but regular files in the first
> place. So  your helper function does a nonsensical cray-cray test that
> shouldn't exist.

Ha, right, that's a stupid copy&paste braino.
Easy to fix.

The simplest thing to do for the new hooks I think is to make
fsnotify_file_range() extern and then you won't need to look beyond it,
because it already comes after the unlikley FMODE_NONOTIFY_ bits check.

Will work on the rest of the series following those guidelines.
Let me know if the patch I sent you has taken a wrong direction.

Thanks,
Amir.
Linus Torvalds Nov. 13, 2024, 11:07 p.m. UTC | #19
On Wed, 13 Nov 2024 at 14:35, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Sure for new hooks with new check-on-open semantics that is
> going to be easy to do. The historic reason for the heavy inlining
> is trying to optimize out indirect calls when we do not have the
> luxury of using the check-on-open semantics.

Right. I'm not asking you to fix the old cases - it would be lovely to
do, but I think that's a different story. The compiler *does* figure
out the oddities, so usually generated code doesn't look horrible, but
it's really hard for a human to understand.

And honestly, code that "the compiler can figure out, but ordinary
humans can't" isn't great code.

And hey, we have tons of "isn't great code". Stuff happens. And the
fsnotify code in particular has this really odd history of
inotify/dnotify/unification and the VFS layer also having been
modified under it and becoming much more complex.

I really wish we could just throw some of the legacy cases away. Oh well.

But because I'm very sensitive to the VFS layer core code, and partly
*because* we have this bad history of horridness here (and
particularly in the security hooks), I just want to make really sure
that the new cases do *not* use the same completely incomprehensible
model with random conditionals that make no sense.

So that's why I then react so strongly to some of this.

Put another way: I'm not expecting the fsnotify_file() and
fsnotify_parent() horror to go away. But I *am* expecting new
interfaces to not use them, and not write new code like that again.

                  Linus
Jan Kara Nov. 14, 2024, 3:01 p.m. UTC | #20
On Wed 13-11-24 19:49:31, Amir Goldstein wrote:
> From 7a2cd74654a53684d545b96c57c9091e420b3add Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Tue, 12 Nov 2024 13:46:08 +0100
> Subject: [PATCH] fsnotify: opt-in for permission events at file open time
> 
> Legacy inotify/fanotify listeners can add watches for events on inode,
> parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> were already open at the time of setting up the watches.
> 
> fanotify permission events are typically used by Anti-malware sofware,
> that is watching the entire mount and it is not common to have more that
> one Anti-malware engine installed on a system.
> 
> To reduce the overhead of the fsnotify_file_perm() hooks on every file
> access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> events only if there were *any* permission event listeners on the
> filesystem at the time that the file was opened.
> 
> The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> events types to report.
> 
> This is going to apply to the new fanotify pre-content events in order
> to reduce the cost of the new pre-content event vfs hooks.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Couple of notes below.

> diff --git a/fs/open.c b/fs/open.c
> index 226aca8c7909..194c2c8d8cd4 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -901,7 +901,7 @@ static int do_dentry_open(struct file *f,
>  	f->f_sb_err = file_sample_sb_err(f);
>  
>  	if (unlikely(f->f_flags & O_PATH)) {
> -		f->f_mode = FMODE_PATH | FMODE_OPENED;
> +		f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_NONOTIFY;
>  		f->f_op = &empty_fops;
>  		return 0;
>  	}
> @@ -929,6 +929,12 @@ static int do_dentry_open(struct file *f,
>  	if (error)
>  		goto cleanup_all;
>  
> +	/*
> +	 * Set FMODE_NONOTIFY_* bits according to existing permission watches.
> +	 * If FMODE_NONOTIFY was already set for an fanotify fd, this doesn't
> +	 * change anything.
> +	 */
> +	f->f_mode |= fsnotify_file_mode(f);

Maybe it would be obvious to do this like:

	file_set_fsnotify_mode(f);

Because currently this depends on the details of how exactly FMODE_NONOTIFY
is encoded.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 70359dd669ff..dd583ce7dba8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  
>  #define	FMODE_NOREUSE		((__force fmode_t)(1 << 23))
>  
> -/* FMODE_* bit 24 */
> -
>  /* File is embedded in backing_file object */
> -#define FMODE_BACKING		((__force fmode_t)(1 << 25))
> +#define FMODE_BACKING		((__force fmode_t)(1 << 24))
> +
> +/* File shouldn't generate fanotify pre-content events */
> +#define FMODE_NONOTIFY_HSM	((__force fmode_t)(1 << 25))
>  
> -/* File was opened by fanotify and shouldn't generate fanotify events */
> -#define FMODE_NONOTIFY		((__force fmode_t)(1 << 26))
> +/* File shouldn't generate fanotify permission events */
> +#define FMODE_NONOTIFY_PERM	((__force fmode_t)(1 << 26))
>  
>  /* File is capable of returning -EAGAIN if I/O will block */
>  #define FMODE_NOWAIT		((__force fmode_t)(1 << 27))
> @@ -190,6 +191,21 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)(1 << 29))
>  
> +/*
> + * The two FMODE_NONOTIFY_ bits used together have a special meaning of
> + * not reporting any events at all including non-permission events.
> + * These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
> + *
> + * FMODE_NONOTIFY_HSM - suppress only pre-content events.
> + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> + */
> +#define FMODE_NONOTIFY_MASK \
> +	(FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
> +#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
> +#define FMODE_NOTIFY(mode) \
> +	((mode) & FMODE_NONOTIFY_MASK)

This looks a bit error-prone to me (FMODE_NONOTIFY looks like another FMODE
flag but in fact it is not which is an invitation for subtle bugs) and the
tests below which are sometimes done as (FMODE_NOTIFY(mode) == xxx) and
sometimes as (file->f_mode & xxx) are inconsistent and confusing (unless you
understand what's happening under the hood).

So how about defining macros like FMODE_FSNOTIFY_NORMAL(),
FMODE_FSNOTIFY_CONTENT() and FMODE_FSNOTIFY_PRE_CONTENT() which evaluate to
true if we should be sending normal/content/pre-content events to the file.
With appropriate comments this should make things more obvious.

								Honza
diff mbox series

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 316eec309299..cab5a1a16e57 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -626,7 +626,7 @@  static __init int fsnotify_init(void)
 {
 	int ret;
 
-	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
+	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 24);
 
 	ret = init_srcu_struct(&fsnotify_mark_srcu);
 	if (ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9b58e9887e4b..ee0637fcb197 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1232,6 +1232,7 @@  extern int send_sigurg(struct file *file);
 #define SB_I_RETIRED	0x00000800	/* superblock shouldn't be reused */
 #define SB_I_NOUMASK	0x00001000	/* VFS does not apply umask */
 #define SB_I_NOIDMAP	0x00002000	/* No idmapped mounts on this superblock */
+#define SB_I_ALLOW_HSM	0x00004000	/* Allow HSM events on this superblock */
 
 /* Possible states of 'frozen' field */
 enum {
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f0fd3dcae654..0f44cd60ac9a 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -154,14 +154,29 @@  static inline int fsnotify_file(struct file *file, __u32 mask)
 }
 
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+static inline int fsnotify_pre_content(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	/*
+	 * Pre-content events are only reported for regular files and dirs
+	 * if there are any pre-content event watchers on this sb.
+	 */
+	if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
+	    !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
+	    !fsnotify_sb_has_priority_watchers(inode->i_sb,
+					       FSNOTIFY_PRIO_PRE_CONTENT))
+		return 0;
+
+	return fsnotify_file(file, FS_PRE_ACCESS);
+}
+
 /*
- * fsnotify_file_area_perm - permission hook before access to file range
+ * fsnotify_file_area_perm - permission hook before access of file range
  */
 static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 					  const loff_t *ppos, size_t count)
 {
-	__u32 fsnotify_mask = FS_ACCESS_PERM;
-
 	/*
 	 * filesystem may be modified in the context of permission events
 	 * (e.g. by HSM filling a file on access), so sb freeze protection
@@ -169,10 +184,24 @@  static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	 */
 	lockdep_assert_once(file_write_not_started(file));
 
+	/*
+	 * read()/write and other types of access generate pre-content events.
+	 */
+	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+		int ret = fsnotify_pre_content(file);
+
+		if (ret)
+			return ret;
+	}
+
 	if (!(perm_mask & MAY_READ))
 		return 0;
 
-	return fsnotify_file(file, fsnotify_mask);
+	/*
+	 * read() also generates the legacy FS_ACCESS_PERM event, so content
+	 * scanners can inspect the content filled by pre-content event.
+	 */
+	return fsnotify_file(file, FS_ACCESS_PERM);
 }
 
 /*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 53d5d0e02943..9bda354b5538 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -57,6 +57,8 @@ 
 #define FS_OPEN_EXEC_PERM	0x00040000	/* open/exec event in a permission hook */
 /* #define FS_DIR_MODIFY	0x00080000 */	/* Deprecated (reserved) */
 
+#define FS_PRE_ACCESS		0x00100000	/* Pre-content access hook */
+
 /*
  * Set on inode mark that cares about things that happen to its children.
  * Always set for dnotify and inotify.
@@ -78,8 +80,14 @@ 
  */
 #define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
 
-#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
-				  FS_OPEN_EXEC_PERM)
+/* Content events can be used to inspect file content */
+#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
+				      FS_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FSNOTIFY_PRE_CONTENT_EVENTS  (FS_PRE_ACCESS)
+
+#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
+				  FSNOTIFY_PRE_CONTENT_EVENTS)
 
 /*
  * This is a list of all events that may get sent to a parent that is watching
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fc926d3cac6e..c6f38705c715 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3404,7 +3404,8 @@  static int selinux_path_notify(const struct path *path, u64 mask,
 		perm |= FILE__WATCH_WITH_PERM;
 
 	/* watches on read-like events need the file:watch_reads permission */
-	if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_CLOSE_NOWRITE))
+	if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_PRE_ACCESS |
+		    FS_CLOSE_NOWRITE))
 		perm |= FILE__WATCH_READS;
 
 	return path_has_perm(current_cred(), path, perm);