diff mbox series

[v6,09/21] fsnotify: Allow events reported with an empty inode

Message ID 20210812214010.3197279-10-krisman@collabora.com (mailing list archive)
State New, archived
Headers show
Series File system wide monitoring | expand

Commit Message

Gabriel Krisman Bertazi Aug. 12, 2021, 9:39 p.m. UTC
Some file system events (i.e. FS_ERROR) might not be associated with an
inode.  For these, it makes sense to associate them directly with the
super block of the file system they apply to.  This patch allows the
event to be reported with a NULL inode, by recovering the superblock
directly from the data field, if needed.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

--
Changes since v5:
  - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
---
 fs/notify/fsnotify.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Comments

Amir Goldstein Aug. 13, 2021, 7:58 a.m. UTC | #1
On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Some file system events (i.e. FS_ERROR) might not be associated with an
> inode.  For these, it makes sense to associate them directly with the
> super block of the file system they apply to.  This patch allows the
> event to be reported with a NULL inode, by recovering the superblock
> directly from the data field, if needed.
>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>
> --
> Changes since v5:
>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> ---
>  fs/notify/fsnotify.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 30d422b8c0fc..536db02cb26e 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
>         fsnotify_clear_marks_by_sb(sb);
>  }
>
> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> +{
> +       struct inode *inode = fsnotify_data_inode(data, data_type);
> +       struct super_block *sb = inode ? inode->i_sb : NULL;
> +
> +       return sb;
> +}
> +
>  /*
>   * Given an inode, first check if we care what happens to our children.  Inotify
>   * and dnotify both tell their parents about events.  If we care about any event
> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>   *             @file_name is relative to
>   * @file_name: optional file name associated with event
>   * @inode:     optional inode associated with event -
> - *             either @dir or @inode must be non-NULL.
> - *             if both are non-NULL event may be reported to both.
> + *             If @dir and @inode are NULL, @data must have a type that
> + *             allows retrieving the file system associated with this

Irrelevant comment. sb must always be available from @data.

> + *             event.  if both are non-NULL event may be reported to
> + *             both.
>   * @cookie:    inotify rename cookie
>   */
>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>                  */
>                 parent = dir;
>         }
> -       sb = inode->i_sb;
> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);

        const struct path *path = fsnotify_data_path(data, data_type);
+       const struct super_block *sb = fsnotify_data_sb(data, data_type);

All the games with @data @inode and @dir args are irrelevant to this.
sb should always be available from @data and it does not matter
if fsnotify_data_inode() is the same as @inode, @dir or neither.
All those inodes are anyway on the same sb.

Thanks,
Amir.
Gabriel Krisman Bertazi Aug. 25, 2021, 6:40 p.m. UTC | #2
Amir Goldstein <amir73il@gmail.com> writes:

> On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>>
>> Some file system events (i.e. FS_ERROR) might not be associated with an
>> inode.  For these, it makes sense to associate them directly with the
>> super block of the file system they apply to.  This patch allows the
>> event to be reported with a NULL inode, by recovering the superblock
>> directly from the data field, if needed.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>>
>> --
>> Changes since v5:
>>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
>> ---
>>  fs/notify/fsnotify.c | 16 +++++++++++++---
>>  1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index 30d422b8c0fc..536db02cb26e 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
>>         fsnotify_clear_marks_by_sb(sb);
>>  }
>>
>> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
>> +{
>> +       struct inode *inode = fsnotify_data_inode(data, data_type);
>> +       struct super_block *sb = inode ? inode->i_sb : NULL;
>> +
>> +       return sb;
>> +}
>> +
>>  /*
>>   * Given an inode, first check if we care what happens to our children.  Inotify
>>   * and dnotify both tell their parents about events.  If we care about any event
>> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>>   *             @file_name is relative to
>>   * @file_name: optional file name associated with event
>>   * @inode:     optional inode associated with event -
>> - *             either @dir or @inode must be non-NULL.
>> - *             if both are non-NULL event may be reported to both.
>> + *             If @dir and @inode are NULL, @data must have a type that
>> + *             allows retrieving the file system associated with this
>
> Irrelevant comment. sb must always be available from @data.
>
>> + *             event.  if both are non-NULL event may be reported to
>> + *             both.
>>   * @cookie:    inotify rename cookie
>>   */
>>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>>                  */
>>                 parent = dir;
>>         }
>> -       sb = inode->i_sb;
>> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
>
>         const struct path *path = fsnotify_data_path(data, data_type);
> +       const struct super_block *sb = fsnotify_data_sb(data, data_type);
>
> All the games with @data @inode and @dir args are irrelevant to this.
> sb should always be available from @data and it does not matter
> if fsnotify_data_inode() is the same as @inode, @dir or neither.
> All those inodes are anyway on the same sb.

Hi Amir,

I think this is actually necessary.  I could identify at least one event
(FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
In that case, fsnotify_dirent is called with a negative dentry from
vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
mkdir, but it would be possible we are racing with the file removal, I
guess?  It might be a bug in fsnotify that this case even happen, but
I'm not sure yet.

The easiest way around it is what I proposed: just use inode->i_sb if
data is NULL.  Since, as you said, data, inode and dir are all on the
same superblock, it should work, I think.
Amir Goldstein Aug. 25, 2021, 7:45 p.m. UTC | #3
On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> >>
> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> >> inode.  For these, it makes sense to associate them directly with the
> >> super block of the file system they apply to.  This patch allows the
> >> event to be reported with a NULL inode, by recovering the superblock
> >> directly from the data field, if needed.
> >>
> >> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> >>
> >> --
> >> Changes since v5:
> >>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> >> ---
> >>  fs/notify/fsnotify.c | 16 +++++++++++++---
> >>  1 file changed, 13 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> index 30d422b8c0fc..536db02cb26e 100644
> >> --- a/fs/notify/fsnotify.c
> >> +++ b/fs/notify/fsnotify.c
> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> >>         fsnotify_clear_marks_by_sb(sb);
> >>  }
> >>
> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> >> +{
> >> +       struct inode *inode = fsnotify_data_inode(data, data_type);
> >> +       struct super_block *sb = inode ? inode->i_sb : NULL;
> >> +
> >> +       return sb;
> >> +}
> >> +
> >>  /*
> >>   * Given an inode, first check if we care what happens to our children.  Inotify
> >>   * and dnotify both tell their parents about events.  If we care about any event
> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >>   *             @file_name is relative to
> >>   * @file_name: optional file name associated with event
> >>   * @inode:     optional inode associated with event -
> >> - *             either @dir or @inode must be non-NULL.
> >> - *             if both are non-NULL event may be reported to both.
> >> + *             If @dir and @inode are NULL, @data must have a type that
> >> + *             allows retrieving the file system associated with this
> >
> > Irrelevant comment. sb must always be available from @data.
> >
> >> + *             event.  if both are non-NULL event may be reported to
> >> + *             both.
> >>   * @cookie:    inotify rename cookie
> >>   */
> >>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >>                  */
> >>                 parent = dir;
> >>         }
> >> -       sb = inode->i_sb;
> >> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> >
> >         const struct path *path = fsnotify_data_path(data, data_type);
> > +       const struct super_block *sb = fsnotify_data_sb(data, data_type);
> >
> > All the games with @data @inode and @dir args are irrelevant to this.
> > sb should always be available from @data and it does not matter
> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> > All those inodes are anyway on the same sb.
>
> Hi Amir,
>
> I think this is actually necessary.  I could identify at least one event
> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> In that case, fsnotify_dirent is called with a negative dentry from
> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the

That doesn't sound right at all.
Are you sure about this?
Which filesystem was this mkdir called on?

> mkdir, but it would be possible we are racing with the file removal, I

No. Both vfs_mkdir() and vfs_rmdir() hold the dir inode lock (on parent).

> guess?  It might be a bug in fsnotify that this case even happen, but
> I'm not sure yet.

fsnotify_data_inode() should not be NULL.
fsnotify_handle_inode_event() passes fsnotify_data_inode() to
callbacks like audit_watch_handle_event() that checks
WARN_ON_ONCE(!inode)

>
> The easiest way around it is what I proposed: just use inode->i_sb if
> data is NULL.  Since, as you said, data, inode and dir are all on the
> same superblock, it should work, I think.
>

It would be papering over another issue.
I don't mind if we use inode->i_sb as long as we understand the reason
for what you are reporting.

Please provide more information.

Thanks,
Amir.
Gabriel Krisman Bertazi Aug. 25, 2021, 9:50 p.m. UTC | #4
Amir Goldstein <amir73il@gmail.com> writes:

> On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>>
>> Amir Goldstein <amir73il@gmail.com> writes:
>>
>> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
>> > <krisman@collabora.com> wrote:
>> >>
>> >> Some file system events (i.e. FS_ERROR) might not be associated with an
>> >> inode.  For these, it makes sense to associate them directly with the
>> >> super block of the file system they apply to.  This patch allows the
>> >> event to be reported with a NULL inode, by recovering the superblock
>> >> directly from the data field, if needed.
>> >>
>> >> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>> >>
>> >> --
>> >> Changes since v5:
>> >>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
>> >> ---
>> >>  fs/notify/fsnotify.c | 16 +++++++++++++---
>> >>  1 file changed, 13 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> >> index 30d422b8c0fc..536db02cb26e 100644
>> >> --- a/fs/notify/fsnotify.c
>> >> +++ b/fs/notify/fsnotify.c
>> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
>> >>         fsnotify_clear_marks_by_sb(sb);
>> >>  }
>> >>
>> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
>> >> +{
>> >> +       struct inode *inode = fsnotify_data_inode(data, data_type);
>> >> +       struct super_block *sb = inode ? inode->i_sb : NULL;
>> >> +
>> >> +       return sb;
>> >> +}
>> >> +
>> >>  /*
>> >>   * Given an inode, first check if we care what happens to our children.  Inotify
>> >>   * and dnotify both tell their parents about events.  If we care about any event
>> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>> >>   *             @file_name is relative to
>> >>   * @file_name: optional file name associated with event
>> >>   * @inode:     optional inode associated with event -
>> >> - *             either @dir or @inode must be non-NULL.
>> >> - *             if both are non-NULL event may be reported to both.
>> >> + *             If @dir and @inode are NULL, @data must have a type that
>> >> + *             allows retrieving the file system associated with this
>> >
>> > Irrelevant comment. sb must always be available from @data.
>> >
>> >> + *             event.  if both are non-NULL event may be reported to
>> >> + *             both.
>> >>   * @cookie:    inotify rename cookie
>> >>   */
>> >>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> >>                  */
>> >>                 parent = dir;
>> >>         }
>> >> -       sb = inode->i_sb;
>> >> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
>> >
>> >         const struct path *path = fsnotify_data_path(data, data_type);
>> > +       const struct super_block *sb = fsnotify_data_sb(data, data_type);
>> >
>> > All the games with @data @inode and @dir args are irrelevant to this.
>> > sb should always be available from @data and it does not matter
>> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
>> > All those inodes are anyway on the same sb.
>>
>> Hi Amir,
>>
>> I think this is actually necessary.  I could identify at least one event
>> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
>> In that case, fsnotify_dirent is called with a negative dentry from
>> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
>
> That doesn't sound right at all.
> Are you sure about this?
> Which filesystem was this mkdir called on?

You should be able to reproduce it on top of mainline if you pick only this
patch and do the change you suggested:

 -       sb = inode->i_sb;
 +       sb = fsnotify_data_sb(data, data_type);

And then boot a Debian stable with systemd.  The notification happens on
the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
by systemd itself.  The event that arrives with a NULL data is telling the
directory /sys/fs/cgroup/*/ about the creation of directory
`init.scope`.

The change above triggers the following null dereference of struct
super_block, since data is NULL.

I will keep looking but you might be able to answer it immediately...

fsnotify was called with:
  data_type=2
  mask=40000100
  data=0
  name=init.scope

The code looks like this:

        fsnotify_mkdir(dir, dentry) {
       	  fsnotify_dirent(inode, dentry, FS_CREATE | FS_ISDIR) {
	    fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0) {
	      fsnotify(mask, child, FSNOTIFY_EVENT_INODE, dir, name, NULL, cookie);
	    }
	  }
	}

The entire log:

BUG: kernel NULL pointer dereference, address: 00000000000003a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 1 Comm: systemd Not tainted 5.14.0-rc5- #279
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:fsnotify+0x103/0x5e0
Code: 84 c2 04 00 00 83 f8 02 0f 85 c6 04 00 00 48 8b 04 24 31 db 48 85 c0 0f 84 b9 04 00 00 48 8b 48 28 48 85 c9 0f 84 ac 04 00 00 <48> 83 b9 a8 03 00 00 00 0f 84 e3 03 00 00 8b 81 a0 03 00 00 48 85
RSP: 0018:ffffaff800013e18 EFLAGS: 00010246
RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffaff800013ca8 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000ffffefff
R10: 0000000000000001 R11: ffffaff800013c48 R12: ffff9bbb80467778
R13: 0000000040000100 R14: 00000000000001ed R15: 0000000000000000
FS:  00007f2e4d2c4900(0000) GS:ffff9bbbbbd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000003a8 CR3: 0000000100054000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? cgroup_kn_unlock+0x33/0x80
 ? cgroup_mkdir+0x13e/0x410
 vfs_mkdir+0x16e/0x1d0
 do_mkdirat+0x8c/0x100
 do_syscall_64+0x3a/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2e4da91b07
Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc02877ad8 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00007ffc02877b50 RCX: 00007f2e4da91b07
RDX: 00007ffc02877980 RSI: 00000000000001ed RDI: 00005646df5f01d0
RBP: 00005646de9ed770 R08: 0000000000000001 R09: 0000000000000000
R10: 00005646df5f01c0 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffc02877b50 R15: 00007ffc02877c60
Modules linked in:
CR2: 00000000000003a8
---[ end trace 4642e1d1df9669cb ]---

>
>> mkdir, but it would be possible we are racing with the file removal, I
>
> No. Both vfs_mkdir() and vfs_rmdir() hold the dir inode lock (on
> parent).
>
>> guess?  It might be a bug in fsnotify that this case even happen, but
>> I'm not sure yet.
>
> fsnotify_data_inode() should not be NULL.
> fsnotify_handle_inode_event() passes fsnotify_data_inode() to
> callbacks like audit_watch_handle_event() that checks
> WARN_ON_ONCE(!inode)
>
>>
>> The easiest way around it is what I proposed: just use inode->i_sb if
>> data is NULL.  Since, as you said, data, inode and dir are all on the
>> same superblock, it should work, I think.
>>
>
> It would be papering over another issue.
> I don't mind if we use inode->i_sb as long as we understand the reason
> for what you are reporting.
>
> Please provide more information.
Amir Goldstein Aug. 26, 2021, 10:44 a.m. UTC | #5
On Thu, Aug 26, 2021 at 12:50 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> >>
> >> Amir Goldstein <amir73il@gmail.com> writes:
> >>
> >> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> >> > <krisman@collabora.com> wrote:
> >> >>
> >> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> >> >> inode.  For these, it makes sense to associate them directly with the
> >> >> super block of the file system they apply to.  This patch allows the
> >> >> event to be reported with a NULL inode, by recovering the superblock
> >> >> directly from the data field, if needed.
> >> >>
> >> >> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> >> >>
> >> >> --
> >> >> Changes since v5:
> >> >>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> >> >> ---
> >> >>  fs/notify/fsnotify.c | 16 +++++++++++++---
> >> >>  1 file changed, 13 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> >> index 30d422b8c0fc..536db02cb26e 100644
> >> >> --- a/fs/notify/fsnotify.c
> >> >> +++ b/fs/notify/fsnotify.c
> >> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> >> >>         fsnotify_clear_marks_by_sb(sb);
> >> >>  }
> >> >>
> >> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> >> >> +{
> >> >> +       struct inode *inode = fsnotify_data_inode(data, data_type);
> >> >> +       struct super_block *sb = inode ? inode->i_sb : NULL;
> >> >> +
> >> >> +       return sb;
> >> >> +}
> >> >> +
> >> >>  /*
> >> >>   * Given an inode, first check if we care what happens to our children.  Inotify
> >> >>   * and dnotify both tell their parents about events.  If we care about any event
> >> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >> >>   *             @file_name is relative to
> >> >>   * @file_name: optional file name associated with event
> >> >>   * @inode:     optional inode associated with event -
> >> >> - *             either @dir or @inode must be non-NULL.
> >> >> - *             if both are non-NULL event may be reported to both.
> >> >> + *             If @dir and @inode are NULL, @data must have a type that
> >> >> + *             allows retrieving the file system associated with this
> >> >
> >> > Irrelevant comment. sb must always be available from @data.
> >> >
> >> >> + *             event.  if both are non-NULL event may be reported to
> >> >> + *             both.
> >> >>   * @cookie:    inotify rename cookie
> >> >>   */
> >> >>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> >>                  */
> >> >>                 parent = dir;
> >> >>         }
> >> >> -       sb = inode->i_sb;
> >> >> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> >> >
> >> >         const struct path *path = fsnotify_data_path(data, data_type);
> >> > +       const struct super_block *sb = fsnotify_data_sb(data, data_type);
> >> >
> >> > All the games with @data @inode and @dir args are irrelevant to this.
> >> > sb should always be available from @data and it does not matter
> >> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> >> > All those inodes are anyway on the same sb.
> >>
> >> Hi Amir,
> >>
> >> I think this is actually necessary.  I could identify at least one event
> >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> >> In that case, fsnotify_dirent is called with a negative dentry from
> >> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
> >
> > That doesn't sound right at all.
> > Are you sure about this?
> > Which filesystem was this mkdir called on?
>
> You should be able to reproduce it on top of mainline if you pick only this
> patch and do the change you suggested:
>
>  -       sb = inode->i_sb;
>  +       sb = fsnotify_data_sb(data, data_type);
>
> And then boot a Debian stable with systemd.  The notification happens on
> the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> by systemd itself.  The event that arrives with a NULL data is telling the
> directory /sys/fs/cgroup/*/ about the creation of directory
> `init.scope`.
>
> The change above triggers the following null dereference of struct
> super_block, since data is NULL.
>
> I will keep looking but you might be able to answer it immediately...

Yes, I see what is going on.

cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
negative dentries to force re-lookup to find the inode.

Documentation/filesystems/vfs.rst says on create() and friends:
"...you will probably call d_instantiate() with the dentry and the
  newly created inode..."

So this behavior seems legit.
Meaning that we have made a wrong assumption in fsnotify_create()
and fsnotify_mkdir().
Please note the comment above fsnotify_link() which anticipates
negative dentries.

I've audited the fsnotify backends and it seems that the
WARN_ON(!inode) in kernel/audit_* is the only immediate implication
of negative dentry with FS_CREATE.
I am the one who added these WARN_ON(), so I will remove them.
I think that missing inode in an FS_CREATE event really breaks
audit on kernfs, but not sure if that is a valid use case (Paul?).

Anyway, regarding your patch, I still prefer the solution proposed by Jan,
but not with a different implementation of fsnotify_data_sb().

Please see branch fsnotify_data_sb[1] with the proposed fixes.
The fixes assert the statement that "sb should always be available
from @data", regardless of kernfs anomaly.

If this works for you, please prepend those patches to your next
submission.

Regarding the state of this patch set in general, I must admit that
I wasn't able to follow if a conclusion was reached about the lifetime
management of fsnotify_error_event and associated sb mark.
Jan is going out on vacation and I think there is little point in spinning
another patch set revision before this issue is settled with Jan.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fsnotify_data_sb
Paul Moore Aug. 27, 2021, 2:26 a.m. UTC | #6
On Thu, Aug 26, 2021 at 6:45 AM Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Aug 26, 2021 at 12:50 AM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
> >
> > Amir Goldstein <amir73il@gmail.com> writes:
> >
> > > On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> > > <krisman@collabora.com> wrote:
> > >>
> > >> Amir Goldstein <amir73il@gmail.com> writes:
> > >>
> > >> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> > >> > <krisman@collabora.com> wrote:
> > >> >>
> > >> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> > >> >> inode.  For these, it makes sense to associate them directly with the
> > >> >> super block of the file system they apply to.  This patch allows the
> > >> >> event to be reported with a NULL inode, by recovering the superblock
> > >> >> directly from the data field, if needed.
> > >> >>
> > >> >> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> > >> >>
> > >> >> --
> > >> >> Changes since v5:
> > >> >>   - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> > >> >> ---
> > >> >>  fs/notify/fsnotify.c | 16 +++++++++++++---
> > >> >>  1 file changed, 13 insertions(+), 3 deletions(-)
> > >> >>
> > >> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> > >> >> index 30d422b8c0fc..536db02cb26e 100644
> > >> >> --- a/fs/notify/fsnotify.c
> > >> >> +++ b/fs/notify/fsnotify.c
> > >> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> > >> >>         fsnotify_clear_marks_by_sb(sb);
> > >> >>  }
> > >> >>
> > >> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> > >> >> +{
> > >> >> +       struct inode *inode = fsnotify_data_inode(data, data_type);
> > >> >> +       struct super_block *sb = inode ? inode->i_sb : NULL;
> > >> >> +
> > >> >> +       return sb;
> > >> >> +}
> > >> >> +
> > >> >>  /*
> > >> >>   * Given an inode, first check if we care what happens to our children.  Inotify
> > >> >>   * and dnotify both tell their parents about events.  If we care about any event
> > >> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> > >> >>   *             @file_name is relative to
> > >> >>   * @file_name: optional file name associated with event
> > >> >>   * @inode:     optional inode associated with event -
> > >> >> - *             either @dir or @inode must be non-NULL.
> > >> >> - *             if both are non-NULL event may be reported to both.
> > >> >> + *             If @dir and @inode are NULL, @data must have a type that
> > >> >> + *             allows retrieving the file system associated with this
> > >> >
> > >> > Irrelevant comment. sb must always be available from @data.
> > >> >
> > >> >> + *             event.  if both are non-NULL event may be reported to
> > >> >> + *             both.
> > >> >>   * @cookie:    inotify rename cookie
> > >> >>   */
> > >> >>  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > >> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > >> >>                  */
> > >> >>                 parent = dir;
> > >> >>         }
> > >> >> -       sb = inode->i_sb;
> > >> >> +       sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> > >> >
> > >> >         const struct path *path = fsnotify_data_path(data, data_type);
> > >> > +       const struct super_block *sb = fsnotify_data_sb(data, data_type);
> > >> >
> > >> > All the games with @data @inode and @dir args are irrelevant to this.
> > >> > sb should always be available from @data and it does not matter
> > >> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> > >> > All those inodes are anyway on the same sb.
> > >>
> > >> Hi Amir,
> > >>
> > >> I think this is actually necessary.  I could identify at least one event
> > >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> > >> In that case, fsnotify_dirent is called with a negative dentry from
> > >> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
> > >
> > > That doesn't sound right at all.
> > > Are you sure about this?
> > > Which filesystem was this mkdir called on?
> >
> > You should be able to reproduce it on top of mainline if you pick only this
> > patch and do the change you suggested:
> >
> >  -       sb = inode->i_sb;
> >  +       sb = fsnotify_data_sb(data, data_type);
> >
> > And then boot a Debian stable with systemd.  The notification happens on
> > the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> > by systemd itself.  The event that arrives with a NULL data is telling the
> > directory /sys/fs/cgroup/*/ about the creation of directory
> > `init.scope`.
> >
> > The change above triggers the following null dereference of struct
> > super_block, since data is NULL.
> >
> > I will keep looking but you might be able to answer it immediately...
>
> Yes, I see what is going on.
>
> cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
> the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
> negative dentries to force re-lookup to find the inode.
>
> Documentation/filesystems/vfs.rst says on create() and friends:
> "...you will probably call d_instantiate() with the dentry and the
>   newly created inode..."
>
> So this behavior seems legit.
> Meaning that we have made a wrong assumption in fsnotify_create()
> and fsnotify_mkdir().
> Please note the comment above fsnotify_link() which anticipates
> negative dentries.
>
> I've audited the fsnotify backends and it seems that the
> WARN_ON(!inode) in kernel/audit_* is the only immediate implication
> of negative dentry with FS_CREATE.
> I am the one who added these WARN_ON(), so I will remove them.
> I think that missing inode in an FS_CREATE event really breaks
> audit on kernfs, but not sure if that is a valid use case (Paul?).

While it is tempting to ignore kernfs from an audit filesystem watch
perspective, I can see admins potentially wanting to watch
kernfs/cgroupfs/other-config-pseudofs to detect who is potentially
playing with the system config.  Arguably the most important config
changes would already be audited if they were security relevant, but I
could also see an admin wanting to watch for *any* changes so it's
probably best not to rule out a kernfs based watch right now.

I'm sure I'm missing some details, but from what I gather from the
portion of the thread that I'm seeing, it looks like the audit issue
lies in audit_mark_handle_event() and audit_watch_handle_event().  In
both cases it looks like the functions are at least safe with a NULL
inode pointer, even with the WARN_ON() removed; the problem being that
the mark and watch will not be updated with the device and inode
number which means the audit filters based on those marks/watches will
not trigger.  Is that about right or did I read the thread and code a
bit too quickly?

Working under the assumption that the above is close enough to
correct, that is a bit of a problem as it means audit can't
effectively watch kernfs based filesystems, although it sounds like it
wasn't really working properly to begin with, yes?  Before I start
thinking too hard about this, does anyone already have a great idea to
fix this that they want to share?
Amir Goldstein Aug. 27, 2021, 9:36 a.m. UTC | #7
[Fork new thread from:
https://lore.kernel.org/linux-fsdevel/CAHC9VhT9SE6+kLYBh2d7CW5N6RCr=_ryK+ncGvqYJ51B7_egPA@mail.gmail.com/
and shrink CC list]

> > > >> Hi Amir,
> > > >>
> > > >> I think this is actually necessary.  I could identify at least one event
> > > >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> > > >> In that case, fsnotify_dirent is called with a negative dentry from
> > > >> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
> > > >
> > > > That doesn't sound right at all.
> > > > Are you sure about this?
> > > > Which filesystem was this mkdir called on?
> > >
> > > You should be able to reproduce it on top of mainline if you pick only this
> > > patch and do the change you suggested:
> > >
> > >  -       sb = inode->i_sb;
> > >  +       sb = fsnotify_data_sb(data, data_type);
> > >
> > > And then boot a Debian stable with systemd.  The notification happens on
> > > the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> > > by systemd itself.  The event that arrives with a NULL data is telling the
> > > directory /sys/fs/cgroup/*/ about the creation of directory
> > > `init.scope`.
> > >
> > > The change above triggers the following null dereference of struct
> > > super_block, since data is NULL.
> > >
> > > I will keep looking but you might be able to answer it immediately...
> >
> > Yes, I see what is going on.
> >
> > cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
> > the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
> > negative dentries to force re-lookup to find the inode.
> >
> > Documentation/filesystems/vfs.rst says on create() and friends:
> > "...you will probably call d_instantiate() with the dentry and the
> >   newly created inode..."
> >
> > So this behavior seems legit.
> > Meaning that we have made a wrong assumption in fsnotify_create()
> > and fsnotify_mkdir().
> > Please note the comment above fsnotify_link() which anticipates
> > negative dentries.
> >
> > I've audited the fsnotify backends and it seems that the
> > WARN_ON(!inode) in kernel/audit_* is the only immediate implication
> > of negative dentry with FS_CREATE.
> > I am the one who added these WARN_ON(), so I will remove them.
> > I think that missing inode in an FS_CREATE event really breaks
> > audit on kernfs, but not sure if that is a valid use case (Paul?).
>
> While it is tempting to ignore kernfs from an audit filesystem watch
> perspective, I can see admins potentially wanting to watch
> kernfs/cgroupfs/other-config-pseudofs to detect who is potentially
> playing with the system config.  Arguably the most important config
> changes would already be audited if they were security relevant, but I
> could also see an admin wanting to watch for *any* changes so it's
> probably best not to rule out a kernfs based watch right now.
>
> I'm sure I'm missing some details, but from what I gather from the
> portion of the thread that I'm seeing, it looks like the audit issue
> lies in audit_mark_handle_event() and audit_watch_handle_event().  In
> both cases it looks like the functions are at least safe with a NULL
> inode pointer, even with the WARN_ON() removed; the problem being that

Correct. They are safe.

> the mark and watch will not be updated with the device and inode
> number which means the audit filters based on those marks/watches will
> not trigger.  Is that about right or did I read the thread and code a
> bit too quickly?
>

That is also my understanding of the code although I must admit
I did not try to test this setup.

> Working under the assumption that the above is close enough to
> correct, that is a bit of a problem as it means audit can't
> effectively watch kernfs based filesystems, although it sounds like it
> wasn't really working properly to begin with, yes?  Before I start

Correct. It seems it was always like that (I did not check history of kernfs)
but do note that users can set audit rules on specific kernfs directories or
files, it's only recursive rules on subtree may not work as expected

> thinking too hard about this, does anyone already have a great idea to
> fix this that they want to share?
>

One idea I had, it may not be a great idea, but I'll share it anyway :)

Introduce an event FS_INSTANTIATE that will only be exposed to
internal kernel fsnotify backends. It triggers when an inode is linked
into the dentry cache as a result of lookup() or mkdir() or whatever.

So for example, in case of audit watch on kernfs,
audit_watch_handle_event() will miss the FS_CREATE event, but
on the first user access to the new created directory, audit will get
the FS_INSTANTIATE event with child inode and be able to setup
the recursive watch rule.

I did not check if it is possible or easy to d_instantiate() in kernfs
mkdir() etc like other filesystems do and I do not know if it would be
possible to enforce that as a strict vfs API rather than a recommendation.

Thanks,
Amir.
Al Viro Aug. 27, 2021, 10:22 a.m. UTC | #8
On Fri, Aug 27, 2021 at 12:36:23PM +0300, Amir Goldstein wrote:

> I did not check if it is possible or easy to d_instantiate() in kernfs
> mkdir() etc like other filesystems do and I do not know if it would be
> possible to enforce that as a strict vfs API rather than a recommendation.

For the second question: it would not.  E.g. a network filesystem has every
right not to give you the data you would need to set an inode up - just
"your mkdir had been successful".
diff mbox series

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 30d422b8c0fc..536db02cb26e 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -98,6 +98,14 @@  void fsnotify_sb_delete(struct super_block *sb)
 	fsnotify_clear_marks_by_sb(sb);
 }
 
+static struct super_block *fsnotify_data_sb(const void *data, int data_type)
+{
+	struct inode *inode = fsnotify_data_inode(data, data_type);
+	struct super_block *sb = inode ? inode->i_sb : NULL;
+
+	return sb;
+}
+
 /*
  * Given an inode, first check if we care what happens to our children.  Inotify
  * and dnotify both tell their parents about events.  If we care about any event
@@ -455,8 +463,10 @@  static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
  *		@file_name is relative to
  * @file_name:	optional file name associated with event
  * @inode:	optional inode associated with event -
- *		either @dir or @inode must be non-NULL.
- *		if both are non-NULL event may be reported to both.
+ *		If @dir and @inode are NULL, @data must have a type that
+ *		allows retrieving the file system associated with this
+ *		event.  if both are non-NULL event may be reported to
+ *		both.
  * @cookie:	inotify rename cookie
  */
 int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
@@ -483,7 +493,7 @@  int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
 		 */
 		parent = dir;
 	}
-	sb = inode->i_sb;
+	sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
 
 	/*
 	 * Optimization: srcu_read_lock() has a memory barrier which can