fsnotify_mark_srcu wtf?
diff mbox

Message ID CAOQ4uxhMuKT78c0fif0+wZcyPV=h78Vr_4yZ8vUERkKc3+zKQg@mail.gmail.com
State New
Headers show

Commit Message

Amir Goldstein Nov. 3, 2016, 10:25 a.m. UTC
On Thu, Nov 3, 2016 at 12:09 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> We've got a report where a fanotify daemon that implements permission checks
> screws up and doesn't send a reply.  This then causes widespread hangs due to
> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> fsnotify_mark_destroy_list() to block.
>
> Below program demonstrates the issue.  It should output a single line:
>
> close(inotify_fd): success
>
> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> the waiting permission event.
>
> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
>

IIUC, the purpose of fsnotify_mark_srcu is to protect the inode and vfsmount
mark lists, which are used to iterate the groups to send events to.
So you cannot make it per group as far as I can tell.

OTOH, specifically, for fanotify, once you can passed
fanotify_should_send_event(),
there is no need to keep a reference to the mark, so it may be safely destroyed,
you only need a reference to the group.

I tested this patch on top of my fanotify super block watch development branch
and it seems to fix the problem you reported, but I am not savvy enough with
srcu to say that this is correct.
Jan, what do you think?

From 28da34cdf9bf71fe9bbac13ded11a19da3b7a48e Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Thu, 3 Nov 2016 11:55:46 +0200
Subject: [PATCH] fanotify: handle permission events without holding
 fsnotify_mark_srcu

Handling fanotify events does not entail dereferencing fsnotify_mask
beyond the point of fanotify_should_send_event().

For the case of permission events, which may block indefinetely,
return -EAGAIN and then fsnotify() will call handle_event() again
without a reference to the mark.

Without a reference to the mark, there is no need to call
handle_event() under fsnotify_mark_srcu read side lock
which is protecting the inode/vfsmount mark lists.
We just need to hold a reference to the group

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 14 +++++++++++---
 fs/notify/fsnotify.c          | 26 ++++++++++++++++++++++----
 2 files changed, 33 insertions(+), 7 deletions(-)

        /* global tests shouldn't care about events on child only the
specific event */
@@ -282,15 +282,33 @@ int fsnotify(struct inode *to_tell, __u32 mask,
void *data, int data_is,
                ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
                                    data, data_is, cookie, file_name);

-               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
-                       goto out;
-
                if (inode_group)
                        inode_node = srcu_dereference(inode_node->next,
                                                      &fsnotify_mark_srcu);
                if (vfsmount_group)
                        vfsmount_node = srcu_dereference(vfsmount_node->next,
                                                         &fsnotify_mark_srcu);
+
+               if (ret != -EAGAIN)
+                       continue;
+
+               group = inode_group ? : vfsmount_group;
+               fsnotify_get_group(group);
+               srcu_read_unlock(&fsnotify_mark_srcu, idx);
+               /*
+                * Calling handle_event() again without reference to mark,
+                * so we do not need to call it under fsnotify_mark_srcu
+                * which is protecting the inode/vfsmount mark lists.
+                * We just need to hold a reference to the group
+                */
+               return group->ops->handle_event(group, to_tell, NULL, NULL,
+                                               mask, data, data_is,
+                                               file_name, cookie);
+               idx = srcu_read_lock(&fsnotify_mark_srcu);
+               fsnotify_put_group(group);
+
+               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
+                       goto out;
        }
        ret = 0;
 out:

Comments

Jan Kara Nov. 5, 2016, 9:43 p.m. UTC | #1
On Thu 03-11-16 12:25:13, Amir Goldstein wrote:
> On Thu, Nov 3, 2016 at 12:09 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > We've got a report where a fanotify daemon that implements permission checks
> > screws up and doesn't send a reply.  This then causes widespread hangs due to
> > fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> > called from e.g. inotify_release()-> fsnotify_destroy_group()->
> > fsnotify_mark_destroy_list() to block.
> >
> > Below program demonstrates the issue.  It should output a single line:
> >
> > close(inotify_fd): success
> >
> > Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> > the waiting permission event.
> >
> > Wouldn't making the srcu per-group fix this?  Would that be too expensive?
> >
> 
> IIUC, the purpose of fsnotify_mark_srcu is to protect the inode and vfsmount
> mark lists, which are used to iterate the groups to send events to.
> So you cannot make it per group as far as I can tell.
> 
> OTOH, specifically, for fanotify, once you can passed
> fanotify_should_send_event(),
> there is no need to keep a reference to the mark, so it may be safely destroyed,
> you only need a reference to the group.
> 
> I tested this patch on top of my fanotify super block watch development branch
> and it seems to fix the problem you reported, but I am not savvy enough with
> srcu to say that this is correct.
> Jan, what do you think?

So the way you've written it is definitely buggy. The inode_node and
vfsmount_node pointers may become invalid once you drop SRCU lock and so
you cannot dereference them even after regaining that lock. Also frankly
your solution looks a bit ugly. I'll think whether we can somehow fix the
problem...

								Honza

> 
> From 28da34cdf9bf71fe9bbac13ded11a19da3b7a48e Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Thu, 3 Nov 2016 11:55:46 +0200
> Subject: [PATCH] fanotify: handle permission events without holding
>  fsnotify_mark_srcu
> 
> Handling fanotify events does not entail dereferencing fsnotify_mask
> beyond the point of fanotify_should_send_event().
> 
> For the case of permission events, which may block indefinetely,
> return -EAGAIN and then fsnotify() will call handle_event() again
> without a reference to the mark.
> 
> Without a reference to the mark, there is no need to call
> handle_event() under fsnotify_mark_srcu read side lock
> which is protecting the inode/vfsmount mark lists.
> We just need to hold a reference to the group
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.c | 14 +++++++++++---
>  fs/notify/fsnotify.c          | 26 ++++++++++++++++++++++----
>  2 files changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 604e307..0ffa9ed 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -298,9 +298,17 @@ static int fanotify_handle_event(struct
> fsnotify_group *group,
>         BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
>         BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
> 
> -       if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask, data
> -                                       data_type))
> -               return 0;
> +       if (inode_mark || vfsmnt_mark) {
> +               if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
> +                                               data, data_type))
> +                       return 0;
> +
> +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +               /* Ask to be called again without a reference to mark */
> +               if (mask & FAN_ALL_PERM_EVENTS)
> +                       return -EAGAIN;
> +#endif
> +       }
> 
>         pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
>                  mask);
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 34bccbe..dc0c86e 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -199,7 +199,7 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
>  {
>         struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
>         struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
> -       struct fsnotify_group *inode_group, *vfsmount_group;
> +       struct fsnotify_group *inode_group, *vfsmount_group, *group;
>         struct mount *mnt;
>         int idx, ret = 0;
>         /* global tests shouldn't care about events on child only the
> specific event */
> @@ -282,15 +282,33 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
>                 ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
>                                     data, data_is, cookie, file_name);
> 
> -               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> -                       goto out;
> -
>                 if (inode_group)
>                         inode_node = srcu_dereference(inode_node->next,
>                                                       &fsnotify_mark_srcu);
>                 if (vfsmount_group)
>                         vfsmount_node = srcu_dereference(vfsmount_node->next,
>                                                          &fsnotify_mark_srcu);
> +
> +               if (ret != -EAGAIN)
> +                       continue;
> +
> +               group = inode_group ? : vfsmount_group;
> +               fsnotify_get_group(group);
> +               srcu_read_unlock(&fsnotify_mark_srcu, idx);
> +               /*
> +                * Calling handle_event() again without reference to mark,
> +                * so we do not need to call it under fsnotify_mark_srcu
> +                * which is protecting the inode/vfsmount mark lists.
> +                * We just need to hold a reference to the group
> +                */
> +               return group->ops->handle_event(group, to_tell, NULL, NULL,
> +                                               mask, data, data_is,
> +                                               file_name, cookie);
> +               idx = srcu_read_lock(&fsnotify_mark_srcu);
> +               fsnotify_put_group(group);
> +
> +               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> +                       goto out;
>         }
>         ret = 0;
>  out:
> -- 
> 2.7.4

Patch
diff mbox

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 604e307..0ffa9ed 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -298,9 +298,17 @@  static int fanotify_handle_event(struct
fsnotify_group *group,
        BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
        BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);

-       if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask, data
-                                       data_type))
-               return 0;
+       if (inode_mark || vfsmnt_mark) {
+               if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
+                                               data, data_type))
+                       return 0;
+
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+               /* Ask to be called again without a reference to mark */
+               if (mask & FAN_ALL_PERM_EVENTS)
+                       return -EAGAIN;
+#endif
+       }

        pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
                 mask);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 34bccbe..dc0c86e 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -199,7 +199,7 @@  int fsnotify(struct inode *to_tell, __u32 mask,
void *data, int data_is,
 {
        struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
        struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
-       struct fsnotify_group *inode_group, *vfsmount_group;
+       struct fsnotify_group *inode_group, *vfsmount_group, *group;
        struct mount *mnt;
        int idx, ret = 0;