diff mbox series

fanotify: remove redundant capable(CAP_SYS_ADMIN)s

Message ID 20190522163150.16849-1-christian@brauner.io (mailing list archive)
State New, archived
Headers show
Series fanotify: remove redundant capable(CAP_SYS_ADMIN)s | expand

Commit Message

Christian Brauner May 22, 2019, 4:31 p.m. UTC
This removes two redundant capable(CAP_SYS_ADMIN) checks from
fanotify_init().
fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN) at the
beginning. So the other two capable(CAP_SYS_ADMIN) checks are not needed.

Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue depth")
Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max marks")
Signed-off-by: Christian Brauner <christian@brauner.io>
---
 fs/notify/fanotify/fanotify_user.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Amir Goldstein May 22, 2019, 6:29 p.m. UTC | #1
On Wed, May 22, 2019 at 7:32 PM Christian Brauner <christian@brauner.io> wrote:
>
> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> fanotify_init().
> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN) at the
> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not needed.

It's intentional:

commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
Author: Eric Paris <eparis@redhat.com>
Date:   Thu Oct 28 17:21:57 2010 -0400

    fanotify: limit the number of marks in a single fanotify group

    There is currently no limit on the number of marks a given fanotify group
    can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen as
    a serious DoS threat.  This patch implements a default of 8192, the same as
    inotify to work towards removing the CAP_SYS_ADMIN gating and eliminating
    the default DoS'able status.

    Signed-off-by: Eric Paris <eparis@redhat.com>

There idea is to eventually remove the gated CAP_SYS_ADMIN.
There is no reason that fanotify could not be used by unprivileged users
to setup inotify style watch on an inode or directories children, see:
https://patchwork.kernel.org/patch/10668299/

>
> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue depth")
> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max marks")

Fixes is used to tag bug fixes for stable.
There is no bug.

Thanks,
Amir.
Christian Brauner May 22, 2019, 6:57 p.m. UTC | #2
On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
>On Wed, May 22, 2019 at 7:32 PM Christian Brauner
><christian@brauner.io> wrote:
>>
>> This removes two redundant capable(CAP_SYS_ADMIN) checks from
>> fanotify_init().
>> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
>at the
>> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
>needed.
>
>It's intentional:
>
>commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
>Author: Eric Paris <eparis@redhat.com>
>Date:   Thu Oct 28 17:21:57 2010 -0400
>
>    fanotify: limit the number of marks in a single fanotify group
>
>There is currently no limit on the number of marks a given fanotify
>group
>can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
>as
>a serious DoS threat.  This patch implements a default of 8192, the
>same as
>inotify to work towards removing the CAP_SYS_ADMIN gating and
>eliminating
>    the default DoS'able status.
>
>    Signed-off-by: Eric Paris <eparis@redhat.com>
>
>There idea is to eventually remove the gated CAP_SYS_ADMIN.
>There is no reason that fanotify could not be used by unprivileged
>users
>to setup inotify style watch on an inode or directories children, see:
>https://patchwork.kernel.org/patch/10668299/
>
>>
>> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
>depth")
>> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
>marks")
>
>Fixes is used to tag bug fixes for stable.
>There is no bug.
>
>Thanks,
>Amir.

Interesting. When do you think the gate can be removed?
I was looking into switching from inotify to fanotify but since it's not useable from
non-initial userns it's a no-no
since we support nested workloads.
Amir Goldstein May 22, 2019, 8 p.m. UTC | #3
On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
>
> On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> ><christian@brauner.io> wrote:
> >>
> >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> >> fanotify_init().
> >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> >at the
> >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> >needed.
> >
> >It's intentional:
> >
> >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> >Author: Eric Paris <eparis@redhat.com>
> >Date:   Thu Oct 28 17:21:57 2010 -0400
> >
> >    fanotify: limit the number of marks in a single fanotify group
> >
> >There is currently no limit on the number of marks a given fanotify
> >group
> >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> >as
> >a serious DoS threat.  This patch implements a default of 8192, the
> >same as
> >inotify to work towards removing the CAP_SYS_ADMIN gating and
> >eliminating
> >    the default DoS'able status.
> >
> >    Signed-off-by: Eric Paris <eparis@redhat.com>
> >
> >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> >There is no reason that fanotify could not be used by unprivileged
> >users
> >to setup inotify style watch on an inode or directories children, see:
> >https://patchwork.kernel.org/patch/10668299/
> >
> >>
> >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> >depth")
> >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> >marks")
> >
> >Fixes is used to tag bug fixes for stable.
> >There is no bug.
> >
> >Thanks,
> >Amir.
>
> Interesting. When do you think the gate can be removed?

Nobody is working on this AFAIK.
What I posted was a simple POC, but I have no use case for this.
In the patchwork link above, Jan has listed the prerequisites for
removing the gate.

One of the prerequisites is FAN_REPORT_FID, which is now merged.
When events gets reported with fid instead of fd, unprivileged user
(hopefully) cannot use fid for privilege escalation.

> I was looking into switching from inotify to fanotify but since it's not usable from
> non-initial userns it's a no-no
> since we support nested workloads.

One of Jan's questions was what is the benefit of using inotify-compatible
fanotify vs. using inotify.
So what was the reason you were looking into switching from inotify to fanotify?
Is it because of mount/filesystem watch? Because making those available for
unprivileged user sounds risky...

Thanks,
Amir.
Christian Brauner May 23, 2019, 9:55 a.m. UTC | #4
On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > ><christian@brauner.io> wrote:
> > >>
> > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > >> fanotify_init().
> > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > >at the
> > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > >needed.
> > >
> > >It's intentional:
> > >
> > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > >Author: Eric Paris <eparis@redhat.com>
> > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > >
> > >    fanotify: limit the number of marks in a single fanotify group
> > >
> > >There is currently no limit on the number of marks a given fanotify
> > >group
> > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > >as
> > >a serious DoS threat.  This patch implements a default of 8192, the
> > >same as
> > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > >eliminating
> > >    the default DoS'able status.
> > >
> > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > >
> > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > >There is no reason that fanotify could not be used by unprivileged
> > >users
> > >to setup inotify style watch on an inode or directories children, see:
> > >https://patchwork.kernel.org/patch/10668299/
> > >
> > >>
> > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > >depth")
> > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > >marks")
> > >
> > >Fixes is used to tag bug fixes for stable.
> > >There is no bug.
> > >
> > >Thanks,
> > >Amir.
> >
> > Interesting. When do you think the gate can be removed?
> 
> Nobody is working on this AFAIK.
> What I posted was a simple POC, but I have no use case for this.
> In the patchwork link above, Jan has listed the prerequisites for
> removing the gate.
> 
> One of the prerequisites is FAN_REPORT_FID, which is now merged.
> When events gets reported with fid instead of fd, unprivileged user
> (hopefully) cannot use fid for privilege escalation.
> 
> > I was looking into switching from inotify to fanotify but since it's not usable from
> > non-initial userns it's a no-no
> > since we support nested workloads.
> 
> One of Jan's questions was what is the benefit of using inotify-compatible
> fanotify vs. using inotify.
> So what was the reason you were looking into switching from inotify to fanotify?
> Is it because of mount/filesystem watch? Because making those available for

Yeah. Well, I would need to look but you could probably do it safely for
filesystems mountable in user namespaces (which are few).
Can you do a bind-mount and then place a watch on the bind-mount or is
this superblock based?

Thanks!
Christian
Amir Goldstein May 23, 2019, 10:25 a.m. UTC | #5
On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > ><christian@brauner.io> wrote:
> > > >>
> > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > >> fanotify_init().
> > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > >at the
> > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > >needed.
> > > >
> > > >It's intentional:
> > > >
> > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > >Author: Eric Paris <eparis@redhat.com>
> > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > >
> > > >    fanotify: limit the number of marks in a single fanotify group
> > > >
> > > >There is currently no limit on the number of marks a given fanotify
> > > >group
> > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > >as
> > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > >same as
> > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > >eliminating
> > > >    the default DoS'able status.
> > > >
> > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > >
> > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > >There is no reason that fanotify could not be used by unprivileged
> > > >users
> > > >to setup inotify style watch on an inode or directories children, see:
> > > >https://patchwork.kernel.org/patch/10668299/
> > > >
> > > >>
> > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > >depth")
> > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > >marks")
> > > >
> > > >Fixes is used to tag bug fixes for stable.
> > > >There is no bug.
> > > >
> > > >Thanks,
> > > >Amir.
> > >
> > > Interesting. When do you think the gate can be removed?
> >
> > Nobody is working on this AFAIK.
> > What I posted was a simple POC, but I have no use case for this.
> > In the patchwork link above, Jan has listed the prerequisites for
> > removing the gate.
> >
> > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > When events gets reported with fid instead of fd, unprivileged user
> > (hopefully) cannot use fid for privilege escalation.
> >
> > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > non-initial userns it's a no-no
> > > since we support nested workloads.
> >
> > One of Jan's questions was what is the benefit of using inotify-compatible
> > fanotify vs. using inotify.
> > So what was the reason you were looking into switching from inotify to fanotify?
> > Is it because of mount/filesystem watch? Because making those available for
>
> Yeah. Well, I would need to look but you could probably do it safely for
> filesystems mountable in user namespaces (which are few).
> Can you do a bind-mount and then place a watch on the bind-mount or is
> this superblock based?
>

Either.
FAN_MARK_MOUNT was there from day 1 of fanotify.
FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.

But directory modification events that are supported since v5.1 are
not available
with FAN_MARK_MOUNT, see:
https://github.com/amir73il/man-pages/blob/fanotify_fid/man2/fanotify_init.2#L97

Matthew,

Perhaps this fact is worth a mention in the linked entry for FAN_REPORT_FID
in fanotify_init.2 in addition to the comment on the entry for FAN_MARK_MOUNT
in fanotify_mark.2.

Thanks,
Amir.
Christian Brauner May 23, 2019, 10:42 a.m. UTC | #6
On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > ><christian@brauner.io> wrote:
> > > > >>
> > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > >> fanotify_init().
> > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > >at the
> > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > >needed.
> > > > >
> > > > >It's intentional:
> > > > >
> > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > >Author: Eric Paris <eparis@redhat.com>
> > > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > > >
> > > > >    fanotify: limit the number of marks in a single fanotify group
> > > > >
> > > > >There is currently no limit on the number of marks a given fanotify
> > > > >group
> > > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > >as
> > > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > > >same as
> > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > >eliminating
> > > > >    the default DoS'able status.
> > > > >
> > > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > > >
> > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > >There is no reason that fanotify could not be used by unprivileged
> > > > >users
> > > > >to setup inotify style watch on an inode or directories children, see:
> > > > >https://patchwork.kernel.org/patch/10668299/
> > > > >
> > > > >>
> > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > >depth")
> > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > >marks")
> > > > >
> > > > >Fixes is used to tag bug fixes for stable.
> > > > >There is no bug.
> > > > >
> > > > >Thanks,
> > > > >Amir.
> > > >
> > > > Interesting. When do you think the gate can be removed?
> > >
> > > Nobody is working on this AFAIK.
> > > What I posted was a simple POC, but I have no use case for this.
> > > In the patchwork link above, Jan has listed the prerequisites for
> > > removing the gate.
> > >
> > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > When events gets reported with fid instead of fd, unprivileged user
> > > (hopefully) cannot use fid for privilege escalation.
> > >
> > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > non-initial userns it's a no-no
> > > > since we support nested workloads.
> > >
> > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > fanotify vs. using inotify.
> > > So what was the reason you were looking into switching from inotify to fanotify?
> > > Is it because of mount/filesystem watch? Because making those available for
> >
> > Yeah. Well, I would need to look but you could probably do it safely for
> > filesystems mountable in user namespaces (which are few).
> > Can you do a bind-mount and then place a watch on the bind-mount or is
> > this superblock based?
> >
> 
> Either.
> FAN_MARK_MOUNT was there from day 1 of fanotify.
> FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> 
> But directory modification events that are supported since v5.1 are
> not available
> with FAN_MARK_MOUNT, see:

Because you're worried about unprivileged users spying on events? Or
something else?
Because if you can do a bind-mount there's nothing preventing an
unprivileged user to do a hand-rolled recursive inotify that would
amount to the same thing anyway.
(And btw, v5.1 really is a major step forward and I would really like to
 use this api tbh.)

Christian

> https://github.com/amir73il/man-pages/blob/fanotify_fid/man2/fanotify_init.2#L97
> 
> Matthew,
> 
> Perhaps this fact is worth a mention in the linked entry for FAN_REPORT_FID
> in fanotify_init.2 in addition to the comment on the entry for FAN_MARK_MOUNT
> in fanotify_mark.2.
> 
> Thanks,
> Amir.
Amir Goldstein May 23, 2019, 11:40 a.m. UTC | #7
On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > ><christian@brauner.io> wrote:
> > > > > >>
> > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > >> fanotify_init().
> > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > >at the
> > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > >needed.
> > > > > >
> > > > > >It's intentional:
> > > > > >
> > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > >Author: Eric Paris <eparis@redhat.com>
> > > > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > > > >
> > > > > >    fanotify: limit the number of marks in a single fanotify group
> > > > > >
> > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > >group
> > > > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > >as
> > > > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > > > >same as
> > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > >eliminating
> > > > > >    the default DoS'able status.
> > > > > >
> > > > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > > > >
> > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > >users
> > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > >
> > > > > >>
> > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > >depth")
> > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > >marks")
> > > > > >
> > > > > >Fixes is used to tag bug fixes for stable.
> > > > > >There is no bug.
> > > > > >
> > > > > >Thanks,
> > > > > >Amir.
> > > > >
> > > > > Interesting. When do you think the gate can be removed?
> > > >
> > > > Nobody is working on this AFAIK.
> > > > What I posted was a simple POC, but I have no use case for this.
> > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > removing the gate.
> > > >
> > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > When events gets reported with fid instead of fd, unprivileged user
> > > > (hopefully) cannot use fid for privilege escalation.
> > > >
> > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > non-initial userns it's a no-no
> > > > > since we support nested workloads.
> > > >
> > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > fanotify vs. using inotify.
> > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > Is it because of mount/filesystem watch? Because making those available for
> > >
> > > Yeah. Well, I would need to look but you could probably do it safely for
> > > filesystems mountable in user namespaces (which are few).
> > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > this superblock based?
> > >
> >
> > Either.
> > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> >
> > But directory modification events that are supported since v5.1 are
> > not available
> > with FAN_MARK_MOUNT, see:
>
> Because you're worried about unprivileged users spying on events? Or
> something else?

Something else. The current fsnotify_move/create/delete() VFS hooks
have no path/mount information, so it is not possible to filter them by
mount only by inode/sb.
Fixing that would not be trivial, but first a strong use case would need
to be presented.

> Because if you can do a bind-mount there's nothing preventing an
> unprivileged user to do a hand-rolled recursive inotify that would
> amount to the same thing anyway.

There is. unprivileged user cannot traverse into directories it is not
allowed to read/search.

> (And btw, v5.1 really is a major step forward and I would really like to
>  use this api tbh.)
>

You haven't answered my question. What is the reason you are interested
in the new API? What does it provide that the old API does not?
I know the 2 APIs differ. I just want to know which difference interests *you*,
because without a strong use case, it will be hard for me to make progress
upstream.

Is what you want really a "bind-mount" watch or a "subtree watch"?
The distinction is important. I am thinking about solutions for the latter,
although there is no immediate solution in the horizon - only ideas.

Thanks,
Amir.
Christian Brauner May 23, 2019, 11:58 a.m. UTC | #8
On Thu, May 23, 2019 at 02:40:39PM +0300, Amir Goldstein wrote:
> On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > >
> > > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > > ><christian@brauner.io> wrote:
> > > > > > >>
> > > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > > >> fanotify_init().
> > > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > > >at the
> > > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > > >needed.
> > > > > > >
> > > > > > >It's intentional:
> > > > > > >
> > > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > > >Author: Eric Paris <eparis@redhat.com>
> > > > > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > > > > >
> > > > > > >    fanotify: limit the number of marks in a single fanotify group
> > > > > > >
> > > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > > >group
> > > > > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > > >as
> > > > > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > > > > >same as
> > > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > > >eliminating
> > > > > > >    the default DoS'able status.
> > > > > > >
> > > > > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > > > > >
> > > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > > >users
> > > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > > >
> > > > > > >>
> > > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > > >depth")
> > > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > > >marks")
> > > > > > >
> > > > > > >Fixes is used to tag bug fixes for stable.
> > > > > > >There is no bug.
> > > > > > >
> > > > > > >Thanks,
> > > > > > >Amir.
> > > > > >
> > > > > > Interesting. When do you think the gate can be removed?
> > > > >
> > > > > Nobody is working on this AFAIK.
> > > > > What I posted was a simple POC, but I have no use case for this.
> > > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > > removing the gate.
> > > > >
> > > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > > When events gets reported with fid instead of fd, unprivileged user
> > > > > (hopefully) cannot use fid for privilege escalation.
> > > > >
> > > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > > non-initial userns it's a no-no
> > > > > > since we support nested workloads.
> > > > >
> > > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > > fanotify vs. using inotify.
> > > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > > Is it because of mount/filesystem watch? Because making those available for
> > > >
> > > > Yeah. Well, I would need to look but you could probably do it safely for
> > > > filesystems mountable in user namespaces (which are few).
> > > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > > this superblock based?
> > > >
> > >
> > > Either.
> > > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> > >
> > > But directory modification events that are supported since v5.1 are
> > > not available
> > > with FAN_MARK_MOUNT, see:
> >
> > Because you're worried about unprivileged users spying on events? Or
> > something else?
> 
> Something else. The current fsnotify_move/create/delete() VFS hooks
> have no path/mount information, so it is not possible to filter them by
> mount only by inode/sb.
> Fixing that would not be trivial, but first a strong use case would need
> to be presented.
> 
> > Because if you can do a bind-mount there's nothing preventing an
> > unprivileged user to do a hand-rolled recursive inotify that would
> > amount to the same thing anyway.
> 
> There is. unprivileged user cannot traverse into directories it is not
> allowed to read/search.

Right, I should've mentioned: when you're userns root and you have
access to all files. The part that is interesting to me is getting rid
of capable(CAP_SYS_ADMIN).

> 
> > (And btw, v5.1 really is a major step forward and I would really like to
> >  use this api tbh.)
> >
> 
> You haven't answered my question. What is the reason you are interested
> in the new API? What does it provide that the old API does not?
> I know the 2 APIs differ. I just want to know which difference interests *you*,
> because without a strong use case, it will be hard for me to make progress
> upstream.
> 
> Is what you want really a "bind-mount" watch or a "subtree watch"?
> The distinction is important. I am thinking about solutions for the latter,
> although there is no immediate solution in the horizon - only ideas.

Both cases would be interesting. But subtree watch is what would
probably help a lot already. So let me explain.
For LXD - not sure if you know what that is - we allow user to "hotplug"
mounts or certain whitelisted devices into a user namespace container.
One of the nifty features is that we let users specify a "required"
property. When "required" is "false" the user can give us a path, e.g.
/bla/bla/bla/target and then we place a watch on the closest existing
ancestor of my-device. When the target shows up we hotplug it for the
user. Now, as you imagine maintaining that cache until "target" shows up
is a royal pain.

I think that we can get rid of at least some of the complexity if
subtree watch and bind-mount watches would work.

Thanks!
Christian
Amir Goldstein May 23, 2019, 1:16 p.m. UTC | #9
On Thu, May 23, 2019 at 2:58 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Thu, May 23, 2019 at 02:40:39PM +0300, Amir Goldstein wrote:
> > On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > > > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > >
> > > > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > > > ><christian@brauner.io> wrote:
> > > > > > > >>
> > > > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > > > >> fanotify_init().
> > > > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > > > >at the
> > > > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > > > >needed.
> > > > > > > >
> > > > > > > >It's intentional:
> > > > > > > >
> > > > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > > > >Author: Eric Paris <eparis@redhat.com>
> > > > > > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > > > > > >
> > > > > > > >    fanotify: limit the number of marks in a single fanotify group
> > > > > > > >
> > > > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > > > >group
> > > > > > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > > > >as
> > > > > > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > > > > > >same as
> > > > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > > > >eliminating
> > > > > > > >    the default DoS'able status.
> > > > > > > >
> > > > > > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > > > > > >
> > > > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > > > >users
> > > > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > > > >
> > > > > > > >>
> > > > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > > > >depth")
> > > > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > > > >marks")
> > > > > > > >
> > > > > > > >Fixes is used to tag bug fixes for stable.
> > > > > > > >There is no bug.
> > > > > > > >
> > > > > > > >Thanks,
> > > > > > > >Amir.
> > > > > > >
> > > > > > > Interesting. When do you think the gate can be removed?
> > > > > >
> > > > > > Nobody is working on this AFAIK.
> > > > > > What I posted was a simple POC, but I have no use case for this.
> > > > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > > > removing the gate.
> > > > > >
> > > > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > > > When events gets reported with fid instead of fd, unprivileged user
> > > > > > (hopefully) cannot use fid for privilege escalation.
> > > > > >
> > > > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > > > non-initial userns it's a no-no
> > > > > > > since we support nested workloads.
> > > > > >
> > > > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > > > fanotify vs. using inotify.
> > > > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > > > Is it because of mount/filesystem watch? Because making those available for
> > > > >
> > > > > Yeah. Well, I would need to look but you could probably do it safely for
> > > > > filesystems mountable in user namespaces (which are few).
> > > > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > > > this superblock based?
> > > > >
> > > >
> > > > Either.
> > > > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > > > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> > > >
> > > > But directory modification events that are supported since v5.1 are
> > > > not available
> > > > with FAN_MARK_MOUNT, see:
> > >
> > > Because you're worried about unprivileged users spying on events? Or
> > > something else?
> >
> > Something else. The current fsnotify_move/create/delete() VFS hooks
> > have no path/mount information, so it is not possible to filter them by
> > mount only by inode/sb.
> > Fixing that would not be trivial, but first a strong use case would need
> > to be presented.
> >
> > > Because if you can do a bind-mount there's nothing preventing an
> > > unprivileged user to do a hand-rolled recursive inotify that would
> > > amount to the same thing anyway.
> >
> > There is. unprivileged user cannot traverse into directories it is not
> > allowed to read/search.
>
> Right, I should've mentioned: when you're userns root and you have
> access to all files. The part that is interesting to me is getting rid
> of capable(CAP_SYS_ADMIN).

Indeed. so part of removing the gated capable(CAP_SYS_ADMIN)
is figuring out the permission checks needed for individual features.
I agree that for FAN_MARK_MOUNT/FILESYSTEM,
capabale(CAP_SYS_ADMIN) is too strong.
ns_capable(sb->s_user_ns, CAP_DAC_READ_SEARCH)
is probably enough.

>
> >
> > > (And btw, v5.1 really is a major step forward and I would really like to
> > >  use this api tbh.)
> > >
> >
> > You haven't answered my question. What is the reason you are interested
> > in the new API? What does it provide that the old API does not?
> > I know the 2 APIs differ. I just want to know which difference interests *you*,
> > because without a strong use case, it will be hard for me to make progress
> > upstream.
> >
> > Is what you want really a "bind-mount" watch or a "subtree watch"?
> > The distinction is important. I am thinking about solutions for the latter,
> > although there is no immediate solution in the horizon - only ideas.
>
> Both cases would be interesting. But subtree watch is what would
> probably help a lot already. So let me explain.
> For LXD - not sure if you know what that is -
I do

>  we allow user to "hotplug"
> mounts or certain whitelisted devices into a user namespace container.
> One of the nifty features is that we let users specify a "required"
> property. When "required" is "false" the user can give us a path, e.g.
> /bla/bla/bla/target and then we place a watch on the closest existing
> ancestor of my-device. When the target shows up we hotplug it for the
> user. Now, as you imagine maintaining that cache until "target" shows up
> is a royal pain.

You lost me there. Are you looking for notifications when device files appear?
When directory is created? Please give a concrete example.
What part of /bla/bla/bla/target appears, when and how.
fanotify does not give notifications when mounts are mounted.
I have seen a proposal by David Howells for mount change notifications.

Thanks,
Amir.
Christian Brauner May 23, 2019, 1:35 p.m. UTC | #10
On Thu, May 23, 2019 at 04:16:24PM +0300, Amir Goldstein wrote:
> On Thu, May 23, 2019 at 2:58 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Thu, May 23, 2019 at 02:40:39PM +0300, Amir Goldstein wrote:
> > > On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > > > > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > >
> > > > > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > >
> > > > > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > > > > ><christian@brauner.io> wrote:
> > > > > > > > >>
> > > > > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > > > > >> fanotify_init().
> > > > > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > > > > >at the
> > > > > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > > > > >needed.
> > > > > > > > >
> > > > > > > > >It's intentional:
> > > > > > > > >
> > > > > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > > > > >Author: Eric Paris <eparis@redhat.com>
> > > > > > > > >Date:   Thu Oct 28 17:21:57 2010 -0400
> > > > > > > > >
> > > > > > > > >    fanotify: limit the number of marks in a single fanotify group
> > > > > > > > >
> > > > > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > > > > >group
> > > > > > > > >can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > > > > >as
> > > > > > > > >a serious DoS threat.  This patch implements a default of 8192, the
> > > > > > > > >same as
> > > > > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > > > > >eliminating
> > > > > > > > >    the default DoS'able status.
> > > > > > > > >
> > > > > > > > >    Signed-off-by: Eric Paris <eparis@redhat.com>
> > > > > > > > >
> > > > > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > > > > >users
> > > > > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > > > > >depth")
> > > > > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > > > > >marks")
> > > > > > > > >
> > > > > > > > >Fixes is used to tag bug fixes for stable.
> > > > > > > > >There is no bug.
> > > > > > > > >
> > > > > > > > >Thanks,
> > > > > > > > >Amir.
> > > > > > > >
> > > > > > > > Interesting. When do you think the gate can be removed?
> > > > > > >
> > > > > > > Nobody is working on this AFAIK.
> > > > > > > What I posted was a simple POC, but I have no use case for this.
> > > > > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > > > > removing the gate.
> > > > > > >
> > > > > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > > > > When events gets reported with fid instead of fd, unprivileged user
> > > > > > > (hopefully) cannot use fid for privilege escalation.
> > > > > > >
> > > > > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > > > > non-initial userns it's a no-no
> > > > > > > > since we support nested workloads.
> > > > > > >
> > > > > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > > > > fanotify vs. using inotify.
> > > > > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > > > > Is it because of mount/filesystem watch? Because making those available for
> > > > > >
> > > > > > Yeah. Well, I would need to look but you could probably do it safely for
> > > > > > filesystems mountable in user namespaces (which are few).
> > > > > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > > > > this superblock based?
> > > > > >
> > > > >
> > > > > Either.
> > > > > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > > > > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> > > > >
> > > > > But directory modification events that are supported since v5.1 are
> > > > > not available
> > > > > with FAN_MARK_MOUNT, see:
> > > >
> > > > Because you're worried about unprivileged users spying on events? Or
> > > > something else?
> > >
> > > Something else. The current fsnotify_move/create/delete() VFS hooks
> > > have no path/mount information, so it is not possible to filter them by
> > > mount only by inode/sb.
> > > Fixing that would not be trivial, but first a strong use case would need
> > > to be presented.
> > >
> > > > Because if you can do a bind-mount there's nothing preventing an
> > > > unprivileged user to do a hand-rolled recursive inotify that would
> > > > amount to the same thing anyway.
> > >
> > > There is. unprivileged user cannot traverse into directories it is not
> > > allowed to read/search.
> >
> > Right, I should've mentioned: when you're userns root and you have
> > access to all files. The part that is interesting to me is getting rid
> > of capable(CAP_SYS_ADMIN).
> 
> Indeed. so part of removing the gated capable(CAP_SYS_ADMIN)
> is figuring out the permission checks needed for individual features.
> I agree that for FAN_MARK_MOUNT/FILESYSTEM,
> capabale(CAP_SYS_ADMIN) is too strong.
> ns_capable(sb->s_user_ns, CAP_DAC_READ_SEARCH)
> is probably enough.
> 
> >
> > >
> > > > (And btw, v5.1 really is a major step forward and I would really like to
> > > >  use this api tbh.)
> > > >
> > >
> > > You haven't answered my question. What is the reason you are interested
> > > in the new API? What does it provide that the old API does not?
> > > I know the 2 APIs differ. I just want to know which difference interests *you*,
> > > because without a strong use case, it will be hard for me to make progress
> > > upstream.
> > >
> > > Is what you want really a "bind-mount" watch or a "subtree watch"?
> > > The distinction is important. I am thinking about solutions for the latter,
> > > although there is no immediate solution in the horizon - only ideas.
> >
> > Both cases would be interesting. But subtree watch is what would
> > probably help a lot already. So let me explain.
> > For LXD - not sure if you know what that is -
> I do
> 
> >  we allow user to "hotplug"
> > mounts or certain whitelisted devices into a user namespace container.
> > One of the nifty features is that we let users specify a "required"
> > property. When "required" is "false" the user can give us a path, e.g.
> > /bla/bla/bla/target and then we place a watch on the closest existing
> > ancestor of my-device. When the target shows up we hotplug it for the
> > user. Now, as you imagine maintaining that cache until "target" shows up
> > is a royal pain.
> 
> You lost me there. Are you looking for notifications when device files appear?

Just when that file is created. fanotify doesn't need to do anything
special for device files. As long as a file (device or
otherwise) and directory creation/deletion triggers an event we're good.
I can then just stat the file and check that it's the device I expect.

> When directory is created? Please give a concrete example.
> What part of /bla/bla/bla/target appears, when and how.

So let's say the user tells me:
- When the "/A/B/C/target" file appears on the host filesystem,
  please give me access to "target" in the container at a path I tell
  you.
What I do right now is listen for the creation of the "target" file.
But at the time the user gives me instructions to listen for
"/A/B/C/target" only /A might exist and so I currently add a watch on A/
and then wait for the creation of B/, then wait for the creation of C/
and finally for the creation of "target" (Of course, I also need to
handle B/ and C/ being removed again an recreated and so on.). It would
be helpful, if I could specify, give me notifications, recursively for
e.g. A/ without me having to place extra watches on B/ and C/ when they
appear. Maybe that's out of scope...

> fanotify does not give notifications when mounts are mounted.
> I have seen a proposal by David Howells for mount change notifications.

David's going to send this out soon, I think.

Thanks!
Christian
Jan Kara May 23, 2019, 2:40 p.m. UTC | #11
On Thu 23-05-19 15:35:18, Christian Brauner wrote:
> So let's say the user tells me:
> - When the "/A/B/C/target" file appears on the host filesystem,
>   please give me access to "target" in the container at a path I tell
>   you.
> What I do right now is listen for the creation of the "target" file.
> But at the time the user gives me instructions to listen for
> "/A/B/C/target" only /A might exist and so I currently add a watch on A/
> and then wait for the creation of B/, then wait for the creation of C/
> and finally for the creation of "target" (Of course, I also need to
> handle B/ and C/ being removed again an recreated and so on.). It would
> be helpful, if I could specify, give me notifications, recursively for
> e.g. A/ without me having to place extra watches on B/ and C/ when they
> appear. Maybe that's out of scope...

I see. But this is going to be painful whatever you do. Consider for
example situation like:

mkdir -p BAR/B/C/
touch BAR/B/C/target
mv BAR A

Or even situation where several renames race so that the end result creates
the name (or does not create it depending on how renames race). And by the
time you decide A/B/C/target exists, it doesn't need to exist anymore.
Honestly I don't see how you want to implement *any* solution in a sane
way. About the most reliable+simple would seem to be stat "A/B/C/target"
once per second as dumb as it is.

								Honza
Christian Brauner May 23, 2019, 2:43 p.m. UTC | #12
On Thu, May 23, 2019 at 04:40:50PM +0200, Jan Kara wrote:
> On Thu 23-05-19 15:35:18, Christian Brauner wrote:
> > So let's say the user tells me:
> > - When the "/A/B/C/target" file appears on the host filesystem,
> >   please give me access to "target" in the container at a path I tell
> >   you.
> > What I do right now is listen for the creation of the "target" file.
> > But at the time the user gives me instructions to listen for
> > "/A/B/C/target" only /A might exist and so I currently add a watch on A/
> > and then wait for the creation of B/, then wait for the creation of C/
> > and finally for the creation of "target" (Of course, I also need to
> > handle B/ and C/ being removed again an recreated and so on.). It would
> > be helpful, if I could specify, give me notifications, recursively for
> > e.g. A/ without me having to place extra watches on B/ and C/ when they
> > appear. Maybe that's out of scope...
> 
> I see. But this is going to be painful whatever you do. Consider for
> example situation like:
> 
> mkdir -p BAR/B/C/
> touch BAR/B/C/target
> mv BAR A
> 
> Or even situation where several renames race so that the end result creates
> the name (or does not create it depending on how renames race). And by the
> time you decide A/B/C/target exists, it doesn't need to exist anymore.
> Honestly I don't see how you want to implement *any* solution in a sane
> way. About the most reliable+simple would seem to be stat "A/B/C/target"
> once per second as dumb as it is.

What we have kinda works rn good enough. And yes, it's inherently racy.
Basically, iirc we only watch that it exists once, then create the thing
for the container and then consider our job done. If that thing is
removed under us we don't really care.

Christian
Amir Goldstein May 23, 2019, 3:15 p.m. UTC | #13
On Thu, May 23, 2019 at 5:40 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 23-05-19 15:35:18, Christian Brauner wrote:
> > So let's say the user tells me:
> > - When the "/A/B/C/target" file appears on the host filesystem,
> >   please give me access to "target" in the container at a path I tell
> >   you.
> > What I do right now is listen for the creation of the "target" file.
> > But at the time the user gives me instructions to listen for
> > "/A/B/C/target" only /A might exist and so I currently add a watch on A/
> > and then wait for the creation of B/, then wait for the creation of C/
> > and finally for the creation of "target" (Of course, I also need to
> > handle B/ and C/ being removed again an recreated and so on.). It would
> > be helpful, if I could specify, give me notifications, recursively for
> > e.g. A/ without me having to place extra watches on B/ and C/ when they
> > appear. Maybe that's out of scope...
>
> I see. But this is going to be painful whatever you do. Consider for
> example situation like:
>
> mkdir -p BAR/B/C/
> touch BAR/B/C/target
> mv BAR A
>
> Or even situation where several renames race so that the end result creates
> the name (or does not create it depending on how renames race). And by the
> time you decide A/B/C/target exists, it doesn't need to exist anymore.
> Honestly I don't see how you want to implement *any* solution in a sane
> way. About the most reliable+simple would seem to be stat "A/B/C/target"
> once per second as dumb as it is.
>

Just wanted to point out that while looking at possible solutions for
"path based rules" for fanotify (i.e. subtree filter) I realized that the audit
subsystem already has a quite sophisticated mechanism to maintain
and enforce path based filesystem rules.

I do not love that code at all, I can hardly follow it, but if someone would
have wanted a way to be notified when an object of a given path name
appears or disappears from the namespace, it seems like something in
the kernel is already going to a great deal of effort to do that already.
Or maybe I am misunderstanding what this code does.

Thanks,
Amir.
Matthew Bobrowski June 5, 2019, 10:26 a.m. UTC | #14
On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:

...

> > > > Interesting. When do you think the gate can be removed?
> > >
> > > Nobody is working on this AFAIK.
> > > What I posted was a simple POC, but I have no use case for this.
> > > In the patchwork link above, Jan has listed the prerequisites for
> > > removing the gate.
> > >
> > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > When events gets reported with fid instead of fd, unprivileged user
> > > (hopefully) cannot use fid for privilege escalation.
> > >
> > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > non-initial userns it's a no-no
> > > > since we support nested workloads.
> > >
> > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > fanotify vs. using inotify.
> > > So what was the reason you were looking into switching from inotify to fanotify?
> > > Is it because of mount/filesystem watch? Because making those available for
> >
> > Yeah. Well, I would need to look but you could probably do it safely for
> > filesystems mountable in user namespaces (which are few).
> > Can you do a bind-mount and then place a watch on the bind-mount or is
> > this superblock based?
> >
> 
> Either.
> FAN_MARK_MOUNT was there from day 1 of fanotify.
> FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> 
> But directory modification events that are supported since v5.1 are
> not available
> with FAN_MARK_MOUNT, see:
> https://github.com/amir73il/man-pages/blob/fanotify_fid/man2/fanotify_init.2#L97
> 
> Matthew,
> 
> Perhaps this fact is worth a mention in the linked entry for FAN_REPORT_FID
> in fanotify_init.2 in addition to the comment on the entry for FAN_MARK_MOUNT
> in fanotify_mark.2.

Sorry, a little late to the party...

The fact being that directory modification events that are supported since v5.1
are not available when used in conjunction with FAN_MARK_MOUNT?
diff mbox series

Patch

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index a90bb19dcfa2..ec2739a66103 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -845,8 +845,6 @@  SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 
 	if (flags & FAN_UNLIMITED_QUEUE) {
 		fd = -EPERM;
-		if (!capable(CAP_SYS_ADMIN))
-			goto out_destroy_group;
 		group->max_events = UINT_MAX;
 	} else {
 		group->max_events = FANOTIFY_DEFAULT_MAX_EVENTS;
@@ -854,8 +852,6 @@  SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 
 	if (flags & FAN_UNLIMITED_MARKS) {
 		fd = -EPERM;
-		if (!capable(CAP_SYS_ADMIN))
-			goto out_destroy_group;
 		group->fanotify_data.max_marks = UINT_MAX;
 	} else {
 		group->fanotify_data.max_marks = FANOTIFY_DEFAULT_MAX_MARKS;