mbox series

[v2,0/2] unprivileged fanotify listener

Message ID 20210304112921.3996419-1-amir73il@gmail.com (mailing list archive)
Headers show
Series unprivileged fanotify listener | expand

Message

Amir Goldstein March 4, 2021, 11:29 a.m. UTC
Jan,

These patches try to implement a minimal set and least controversial
functionality that we can allow for unprivileged users as a starting
point.

The patches were tested on top of v5.12-rc1 and the fanotify_merge
patches using the unprivileged listener LTP tests written by Matthew
and another LTP tests I wrote to test the sysfs tunable limits [1].

Thanks,
Amir.

Changes since v1:
- Dropped marks per group limit in favor of max per user
- Rename sysfs files from 'listener' to 'group' terminology
- Dropped internal group flag FANOTIFY_UNPRIV
- Limit unprivileged listener to FAN_REPORT_FID family
- Report event->pid depending on reader capabilities

[1] https://github.com/amir73il/ltp/commits/fanotify_unpriv

Amir Goldstein (2):
  fanotify: configurable limits via sysfs
  fanotify: support limited functionality for unprivileged users

 fs/notify/fanotify/fanotify.c      |  16 ++-
 fs/notify/fanotify/fanotify_user.c | 152 ++++++++++++++++++++++++-----
 fs/notify/fdinfo.c                 |   3 +-
 fs/notify/group.c                  |   1 -
 fs/notify/mark.c                   |   4 -
 include/linux/fanotify.h           |  36 ++++++-
 include/linux/fsnotify_backend.h   |   6 +-
 include/linux/sched/user.h         |   3 -
 include/linux/user_namespace.h     |   4 +
 kernel/sysctl.c                    |  12 ++-
 kernel/ucount.c                    |   4 +
 11 files changed, 194 insertions(+), 47 deletions(-)

Comments

Jan Kara March 16, 2021, 3:55 p.m. UTC | #1
On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> Jan,
> 
> These patches try to implement a minimal set and least controversial
> functionality that we can allow for unprivileged users as a starting
> point.
> 
> The patches were tested on top of v5.12-rc1 and the fanotify_merge
> patches using the unprivileged listener LTP tests written by Matthew
> and another LTP tests I wrote to test the sysfs tunable limits [1].

Thanks. I've added both patches to my tree.

								Honza

> 
> Thanks,
> Amir.
> 
> Changes since v1:
> - Dropped marks per group limit in favor of max per user
> - Rename sysfs files from 'listener' to 'group' terminology
> - Dropped internal group flag FANOTIFY_UNPRIV
> - Limit unprivileged listener to FAN_REPORT_FID family
> - Report event->pid depending on reader capabilities
> 
> [1] https://github.com/amir73il/ltp/commits/fanotify_unpriv
> 
> Amir Goldstein (2):
>   fanotify: configurable limits via sysfs
>   fanotify: support limited functionality for unprivileged users
> 
>  fs/notify/fanotify/fanotify.c      |  16 ++-
>  fs/notify/fanotify/fanotify_user.c | 152 ++++++++++++++++++++++++-----
>  fs/notify/fdinfo.c                 |   3 +-
>  fs/notify/group.c                  |   1 -
>  fs/notify/mark.c                   |   4 -
>  include/linux/fanotify.h           |  36 ++++++-
>  include/linux/fsnotify_backend.h   |   6 +-
>  include/linux/sched/user.h         |   3 -
>  include/linux/user_namespace.h     |   4 +
>  kernel/sysctl.c                    |  12 ++-
>  kernel/ucount.c                    |   4 +
>  11 files changed, 194 insertions(+), 47 deletions(-)
> 
> -- 
> 2.30.0
>
Amir Goldstein March 17, 2021, 11:01 a.m. UTC | #2
On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > Jan,
> >
> > These patches try to implement a minimal set and least controversial
> > functionality that we can allow for unprivileged users as a starting
> > point.
> >
> > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > patches using the unprivileged listener LTP tests written by Matthew
> > and another LTP tests I wrote to test the sysfs tunable limits [1].
>
> Thanks. I've added both patches to my tree.

Great!
I'll go post the LTP tests and work on the man page updates.

BTW, I noticed that you pushed the aggregating for_next branch,
but not the fsnotify topic branch.

Is this intentional?

I am asking because I am usually basing my development branches
off of your fsnotify branch, but I can base them on the unpushed branch.

Heads up. I am playing with extra privileges we may be able to
allow an ns_capable user.
For example, watching a FS_USERNS_MOUNT filesystem that the user
itself has mounted inside userns.

Another feature I am investigating is how to utilize the new idmapped
mounts to get a subtree watch functionality. This requires attaching a
userns to the group on fanotify_init().

<hand waving>
If the group's userns are the same or below the idmapped mount userns,
then all the objects accessed via that idmapped mount are accessible
to the group's userns admin. We can use that fact to filter events very
early based on their mnt_userns and the group's userns, which should be
cheaper than any subtree permission checks.
<\hand waving>

Thanks,
Amir.
Jan Kara March 17, 2021, 11:42 a.m. UTC | #3
On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > Jan,
> > >
> > > These patches try to implement a minimal set and least controversial
> > > functionality that we can allow for unprivileged users as a starting
> > > point.
> > >
> > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > patches using the unprivileged listener LTP tests written by Matthew
> > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> >
> > Thanks. I've added both patches to my tree.
> 
> Great!
> I'll go post the LTP tests and work on the man page updates.
> 
> BTW, I noticed that you pushed the aggregating for_next branch,
> but not the fsnotify topic branch.
> 
> Is this intentional?

Not really, pushed now. Thanks for reminder.

> I am asking because I am usually basing my development branches
> off of your fsnotify branch, but I can base them on the unpushed branch.
> 
> Heads up. I am playing with extra privileges we may be able to
> allow an ns_capable user.
> For example, watching a FS_USERNS_MOUNT filesystem that the user
> itself has mounted inside userns.
> 
> Another feature I am investigating is how to utilize the new idmapped
> mounts to get a subtree watch functionality. This requires attaching a
> userns to the group on fanotify_init().
> 
> <hand waving>
> If the group's userns are the same or below the idmapped mount userns,
> then all the objects accessed via that idmapped mount are accessible
> to the group's userns admin. We can use that fact to filter events very
> early based on their mnt_userns and the group's userns, which should be
> cheaper than any subtree permission checks.
> <\hand waving>

Yeah, I agree this should work. Just it seems to me the userbase for this
functionality will be (at least currently) rather limited. While full
subtree watches would be IMO interesting to much more users.

								Honza
Amir Goldstein March 17, 2021, 12:19 p.m. UTC | #4
On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > Jan,
> > > >
> > > > These patches try to implement a minimal set and least controversial
> > > > functionality that we can allow for unprivileged users as a starting
> > > > point.
> > > >
> > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > >
> > > Thanks. I've added both patches to my tree.
> >
> > Great!
> > I'll go post the LTP tests and work on the man page updates.
> >
> > BTW, I noticed that you pushed the aggregating for_next branch,
> > but not the fsnotify topic branch.
> >
> > Is this intentional?
>
> Not really, pushed now. Thanks for reminder.
>
> > I am asking because I am usually basing my development branches
> > off of your fsnotify branch, but I can base them on the unpushed branch.
> >
> > Heads up. I am playing with extra privileges we may be able to
> > allow an ns_capable user.
> > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > itself has mounted inside userns.
> >
> > Another feature I am investigating is how to utilize the new idmapped
> > mounts to get a subtree watch functionality. This requires attaching a
> > userns to the group on fanotify_init().
> >
> > <hand waving>
> > If the group's userns are the same or below the idmapped mount userns,
> > then all the objects accessed via that idmapped mount are accessible
> > to the group's userns admin. We can use that fact to filter events very
> > early based on their mnt_userns and the group's userns, which should be
> > cheaper than any subtree permission checks.
> > <\hand waving>
>
> Yeah, I agree this should work. Just it seems to me the userbase for this
> functionality will be (at least currently) rather limited. While full

That may change when systemd home dirs feature starts to use
idmapped mounts.
Being able to watch the user's entire home directory is a big win
already.

> subtree watches would be IMO interesting to much more users.

Agreed.

I was looking into that as well, using the example of nfsd_acceptable()
to implement the subtree permission check.

The problem here is that even if unprivileged users cannot compromise
security, they can still cause significant CPU overhead either queueing
events or filtering events and that is something I haven't been able to
figure out a way to escape from.

BUT, if you allow userns admin to setup subtree watches (a.k.a filtered
filesystem marks) on a userns filesystem/idmapped mount, now users
can watch subtrees in their home directories and other processes will
pay CPU penalty only for file access in the users home directories.

That might be acceptable.

Thanks,
Amir.
Christian Brauner March 17, 2021, 5:45 p.m. UTC | #5
On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote:
> On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > > Jan,
> > > > >
> > > > > These patches try to implement a minimal set and least controversial
> > > > > functionality that we can allow for unprivileged users as a starting
> > > > > point.
> > > > >
> > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > > >
> > > > Thanks. I've added both patches to my tree.
> > >
> > > Great!
> > > I'll go post the LTP tests and work on the man page updates.
> > >
> > > BTW, I noticed that you pushed the aggregating for_next branch,
> > > but not the fsnotify topic branch.
> > >
> > > Is this intentional?
> >
> > Not really, pushed now. Thanks for reminder.
> >
> > > I am asking because I am usually basing my development branches
> > > off of your fsnotify branch, but I can base them on the unpushed branch.
> > >
> > > Heads up. I am playing with extra privileges we may be able to
> > > allow an ns_capable user.
> > > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > > itself has mounted inside userns.
> > >
> > > Another feature I am investigating is how to utilize the new idmapped
> > > mounts to get a subtree watch functionality. This requires attaching a
> > > userns to the group on fanotify_init().
> > >
> > > <hand waving>
> > > If the group's userns are the same or below the idmapped mount userns,
> > > then all the objects accessed via that idmapped mount are accessible
> > > to the group's userns admin. We can use that fact to filter events very
> > > early based on their mnt_userns and the group's userns, which should be
> > > cheaper than any subtree permission checks.
> > > <\hand waving>
> >
> > Yeah, I agree this should work. Just it seems to me the userbase for this
> > functionality will be (at least currently) rather limited. While full
> 
> That may change when systemd home dirs feature starts to use
> idmapped mounts.
> Being able to watch the user's entire home directory is a big win
> already.

Hey Amir,
Hey Jan,

I think so too.

> 
> > subtree watches would be IMO interesting to much more users.
> 
> Agreed.

We have a use-case for subtree watches: One feature for containers we
have is that users can e.g. tell us that they want the container manager
to hotplug an arbitrary unix or block device into the container whenever
the relevant device shows up on the system. For example they could
instruct the container manager to plugin some new driver device when it
shows up in /dev. That works nicely because of uevents. But users quite
often also instruct us to plugin a path once it shows up in some
directory in the filesystem hierarchy and unplug it once it is removed.
Right now we're mainting an inotify-based hand-rolled recursive watch to
make this work so we detect that add and remove event. I would be wildly
excited if we could get rid of some of that complexity by using subtree
watches. The container manager on the host will be unaffected by this
feature since it will usually have root privileges and manage
unprivileged containers.
The unprivileged (userns use-case specifically here) subtree watches
will be necessary and really good to have to make this work for
container workloads and nested containers, i.e. where the container
manager itselfs runs in a container and starts new containres. Since the
subtree feature would be interesting for systemd itself and since our
container manager (ChromeOS etc.) runs systemd inside unprivileged
containers on a large scale it would be good if subtree watches could
work in userns too.

> 
> I was looking into that as well, using the example of nfsd_acceptable()
> to implement the subtree permission check.
> 
> The problem here is that even if unprivileged users cannot compromise
> security, they can still cause significant CPU overhead either queueing
> events or filtering events and that is something I haven't been able to
> figure out a way to escape from.
> 
> BUT, if you allow userns admin to setup subtree watches (a.k.a filtered
> filesystem marks) on a userns filesystem/idmapped mount, now users

I think that sounds reasonable.
If the mount really is idmapped, it might be interesting to consider
checking for privilege in the mnt_userns in addition to the regular
permission checks that fanotify performs. My (equally handwavy) thinking
is that this might allow for a nice feature where the creator of the
mount (e.g. systemd) can block the creation of subtree watches by
attaching a mnt_userns to the mnt that the user has no privilege in.
(Just a thought.).

Christian
Amir Goldstein March 17, 2021, 7:14 p.m. UTC | #6
On Wed, Mar 17, 2021 at 7:45 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote:
> > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > > > Jan,
> > > > > >
> > > > > > These patches try to implement a minimal set and least controversial
> > > > > > functionality that we can allow for unprivileged users as a starting
> > > > > > point.
> > > > > >
> > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > > > >
> > > > > Thanks. I've added both patches to my tree.
> > > >
> > > > Great!
> > > > I'll go post the LTP tests and work on the man page updates.
> > > >
> > > > BTW, I noticed that you pushed the aggregating for_next branch,
> > > > but not the fsnotify topic branch.
> > > >
> > > > Is this intentional?
> > >
> > > Not really, pushed now. Thanks for reminder.
> > >
> > > > I am asking because I am usually basing my development branches
> > > > off of your fsnotify branch, but I can base them on the unpushed branch.
> > > >
> > > > Heads up. I am playing with extra privileges we may be able to
> > > > allow an ns_capable user.
> > > > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > > > itself has mounted inside userns.
> > > >
> > > > Another feature I am investigating is how to utilize the new idmapped
> > > > mounts to get a subtree watch functionality. This requires attaching a
> > > > userns to the group on fanotify_init().
> > > >
> > > > <hand waving>
> > > > If the group's userns are the same or below the idmapped mount userns,
> > > > then all the objects accessed via that idmapped mount are accessible
> > > > to the group's userns admin. We can use that fact to filter events very
> > > > early based on their mnt_userns and the group's userns, which should be
> > > > cheaper than any subtree permission checks.
> > > > <\hand waving>
> > >
> > > Yeah, I agree this should work. Just it seems to me the userbase for this
> > > functionality will be (at least currently) rather limited. While full
> >
> > That may change when systemd home dirs feature starts to use
> > idmapped mounts.
> > Being able to watch the user's entire home directory is a big win
> > already.
>
> Hey Amir,
> Hey Jan,
>
> I think so too.
>
> >
> > > subtree watches would be IMO interesting to much more users.
> >
> > Agreed.
>
> We have a use-case for subtree watches: One feature for containers we
> have is that users can e.g. tell us that they want the container manager
> to hotplug an arbitrary unix or block device into the container whenever
> the relevant device shows up on the system. For example they could
> instruct the container manager to plugin some new driver device when it
> shows up in /dev. That works nicely because of uevents. But users quite
> often also instruct us to plugin a path once it shows up in some
> directory in the filesystem hierarchy and unplug it once it is removed.
> Right now we're mainting an inotify-based hand-rolled recursive watch to
> make this work so we detect that add and remove event. I would be wildly
> excited if we could get rid of some of that complexity by using subtree
> watches. The container manager on the host will be unaffected by this
> feature since it will usually have root privileges and manage
> unprivileged containers.
> The unprivileged (userns use-case specifically here) subtree watches
> will be necessary and really good to have to make this work for
> container workloads and nested containers, i.e. where the container
> manager itselfs runs in a container and starts new containres. Since the
> subtree feature would be interesting for systemd itself and since our
> container manager (ChromeOS etc.) runs systemd inside unprivileged
> containers on a large scale it would be good if subtree watches could
> work in userns too.
>

I don't understand the subtree watch use case.
You will have to walk me through it.

What exactly is the container manager trying to detect?
That a subdir of a specific name/path was created/deleted?
It doesn't sound like a recursive watch is needed for that.
What am I missing?

As for nested container managers (and systemd), my thinking is
that if all the mounts that manager is watching for serving its containers
are idmapped to that manager's userns (is that a viable option?), then
there shouldn't be a problem to setup userns filtered watches in order to
be notified on all the events that happen via those idmapped mounts
and filtering by "subtree" is not needed.
I am clearly far from understanding the big picture.

> >
> > I was looking into that as well, using the example of nfsd_acceptable()
> > to implement the subtree permission check.
> >
> > The problem here is that even if unprivileged users cannot compromise
> > security, they can still cause significant CPU overhead either queueing
> > events or filtering events and that is something I haven't been able to
> > figure out a way to escape from.
> >
> > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered
> > filesystem marks) on a userns filesystem/idmapped mount, now users
>
> I think that sounds reasonable.
> If the mount really is idmapped, it might be interesting to consider
> checking for privilege in the mnt_userns in addition to the regular
> permission checks that fanotify performs. My (equally handwavy) thinking
> is that this might allow for a nice feature where the creator of the
> mount (e.g. systemd) can block the creation of subtree watches by
> attaching a mnt_userns to the mnt that the user has no privilege in.
> (Just a thought.).
>

Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup
fanotify watches.
In linux-next, unprivileged user can already setup inode watches
(i.e. like inotify).

So I am not sure what you are referring to by "block the creation of
subtree watches".

If systemd were to idmap my home dir to mnt_userns where my user
has CAP_SYS_ADMIN, then allowing my user to setup a watch for
all events on that mount should not be too hard.
If you think that is useful and you want to play with this feature I can
provide a WIP branch soon.

Thanks,
Amir.
Christian Brauner March 18, 2021, 2:31 p.m. UTC | #7
On Wed, Mar 17, 2021 at 09:14:06PM +0200, Amir Goldstein wrote:
> On Wed, Mar 17, 2021 at 7:45 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote:
> > > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > > > > Jan,
> > > > > > >
> > > > > > > These patches try to implement a minimal set and least controversial
> > > > > > > functionality that we can allow for unprivileged users as a starting
> > > > > > > point.
> > > > > > >
> > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > > > > >
> > > > > > Thanks. I've added both patches to my tree.
> > > > >
> > > > > Great!
> > > > > I'll go post the LTP tests and work on the man page updates.
> > > > >
> > > > > BTW, I noticed that you pushed the aggregating for_next branch,
> > > > > but not the fsnotify topic branch.
> > > > >
> > > > > Is this intentional?
> > > >
> > > > Not really, pushed now. Thanks for reminder.
> > > >
> > > > > I am asking because I am usually basing my development branches
> > > > > off of your fsnotify branch, but I can base them on the unpushed branch.
> > > > >
> > > > > Heads up. I am playing with extra privileges we may be able to
> > > > > allow an ns_capable user.
> > > > > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > > > > itself has mounted inside userns.
> > > > >
> > > > > Another feature I am investigating is how to utilize the new idmapped
> > > > > mounts to get a subtree watch functionality. This requires attaching a
> > > > > userns to the group on fanotify_init().
> > > > >
> > > > > <hand waving>
> > > > > If the group's userns are the same or below the idmapped mount userns,
> > > > > then all the objects accessed via that idmapped mount are accessible
> > > > > to the group's userns admin. We can use that fact to filter events very
> > > > > early based on their mnt_userns and the group's userns, which should be
> > > > > cheaper than any subtree permission checks.
> > > > > <\hand waving>
> > > >
> > > > Yeah, I agree this should work. Just it seems to me the userbase for this
> > > > functionality will be (at least currently) rather limited. While full
> > >
> > > That may change when systemd home dirs feature starts to use
> > > idmapped mounts.
> > > Being able to watch the user's entire home directory is a big win
> > > already.
> >
> > Hey Amir,
> > Hey Jan,
> >
> > I think so too.
> >
> > >
> > > > subtree watches would be IMO interesting to much more users.
> > >
> > > Agreed.
> >
> > We have a use-case for subtree watches: One feature for containers we
> > have is that users can e.g. tell us that they want the container manager
> > to hotplug an arbitrary unix or block device into the container whenever
> > the relevant device shows up on the system. For example they could
> > instruct the container manager to plugin some new driver device when it
> > shows up in /dev. That works nicely because of uevents. But users quite
> > often also instruct us to plugin a path once it shows up in some
> > directory in the filesystem hierarchy and unplug it once it is removed.
> > Right now we're mainting an inotify-based hand-rolled recursive watch to
> > make this work so we detect that add and remove event. I would be wildly
> > excited if we could get rid of some of that complexity by using subtree
> > watches. The container manager on the host will be unaffected by this
> > feature since it will usually have root privileges and manage
> > unprivileged containers.
> > The unprivileged (userns use-case specifically here) subtree watches
> > will be necessary and really good to have to make this work for
> > container workloads and nested containers, i.e. where the container
> > manager itselfs runs in a container and starts new containres. Since the
> > subtree feature would be interesting for systemd itself and since our
> > container manager (ChromeOS etc.) runs systemd inside unprivileged
> > containers on a large scale it would be good if subtree watches could
> > work in userns too.
> >
> 
> I don't understand the subtree watch use case.
> You will have to walk me through it.
> 
> What exactly is the container manager trying to detect?
> That a subdir of a specific name/path was created/deleted?
> It doesn't sound like a recursive watch is needed for that.
> What am I missing?

Sorry if I was unclear. For example, a user may tell the container
manager to hotplug

/home/jdoe/some/path/

into the container. Users are free to tell the container manager that
that path doesn't need exist. At that poing the container manager will
start to mirror the first part of the path that does exist. And as soon
as the full path has been created the container manager will hotplug
that path as a new mount into the container. Similarly it will
remove that mount from the container as soon as the path is deleted from
the host.

So say the user tells the container manager to inject

/home/jdoe/some/path

into the container but only

/home/jdoe

currently exists then the container manager will recursively watch:

/home/jdoe

waiting for the full path to be created.

This is all a bit nasty since we need to ensure that we notice all
events. For example, the user could create

/home/jdoe/some

but then right after that

/home/jdoe/some

could be removed again. With the inotify listener we need to constantly
add (and remove iirc) watch fds and ensure that we never miss an event
and that's brittle. I'd rather have something that allows me to mirror

/home/jdoe

recursively directly. But maybe I'm misunderstanding fanotify and it
can't really help us but I thought that subtree watches might.

One of the reason't I didn't use fanotiy when we implemented this was
that it couldn't be used inside of user namespaces, i.e. CAP_SYS_ADMIN
in the initial userns was required.
We always make very sure that users can properly nest containers and
have almost all the same features available that they have with
non-nested containers. And since fanotify currently requires
CAP_SYS_ADMIN in the init userns it means a container manager running
inside a container wanting to hotplug paths for nested containers can't
use fanotify. 

(Btw, this is part of the code I wrote to implement this logic via
inotify a long time ago
https://github.com/lxc/lxd/blob/f12f03a4ba4645892ef6cc167c24da49d1217b02/lxd/device/device_utils_inotify.go
[I'm sorry you have to see this in case you click on it...])

> 
> As for nested container managers (and systemd), my thinking is
> that if all the mounts that manager is watching for serving its containers
> are idmapped to that manager's userns (is that a viable option?), then

Yes, it is possible. We do now support AT_RECURSIVE with all mount
attributes including idmapping mounts.

> there shouldn't be a problem to setup userns filtered watches in order to
> be notified on all the events that happen via those idmapped mounts
> and filtering by "subtree" is not needed.
> I am clearly far from understanding the big picture.

I think I need to refamiliarize myself with what "subtree" watches do.
Maybe I misunderstood what they do. I'll take a look.

> 
> > >
> > > I was looking into that as well, using the example of nfsd_acceptable()
> > > to implement the subtree permission check.
> > >
> > > The problem here is that even if unprivileged users cannot compromise
> > > security, they can still cause significant CPU overhead either queueing
> > > events or filtering events and that is something I haven't been able to
> > > figure out a way to escape from.
> > >
> > > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered
> > > filesystem marks) on a userns filesystem/idmapped mount, now users
> >
> > I think that sounds reasonable.
> > If the mount really is idmapped, it might be interesting to consider
> > checking for privilege in the mnt_userns in addition to the regular
> > permission checks that fanotify performs. My (equally handwavy) thinking
> > is that this might allow for a nice feature where the creator of the
> > mount (e.g. systemd) can block the creation of subtree watches by
> > attaching a mnt_userns to the mnt that the user has no privilege in.
> > (Just a thought.).
> >
> 
> Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup
> fanotify watches.
> In linux-next, unprivileged user can already setup inode watches
> (i.e. like inotify).

Just to clarify: you mean "unprivileged" as in non-root users in
init_user_ns and therefore also users in non-init userns. That's what
inotify allows you. This would probably allows us to use fanotify
instead of the hand-rolled recursive notify watching we currently do and
that I linked to above.

> 
> So I am not sure what you are referring to by "block the creation of
> subtree watches".
> 
> If systemd were to idmap my home dir to mnt_userns where my user
> has CAP_SYS_ADMIN, then allowing my user to setup a watch for
> all events on that mount should not be too hard.

Right, that was essentially what my comment was about.

> If you think that is useful and you want to play with this feature I can
> provide a WIP branch soon.

I would like to first play with the support for unprivileged fanotify
but sure, it does sound useful!

Christian
Jan Kara March 18, 2021, 3:44 p.m. UTC | #8
On Wed 17-03-21 14:19:57, Amir Goldstein wrote:
> On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-03-21 13:01:35, Amir Goldstein wrote:
> > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote:
> > > > > Jan,
> > > > >
> > > > > These patches try to implement a minimal set and least controversial
> > > > > functionality that we can allow for unprivileged users as a starting
> > > > > point.
> > > > >
> > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge
> > > > > patches using the unprivileged listener LTP tests written by Matthew
> > > > > and another LTP tests I wrote to test the sysfs tunable limits [1].
> > > >
> > > > Thanks. I've added both patches to my tree.
> > >
> > > Great!
> > > I'll go post the LTP tests and work on the man page updates.
> > >
> > > BTW, I noticed that you pushed the aggregating for_next branch,
> > > but not the fsnotify topic branch.
> > >
> > > Is this intentional?
> >
> > Not really, pushed now. Thanks for reminder.
> >
> > > I am asking because I am usually basing my development branches
> > > off of your fsnotify branch, but I can base them on the unpushed branch.
> > >
> > > Heads up. I am playing with extra privileges we may be able to
> > > allow an ns_capable user.
> > > For example, watching a FS_USERNS_MOUNT filesystem that the user
> > > itself has mounted inside userns.
> > >
> > > Another feature I am investigating is how to utilize the new idmapped
> > > mounts to get a subtree watch functionality. This requires attaching a
> > > userns to the group on fanotify_init().
> > >
> > > <hand waving>
> > > If the group's userns are the same or below the idmapped mount userns,
> > > then all the objects accessed via that idmapped mount are accessible
> > > to the group's userns admin. We can use that fact to filter events very
> > > early based on their mnt_userns and the group's userns, which should be
> > > cheaper than any subtree permission checks.
> > > <\hand waving>
> >
> > Yeah, I agree this should work. Just it seems to me the userbase for this
> > functionality will be (at least currently) rather limited. While full
> 
> That may change when systemd home dirs feature starts to use idmapped
> mounts. Being able to watch the user's entire home directory is a big
> win already.

Do you mean that home directory would be an extra mount with userns in
which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that
mount?

> > subtree watches would be IMO interesting to much more users.
> 
> Agreed.
> 
> I was looking into that as well, using the example of nfsd_acceptable()
> to implement the subtree permission check.
> 
> The problem here is that even if unprivileged users cannot compromise
> security, they can still cause significant CPU overhead either queueing
> events or filtering events and that is something I haven't been able to
> figure out a way to escape from.

WRT queueing overhead, given a user can place ~1M of directory watches, he
can cause noticable total overhead for queueing events anyway. Furthermore
the queue size is limited so unless the user spends time consuming events
as well, the load won't last long. But I agree we need to be careful not to
introduce too big latencies to operations generating events. So I think if
we could quickly detect whether a generated event has a good chance of
being relevant for some subtree watch of a group and queue it in that case
and worry about permission checks only once events are received and thus
receiver pays the cost of expensive checks, that might be fine as well.

								Honza
Amir Goldstein March 18, 2021, 4:48 p.m. UTC | #9
[...]

I understand the use case.

> I'd rather have something that allows me to mirror
>
> /home/jdoe
>
> recursively directly. But maybe I'm misunderstanding fanotify and it
> can't really help us but I thought that subtree watches might.
>

There are no subtree watches. They are still a holy grale for fanotify...
There are filesystem and mnt watches and the latter support far fewer
events (only events for operations that carry the path argument).

With filesystem watches, you can get events for all mkdirs and you can
figure out the created path, but you'd have to do all the filtering in
userspace.

What I am trying to create is "filtered" filesystem watches and the filter needs
to be efficient enough so the watcher will not incur too big of a penalty
on all the operations in the filesystem.

Thanks to your mnt_userns changes, implementing a filter to intercept
(say) mkdir calles on a specific mnt_userns should be quite simple, but
filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in
userspace.

This narrows the problem to the nested container manager that will only
need to filter events which happened via mounts under its control.

[...]

> > there shouldn't be a problem to setup userns filtered watches in order to
> > be notified on all the events that happen via those idmapped mounts
> > and filtering by "subtree" is not needed.
> > I am clearly far from understanding the big picture.
>
> I think I need to refamiliarize myself with what "subtree" watches do.
> Maybe I misunderstood what they do. I'll take a look.
>

You will not find them :-)

[...]

> > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup
> > fanotify watches.
> > In linux-next, unprivileged user can already setup inode watches
> > (i.e. like inotify).
>
> Just to clarify: you mean "unprivileged" as in non-root users in
> init_user_ns and therefore also users in non-init userns. That's what

Correct.

> inotify allows you. This would probably allows us to use fanotify
> instead of the hand-rolled recursive notify watching we currently do and
> that I linked to above.
>

The code that sits in linux-next can give you pretty much a drop-in
replacement of inotify and nothing more. See example code:
https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid

> > If you think that is useful and you want to play with this feature I can
> > provide a WIP branch soon.
>
> I would like to first play with the support for unprivileged fanotify
> but sure, it does sound useful!

Just so you have an idea what I am talking about, this is a very early
POC branch:
https://github.com/amir73il/linux/commits/fanotify_userns

It will not be very useful to you yet I think.
Userns admin can watch all events on a tmpfs/fuse mounted
inside its userns.
Userns admin can watch open/read/write/close events on an
idmapped mount mapped to its userns.
But I think the more useful feature would be to watch all events on
an idmapped mount mapped to its userns.

Thanks,
Amir.
Amir Goldstein March 18, 2021, 5:07 p.m. UTC | #10
> > That may change when systemd home dirs feature starts to use idmapped
> > mounts. Being able to watch the user's entire home directory is a big
> > win already.
>
> Do you mean that home directory would be an extra mount with userns in
> which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that
> mount?
>

That is what I meant.
My understanding of the systemd-homed use case for idmapped mounts is
that the user has CAP_SYS_ADMIN is the mapped userns, but I may be wrong.

> > > subtree watches would be IMO interesting to much more users.
> >
> > Agreed.
> >
> > I was looking into that as well, using the example of nfsd_acceptable()
> > to implement the subtree permission check.
> >
> > The problem here is that even if unprivileged users cannot compromise
> > security, they can still cause significant CPU overhead either queueing
> > events or filtering events and that is something I haven't been able to
> > figure out a way to escape from.
>
> WRT queueing overhead, given a user can place ~1M of directory watches, he
> can cause noticable total overhead for queueing events anyway. Furthermore

I suppose so. But a user placing 1M dir watches at least adds this overhead
knowingly. Adding a overhead on the entire filesystem when just wanting to
watch a small subtree doesn't sound ideal. Especially in very nested setups.
So yes, we need to be careful.

> the queue size is limited so unless the user spends time consuming events
> as well, the load won't last long. But I agree we need to be careful not to
> introduce too big latencies to operations generating events. So I think if
> we could quickly detect whether a generated event has a good chance of
> being relevant for some subtree watch of a group and queue it in that case
> and worry about permission checks only once events are received and thus
> receiver pays the cost of expensive checks, that might be fine as well.
>

So far the only idea I had for "quickly detect" which I cannot find flaws in
is to filter by mnt_userms, but its power is limited.

Thanks,
Amir.
Christian Brauner March 18, 2021, 6:40 p.m. UTC | #11
On Thu, Mar 18, 2021 at 07:07:00PM +0200, Amir Goldstein wrote:
> > > That may change when systemd home dirs feature starts to use idmapped
> > > mounts. Being able to watch the user's entire home directory is a big
> > > win already.
> >
> > Do you mean that home directory would be an extra mount with userns in
> > which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that
> > mount?
> >
> 
> That is what I meant.
> My understanding of the systemd-homed use case for idmapped mounts is
> that the user has CAP_SYS_ADMIN is the mapped userns, but I may be wrong.

systemd can simply create a new userns with the uid/gid of the target
user effectively delegating it (That's independent of actually writing a
uid gid mapping for the userns which will be done with privileges.) and
then attach it to that mount for the user.
Mine and Lennart's idea there so far has been that the creation would
likely be done by the user's session at login time

brauner     1346  0.0  0.0  20956  8512 ?        Ss   Mar03   0:03 /lib/systemd/systemd --user

and systemd as root would then take care of writing the mapping to the
userns and then attaching it to the mount. (I'll see Lennart in the next
few days and see what works best and once we're ready start a discussion
somwhere on a public list, I would suggest.)

(If systemd doesn't want a user to be able to monitor a mnt it can
simply create a userns with a different uid/gid but with the relevant
mapping. This was what my earlier point was about "blocking a user from
creating a subtree watch".)

Christian
Christian Brauner March 19, 2021, 1:40 p.m. UTC | #12
On Thu, Mar 18, 2021 at 06:48:11PM +0200, Amir Goldstein wrote:
> [...]
> 
> I understand the use case.
> 
> > I'd rather have something that allows me to mirror
> >
> > /home/jdoe
> >
> > recursively directly. But maybe I'm misunderstanding fanotify and it
> > can't really help us but I thought that subtree watches might.
> >
> 
> There are no subtree watches. They are still a holy grale for fanotify...
> There are filesystem and mnt watches and the latter support far fewer
> events (only events for operations that carry the path argument).
> 
> With filesystem watches, you can get events for all mkdirs and you can
> figure out the created path, but you'd have to do all the filtering in
> userspace.
> 
> What I am trying to create is "filtered" filesystem watches and the filter needs
> to be efficient enough so the watcher will not incur too big of a penalty
> on all the operations in the filesystem.
> 
> Thanks to your mnt_userns changes, implementing a filter to intercept
> (say) mkdir calles on a specific mnt_userns should be quite simple, but
> filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in
> userspace.
> 
> This narrows the problem to the nested container manager that will only
> need to filter events which happened via mounts under its control.
> 
> [...]
> 
> > > there shouldn't be a problem to setup userns filtered watches in order to
> > > be notified on all the events that happen via those idmapped mounts
> > > and filtering by "subtree" is not needed.
> > > I am clearly far from understanding the big picture.
> >
> > I think I need to refamiliarize myself with what "subtree" watches do.
> > Maybe I misunderstood what they do. I'll take a look.
> >
> 
> You will not find them :-)

Heh. :)

> 
> [...]
> 
> > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup
> > > fanotify watches.
> > > In linux-next, unprivileged user can already setup inode watches
> > > (i.e. like inotify).
> >
> > Just to clarify: you mean "unprivileged" as in non-root users in
> > init_user_ns and therefore also users in non-init userns. That's what
> 
> Correct.
> 
> > inotify allows you. This would probably allows us to use fanotify
> > instead of the hand-rolled recursive notify watching we currently do and
> > that I linked to above.
> >
> 
> The code that sits in linux-next can give you pretty much a drop-in
> replacement of inotify and nothing more. See example code:
> https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid

This is really great. Thank you for doing that work this will help quite
a lot of use-cases and make things way simpler. I created a TODO to port
our path-hotplug to this once this feature lands.

> 
> > > If you think that is useful and you want to play with this feature I can
> > > provide a WIP branch soon.
> >
> > I would like to first play with the support for unprivileged fanotify
> > but sure, it does sound useful!
> 
> Just so you have an idea what I am talking about, this is a very early
> POC branch:
> https://github.com/amir73il/linux/commits/fanotify_userns

Thanks!  I'll try to pull this and take a look next week. I hope that's
ok.

Christian
Amir Goldstein March 19, 2021, 2:21 p.m. UTC | #13
On Fri, Mar 19, 2021 at 3:40 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Mar 18, 2021 at 06:48:11PM +0200, Amir Goldstein wrote:
> > [...]
> >
> > I understand the use case.
> >
> > > I'd rather have something that allows me to mirror
> > >
> > > /home/jdoe
> > >
> > > recursively directly. But maybe I'm misunderstanding fanotify and it
> > > can't really help us but I thought that subtree watches might.
> > >
> >
> > There are no subtree watches. They are still a holy grale for fanotify...
> > There are filesystem and mnt watches and the latter support far fewer
> > events (only events for operations that carry the path argument).
> >
> > With filesystem watches, you can get events for all mkdirs and you can
> > figure out the created path, but you'd have to do all the filtering in
> > userspace.
> >
> > What I am trying to create is "filtered" filesystem watches and the filter needs
> > to be efficient enough so the watcher will not incur too big of a penalty
> > on all the operations in the filesystem.
> >
> > Thanks to your mnt_userns changes, implementing a filter to intercept
> > (say) mkdir calles on a specific mnt_userns should be quite simple, but
> > filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in
> > userspace.
> >
> > This narrows the problem to the nested container manager that will only
> > need to filter events which happened via mounts under its control.
> >
> > [...]
> >
> > > > there shouldn't be a problem to setup userns filtered watches in order to
> > > > be notified on all the events that happen via those idmapped mounts
> > > > and filtering by "subtree" is not needed.
> > > > I am clearly far from understanding the big picture.
> > >
> > > I think I need to refamiliarize myself with what "subtree" watches do.
> > > Maybe I misunderstood what they do. I'll take a look.
> > >
> >
> > You will not find them :-)
>
> Heh. :)
>
> >
> > [...]
> >
> > > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup
> > > > fanotify watches.
> > > > In linux-next, unprivileged user can already setup inode watches
> > > > (i.e. like inotify).
> > >
> > > Just to clarify: you mean "unprivileged" as in non-root users in
> > > init_user_ns and therefore also users in non-init userns. That's what
> >
> > Correct.
> >
> > > inotify allows you. This would probably allows us to use fanotify
> > > instead of the hand-rolled recursive notify watching we currently do and
> > > that I linked to above.
> > >
> >
> > The code that sits in linux-next can give you pretty much a drop-in
> > replacement of inotify and nothing more. See example code:
> > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
>
> This is really great. Thank you for doing that work this will help quite
> a lot of use-cases and make things way simpler. I created a TODO to port
> our path-hotplug to this once this feature lands.
>

FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel
and there were some build issues, so rebased my branch on upstream
inotify-tools to fix those build issues.

I was not aware that the inotify-tools project is alive, I never intended
to upstream this demo code and never created a github pull request
but rebasing on upstream brought in some CI scripts, when I pushed the
branch to my github it triggered some tests that reported build failures on
Ubuntu 16.04 and 18.04.

Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase
branch 'fanotify_name_fid'. You can try whichever works for you.

You can look at the test script src/test_demo.sh for usage example.
Or just cd into a writable directory and run the script to see the demo.
The demo determines whether to use a recursive watch or "global"
watch by the uid of the user.

> >
> > > > If you think that is useful and you want to play with this feature I can
> > > > provide a WIP branch soon.
> > >
> > > I would like to first play with the support for unprivileged fanotify
> > > but sure, it does sound useful!
> >
> > Just so you have an idea what I am talking about, this is a very early
> > POC branch:
> > https://github.com/amir73il/linux/commits/fanotify_userns
>
> Thanks!  I'll try to pull this and take a look next week. I hope that's
> ok.
>

Fine. I'm curious to know what it does.
Did not get to test it with userns yet :)

Thanks,
Amir.
Amir Goldstein March 20, 2021, 12:57 p.m. UTC | #14
> > > The code that sits in linux-next can give you pretty much a drop-in
> > > replacement of inotify and nothing more. See example code:
> > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
> >
> > This is really great. Thank you for doing that work this will help quite
> > a lot of use-cases and make things way simpler. I created a TODO to port
> > our path-hotplug to this once this feature lands.
> >
>
> FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel
> and there were some build issues, so rebased my branch on upstream
> inotify-tools to fix those build issues.
>
> I was not aware that the inotify-tools project is alive, I never intended
> to upstream this demo code and never created a github pull request
> but rebasing on upstream brought in some CI scripts, when I pushed the
> branch to my github it triggered some tests that reported build failures on
> Ubuntu 16.04 and 18.04.
>
> Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase
> branch 'fanotify_name_fid'. You can try whichever works for you.
>
> You can look at the test script src/test_demo.sh for usage example.
> Or just cd into a writable directory and run the script to see the demo.
> The demo determines whether to use a recursive watch or "global"
> watch by the uid of the user.
>
> > >
> > > > > If you think that is useful and you want to play with this feature I can
> > > > > provide a WIP branch soon.
> > > >
> > > > I would like to first play with the support for unprivileged fanotify
> > > > but sure, it does sound useful!
> > >
> > > Just so you have an idea what I am talking about, this is a very early
> > > POC branch:
> > > https://github.com/amir73il/linux/commits/fanotify_userns
> >
> > Thanks!  I'll try to pull this and take a look next week. I hope that's
> > ok.
> >
>
> Fine. I'm curious to know what it does.
> Did not get to test it with userns yet :)

Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
inside userns and works fine, with two wrinkles I needed to iron:

1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
    zero f_fsid (easy to fix)
2. open_by_handle_at() is not userns aware (can relax for
    FS_USERNS_MOUNT fs)

Pushed these two fixes to branch fanotify_userns.

Thanks,
Amir.
Amir Goldstein March 22, 2021, 12:44 p.m. UTC | #15
On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > The code that sits in linux-next can give you pretty much a drop-in
> > > > replacement of inotify and nothing more. See example code:
> > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
> > >
> > > This is really great. Thank you for doing that work this will help quite
> > > a lot of use-cases and make things way simpler. I created a TODO to port
> > > our path-hotplug to this once this feature lands.
> > >
> >
> > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel
> > and there were some build issues, so rebased my branch on upstream
> > inotify-tools to fix those build issues.
> >
> > I was not aware that the inotify-tools project is alive, I never intended
> > to upstream this demo code and never created a github pull request
> > but rebasing on upstream brought in some CI scripts, when I pushed the
> > branch to my github it triggered some tests that reported build failures on
> > Ubuntu 16.04 and 18.04.
> >
> > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase
> > branch 'fanotify_name_fid'. You can try whichever works for you.

FYI, fixed the CI build errors on fanotify_name_fid branch.

> >
> > You can look at the test script src/test_demo.sh for usage example.
> > Or just cd into a writable directory and run the script to see the demo.
> > The demo determines whether to use a recursive watch or "global"
> > watch by the uid of the user.
> >
> > > >
> > > > > > If you think that is useful and you want to play with this feature I can
> > > > > > provide a WIP branch soon.
> > > > >
> > > > > I would like to first play with the support for unprivileged fanotify
> > > > > but sure, it does sound useful!
> > > >
> > > > Just so you have an idea what I am talking about, this is a very early
> > > > POC branch:
> > > > https://github.com/amir73il/linux/commits/fanotify_userns
> > >
> > > Thanks!  I'll try to pull this and take a look next week. I hope that's
> > > ok.
> > >
> >
> > Fine. I'm curious to know what it does.
> > Did not get to test it with userns yet :)
>
> Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> inside userns and works fine, with two wrinkles I needed to iron:
>
> 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
>     zero f_fsid (easy to fix)
> 2. open_by_handle_at() is not userns aware (can relax for
>     FS_USERNS_MOUNT fs)
>
> Pushed these two fixes to branch fanotify_userns.

Pushed another fix to mnt refcount bug in WIP and another commit to
add the last piece that could make fanotify usable for systemd-homed
setup - a filesystem watch filtered by mnt_userns (not tested yet).

One thing I am struggling with is the language to describe user ns
and idmapped mounts related logic. I have a feeling that I am getting
the vocabulary all wrong. See my commit message text below.
Editorial corrections would be appreciated.

Thanks,
Amir.

---

    fanotify: support sb mark filtered by idmapped mount

    For a group created inside userns, adding a mount mark is allowed
    if the mount is idmapped inside the group's userns and adding a
    filesystem mark is allowed if the filesystem was mounted inside the
    group's userns.

    Allow also adding a filesystem mark on a filesystem that is mounted
    in the group's userns via an idmapped mount.

    In that case, the mark is created with an internal flag, which indicates
    that events should be sent to the group only if they happened via an
    idmapped mount inside the group's userns.

    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Christian Brauner March 22, 2021, 4:28 p.m. UTC | #16
On Mon, Mar 22, 2021 at 02:44:20PM +0200, Amir Goldstein wrote:
> On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > > The code that sits in linux-next can give you pretty much a drop-in
> > > > > replacement of inotify and nothing more. See example code:
> > > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
> > > >
> > > > This is really great. Thank you for doing that work this will help quite
> > > > a lot of use-cases and make things way simpler. I created a TODO to port
> > > > our path-hotplug to this once this feature lands.
> > > >
> > >
> > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel
> > > and there were some build issues, so rebased my branch on upstream
> > > inotify-tools to fix those build issues.
> > >
> > > I was not aware that the inotify-tools project is alive, I never intended
> > > to upstream this demo code and never created a github pull request
> > > but rebasing on upstream brought in some CI scripts, when I pushed the
> > > branch to my github it triggered some tests that reported build failures on
> > > Ubuntu 16.04 and 18.04.
> > >
> > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase
> > > branch 'fanotify_name_fid'. You can try whichever works for you.
> 
> FYI, fixed the CI build errors on fanotify_name_fid branch.
> 
> > >
> > > You can look at the test script src/test_demo.sh for usage example.
> > > Or just cd into a writable directory and run the script to see the demo.
> > > The demo determines whether to use a recursive watch or "global"
> > > watch by the uid of the user.
> > >
> > > > >
> > > > > > > If you think that is useful and you want to play with this feature I can
> > > > > > > provide a WIP branch soon.
> > > > > >
> > > > > > I would like to first play with the support for unprivileged fanotify
> > > > > > but sure, it does sound useful!
> > > > >
> > > > > Just so you have an idea what I am talking about, this is a very early
> > > > > POC branch:
> > > > > https://github.com/amir73il/linux/commits/fanotify_userns
> > > >
> > > > Thanks!  I'll try to pull this and take a look next week. I hope that's
> > > > ok.
> > > >
> > >
> > > Fine. I'm curious to know what it does.
> > > Did not get to test it with userns yet :)
> >
> > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > inside userns and works fine, with two wrinkles I needed to iron:
> >
> > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> >     zero f_fsid (easy to fix)
> > 2. open_by_handle_at() is not userns aware (can relax for
> >     FS_USERNS_MOUNT fs)
> >
> > Pushed these two fixes to branch fanotify_userns.
> 
> Pushed another fix to mnt refcount bug in WIP and another commit to
> add the last piece that could make fanotify usable for systemd-homed
> setup - a filesystem watch filtered by mnt_userns (not tested yet).

Sounds interesting.

So I'm looking and commenting on that branch a little.
One general question, when fanotify FANOTIFY_PERM_EVENTS is set fanotify
will return a file descriptor (for relevant events) referring to the
file/directory that e.g. got created. And there are no permissions
checks other than the capable(CAP_SYS_ADMIN) check when the fanotify
instance is created, right?

> 
> One thing I am struggling with is the language to describe user ns
> and idmapped mounts related logic. I have a feeling that I am getting
> the vocabulary all wrong. See my commit message text below.

The lingo seems legit. :)

I need some time thinking about the concepts to convince myself that
this is safe. But I like the direction as this is going to be really
useful.

Christian
Amir Goldstein March 22, 2021, 5:22 p.m. UTC | #17
On Mon, Mar 22, 2021 at 6:28 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Mon, Mar 22, 2021 at 02:44:20PM +0200, Amir Goldstein wrote:
> > On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > > The code that sits in linux-next can give you pretty much a drop-in
> > > > > > replacement of inotify and nothing more. See example code:
> > > > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
> > > > >
> > > > > This is really great. Thank you for doing that work this will help quite
> > > > > a lot of use-cases and make things way simpler. I created a TODO to port
> > > > > our path-hotplug to this once this feature lands.
> > > > >
> > > >
> > > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel
> > > > and there were some build issues, so rebased my branch on upstream
> > > > inotify-tools to fix those build issues.
> > > >
> > > > I was not aware that the inotify-tools project is alive, I never intended
> > > > to upstream this demo code and never created a github pull request
> > > > but rebasing on upstream brought in some CI scripts, when I pushed the
> > > > branch to my github it triggered some tests that reported build failures on
> > > > Ubuntu 16.04 and 18.04.
> > > >
> > > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase
> > > > branch 'fanotify_name_fid'. You can try whichever works for you.
> >
> > FYI, fixed the CI build errors on fanotify_name_fid branch.
> >
> > > >
> > > > You can look at the test script src/test_demo.sh for usage example.
> > > > Or just cd into a writable directory and run the script to see the demo.
> > > > The demo determines whether to use a recursive watch or "global"
> > > > watch by the uid of the user.
> > > >
> > > > > >
> > > > > > > > If you think that is useful and you want to play with this feature I can
> > > > > > > > provide a WIP branch soon.
> > > > > > >
> > > > > > > I would like to first play with the support for unprivileged fanotify
> > > > > > > but sure, it does sound useful!
> > > > > >
> > > > > > Just so you have an idea what I am talking about, this is a very early
> > > > > > POC branch:
> > > > > > https://github.com/amir73il/linux/commits/fanotify_userns
> > > > >
> > > > > Thanks!  I'll try to pull this and take a look next week. I hope that's
> > > > > ok.
> > > > >
> > > >
> > > > Fine. I'm curious to know what it does.
> > > > Did not get to test it with userns yet :)
> > >
> > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > > inside userns and works fine, with two wrinkles I needed to iron:
> > >
> > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> > >     zero f_fsid (easy to fix)
> > > 2. open_by_handle_at() is not userns aware (can relax for
> > >     FS_USERNS_MOUNT fs)
> > >
> > > Pushed these two fixes to branch fanotify_userns.
> >
> > Pushed another fix to mnt refcount bug in WIP and another commit to
> > add the last piece that could make fanotify usable for systemd-homed
> > setup - a filesystem watch filtered by mnt_userns (not tested yet).
>
> Sounds interesting.
>
> So I'm looking and commenting on that branch a little.
> One general question, when fanotify FANOTIFY_PERM_EVENTS is set fanotify
> will return a file descriptor (for relevant events) referring to the
> file/directory that e.g. got created. And there are no permissions
> checks other than the capable(CAP_SYS_ADMIN) check when the fanotify
> instance is created, right?
>

Correct.
fanotify_init() enforces that in a few maybe not so obvious steps:

1. Either CAP_SYS_ADMIN or fid_mode (no file descriptor in event):

        if (!capable(CAP_SYS_ADMIN)) {
                /*
                 * An unprivileged user can setup an fanotify group with
                 * limited functionality - an unprivileged group is limited to
                 * notification events with file handles and it cannot use
                 * unlimited queue/marks.
                 */
                if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
                        return -EPERM;
        }

2. fid_mode is not supported for high priority classes:

        if (fid_mode && class != FAN_CLASS_NOTIF)
                return -EINVAL;

3. Permission events are only allowed for high priority classes:
        /*
         * group->priority == FS_PRIO_0 == FAN_CLASS_NOTIF.  These are not
         * allowed to set permissions events.
         */
        ret = -EINVAL;
        if (mask & FANOTIFY_PERM_EVENTS &&
            group->priority == FS_PRIO_0)

You may want to look at the summary of all the limitations on
unprivileged listener here:
https://github.com/amir73il/man-pages/blob/fanotify_unpriv/man2/fanotify_init.2#L400

Thanks,
Amir.
Amir Goldstein March 22, 2021, 6:38 p.m. UTC | #18
> > > The problem here is that even if unprivileged users cannot compromise
> > > security, they can still cause significant CPU overhead either queueing
> > > events or filtering events and that is something I haven't been able to
> > > figure out a way to escape from.
> >
> > WRT queueing overhead, given a user can place ~1M of directory watches, he
> > can cause noticable total overhead for queueing events anyway. Furthermore
>
> I suppose so. But a user placing 1M dir watches at least adds this overhead
> knowingly. Adding a overhead on the entire filesystem when just wanting to
> watch a small subtree doesn't sound ideal. Especially in very nested setups.
> So yes, we need to be careful.
>

I was thinking about this some more and I think the answer is in your example.
User can only place 1M dir watches if ucount marks limits permits it.

So whatever we allow to do with subtree or userns filtered marks should
also be limited by ucounts.

> > the queue size is limited so unless the user spends time consuming events
> > as well, the load won't last long. But I agree we need to be careful not to
> > introduce too big latencies to operations generating events. So I think if
> > we could quickly detect whether a generated event has a good chance of
> > being relevant for some subtree watch of a group and queue it in that case
> > and worry about permission checks only once events are received and thus
> > receiver pays the cost of expensive checks, that might be fine as well.
> >
>
> So far the only idea I had for "quickly detect" which I cannot find flaws in
> is to filter by mnt_userms, but its power is limited.
>

So I have implemented this idea on fanotify_userns branch and the cost
per "filtered" sb mark is quite low - its a pretty cheap check in
send_to_group()
But still, if an unbound number of users can add to the sb mark list it is
not going to end well.

<hand waving>
I think what we need here (thinking out loud) is to account the sb marks
to the user that mounted the filesystem or to the user mapped to admin using
idmapped mount, maybe to both(?), probably using a separate ucount entry
(e.g. max_fanotify_filesystem_marks).

We can set this limit by default to a small number (128?) to limit the sb list
iteration per filesystem event and container manager / systemd can further
limit this resource when creating idmapped mounts, which would otherwise
allow the mapped user to add "filtered" (*) sb marks.
</hand waving>

Thanks,
Amir.

(*) "filtered" can refer to both the userns filter I proposed and going forward
     also maybe to subtree filter
Jan Kara March 24, 2021, 11:48 a.m. UTC | #19
On Mon 22-03-21 20:38:32, Amir Goldstein wrote:
> > > > The problem here is that even if unprivileged users cannot compromise
> > > > security, they can still cause significant CPU overhead either queueing
> > > > events or filtering events and that is something I haven't been able to
> > > > figure out a way to escape from.
> > >
> > > WRT queueing overhead, given a user can place ~1M of directory watches, he
> > > can cause noticable total overhead for queueing events anyway. Furthermore
> >
> > I suppose so. But a user placing 1M dir watches at least adds this overhead
> > knowingly. Adding a overhead on the entire filesystem when just wanting to
> > watch a small subtree doesn't sound ideal. Especially in very nested setups.
> > So yes, we need to be careful.
> >
> 
> I was thinking about this some more and I think the answer is in your example.
> User can only place 1M dir watches if ucount marks limits permits it.
> 
> So whatever we allow to do with subtree or userns filtered marks should
> also be limited by ucounts.

Yes, agreed.

> > > the queue size is limited so unless the user spends time consuming events
> > > as well, the load won't last long. But I agree we need to be careful not to
> > > introduce too big latencies to operations generating events. So I think if
> > > we could quickly detect whether a generated event has a good chance of
> > > being relevant for some subtree watch of a group and queue it in that case
> > > and worry about permission checks only once events are received and thus
> > > receiver pays the cost of expensive checks, that might be fine as well.
> > >
> >
> > So far the only idea I had for "quickly detect" which I cannot find flaws in
> > is to filter by mnt_userms, but its power is limited.
> 
> So I have implemented this idea on fanotify_userns branch and the cost
> per "filtered" sb mark is quite low - its a pretty cheap check in
> send_to_group()
> But still, if an unbound number of users can add to the sb mark list it is
> not going to end well.

Thinking out loud: So what is the cost going to be for the side generating
events? Ideally it would of O(number of fanotify groups receiving event).
We cannot get better than that and if the constants aren't too big I think
this is acceptable overhead. Sure this can mean total work of (number of
users) * (max number of subtree marks per user) for queueing notification
event but I don't think it is practical for a DoS attack and I also don't
think that in practice users will be watching overlapping subtrees that
much.

The question is whether we can get that fast. Probably not because that
would have to attach subtree watches to directory inodes or otherwise
filter out unrelated fanotify groups in O(1). But the complexity of
O(number of groups receiving events + depth of dir where event is happening)
is probably achievable - we'd walk the tree up and have roots of watched
subtrees marked. What do you think?

Also there is a somewhat related question what is the semantics of subtree
watches in presence of mounts - do subtree watches "see through" mount
points? Probably not but then with bind mounts this can be sometimes
inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and
I'm watching subtree of /var, I would not get events for what's in
/var/tmp... Which is logical if you spell it out like this but applications
often don't care how the mount hierarchy looks like, they just care about
locally visible directory structure.

> <hand waving>
> I think what we need here (thinking out loud) is to account the sb marks
> to the user that mounted the filesystem or to the user mapped to admin using
> idmapped mount, maybe to both(?), probably using a separate ucount entry
> (e.g. max_fanotify_filesystem_marks).

I'm somewhat lost here. Are these two users different? We have /home/foo
which is a mounted filesystem. AFAIU it will be mounted in a special user
namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping
attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now
we have a user - let's call it 'foo-usr' that has enough capabilities
(whatever they are) in 'foo-ns' to place fanotify subtree marks in
/home/foo. So these marks are naturally accounted towards 'foo-usr'. To
whom else you'd like to also account these marks and why?

> We can set this limit by default to a small number (128?) to limit the sb list
> iteration per filesystem event and container manager / systemd can further
> limit this resource when creating idmapped mounts, which would otherwise
> allow the mapped user to add "filtered" (*) sb marks.
> </hand waving>
>
> (*) "filtered" can refer to both the userns filter I proposed and going forward
>      also maybe to subtree filter

								Honza
Amir Goldstein March 24, 2021, 1:57 p.m. UTC | #20
> > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > inside userns and works fine, with two wrinkles I needed to iron:
> >
> > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> >     zero f_fsid (easy to fix)
> > 2. open_by_handle_at() is not userns aware (can relax for
> >     FS_USERNS_MOUNT fs)
> >
> > Pushed these two fixes to branch fanotify_userns.
>
> Pushed another fix to mnt refcount bug in WIP and another commit to
> add the last piece that could make fanotify usable for systemd-homed
> setup - a filesystem watch filtered by mnt_userns (not tested yet).
>

Now I used mount-idmapped (from xfstest) to test that last piece.
Found a minor bug and pushed a fix.

It is working as expected, that is filtering only the events generated via
the idmapped mount. However, because the listener I tested is capable in
the mapped userns and not in the sb userns, the listener cannot
open_ny_handle_at(), so the result is not as useful as one might hope.

I guess we will also need to make open_by_handle_at() idmapped aware
and use a variant of vfs_dentry_acceptable() that validates that the opened
path is legitimately accessible via the idmapped mount.

I think I will leave this complexity to you should you think the userns filtered
watch is something worth the effort.

Thanks,
Amir.
Christian Brauner March 24, 2021, 2:32 p.m. UTC | #21
On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote:
> > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > > inside userns and works fine, with two wrinkles I needed to iron:
> > >
> > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> > >     zero f_fsid (easy to fix)
> > > 2. open_by_handle_at() is not userns aware (can relax for
> > >     FS_USERNS_MOUNT fs)
> > >
> > > Pushed these two fixes to branch fanotify_userns.
> >
> > Pushed another fix to mnt refcount bug in WIP and another commit to
> > add the last piece that could make fanotify usable for systemd-homed
> > setup - a filesystem watch filtered by mnt_userns (not tested yet).
> >
> 
> Now I used mount-idmapped (from xfstest) to test that last piece.
> Found a minor bug and pushed a fix.
> 
> It is working as expected, that is filtering only the events generated via
> the idmapped mount. However, because the listener I tested is capable in
> the mapped userns and not in the sb userns, the listener cannot
> open_ny_handle_at(), so the result is not as useful as one might hope.

This is another dumb question probably but in general, are you saying
that someone watching a mount or directory and does _not_ want file
descriptors from fanotify to be returned has no other way of getting to
the path they want to open other than by using open_by_handle_at()?

> 
> I guess we will also need to make open_by_handle_at() idmapped aware
> and use a variant of vfs_dentry_acceptable() that validates that the opened
> path is legitimately accessible via the idmapped mount.

So as a first step, I think there's a legitimate case to be made for
open_by_handle_at() to be made useable inside user namespaces. That's a
change worth to be made independent of fanotify. For example, nowadays
cgroups have a 64 bit identifier that can be used with open_by_handle_at
to map a cgrp id to a path and back:
https://lkml.org/lkml/2020/12/2/1126
Right now this can't be used in user namespaces because of this
restriction but it is genuinely useful to have this feature available
since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping
is very convenient.
Without looking at the code I'm not super sure how name_to_handle_at()
and open_by_handle_at() behave in the face of mount namespaces so that
would need looking into to. But it would be a genuinely useful change, I
think.

> 
> I think I will leave this complexity to you should you think the userns filtered
> watch is something worth the effort.

Fair enough!

Thanks!
Christian
Amir Goldstein March 24, 2021, 3:05 p.m. UTC | #22
On Wed, Mar 24, 2021 at 4:32 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote:
> > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > > > inside userns and works fine, with two wrinkles I needed to iron:
> > > >
> > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> > > >     zero f_fsid (easy to fix)
> > > > 2. open_by_handle_at() is not userns aware (can relax for
> > > >     FS_USERNS_MOUNT fs)
> > > >
> > > > Pushed these two fixes to branch fanotify_userns.
> > >
> > > Pushed another fix to mnt refcount bug in WIP and another commit to
> > > add the last piece that could make fanotify usable for systemd-homed
> > > setup - a filesystem watch filtered by mnt_userns (not tested yet).
> > >
> >
> > Now I used mount-idmapped (from xfstest) to test that last piece.
> > Found a minor bug and pushed a fix.
> >
> > It is working as expected, that is filtering only the events generated via
> > the idmapped mount. However, because the listener I tested is capable in
> > the mapped userns and not in the sb userns, the listener cannot
> > open_ny_handle_at(), so the result is not as useful as one might hope.
>
> This is another dumb question probably but in general, are you saying
> that someone watching a mount or directory and does _not_ want file
> descriptors from fanotify to be returned has no other way of getting to
> the path they want to open other than by using open_by_handle_at()?
>

Well there is another way.
It is demonstrated in my demo with intoifywatch --fanotify --recursive.
It involved userspace iterating a subtree of interest to create fid->path
map.

The fanotify recursive watch is similar but not exactly the same as the
old intoify recursive watch, because with inotify recursive watch you
can miss events.

With fanotify recursive watch, the listener (if capable) can setup a
filesystem mark so events will not be missed. They will be recorded
by fid with an unknown path and the path information can be found later
by the crawler and updated in the map before the final report.

Events on fid that were not found by the crawler need not be reported.
That's essentially a subtree watch for the poor implemented in userspace.

I did not implement the combination --fanotify --global --recursive in my
demo, because it did not make sense with the current permission model
(listener that can setup a fs mark can always resolve fids to path), but it
would be quite trivial to add.


> >
> > I guess we will also need to make open_by_handle_at() idmapped aware
> > and use a variant of vfs_dentry_acceptable() that validates that the opened
> > path is legitimately accessible via the idmapped mount.
>
> So as a first step, I think there's a legitimate case to be made for
> open_by_handle_at() to be made useable inside user namespaces. That's a
> change worth to be made independent of fanotify. For example, nowadays
> cgroups have a 64 bit identifier that can be used with open_by_handle_at
> to map a cgrp id to a path and back:
> https://lkml.org/lkml/2020/12/2/1126
> Right now this can't be used in user namespaces because of this
> restriction but it is genuinely useful to have this feature available
> since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping
> is very convenient.

FS_USERNS_MOUNT is a simple case and I think it is safe.
There is already a patch for that on my fanotify_userns branch.

> Without looking at the code I'm not super sure how name_to_handle_at()
> and open_by_handle_at() behave in the face of mount namespaces so that
> would need looking into to. But it would be a genuinely useful change, I
> think.
>

name_to_handle_at()/open_by_handle_at() should be indifferent to mount ns,
because the former returns mount_id which is globally unique (I think)
and the latter takes a mount_fd argument, which means the caller needs to
prove access to the mount prior to getting an open fd in that mount.

Thanks,
Amir.
Amir Goldstein March 24, 2021, 3:50 p.m. UTC | #23
> > So I have implemented this idea on fanotify_userns branch and the cost
> > per "filtered" sb mark is quite low - its a pretty cheap check in
> > send_to_group()
> > But still, if an unbound number of users can add to the sb mark list it is
> > not going to end well.
>
> Thinking out loud: So what is the cost going to be for the side generating
> events? Ideally it would of O(number of fanotify groups receiving event).
> We cannot get better than that and if the constants aren't too big I think
> this is acceptable overhead. Sure this can mean total work of (number of
> users) * (max number of subtree marks per user) for queueing notification
> event but I don't think it is practical for a DoS attack and I also don't
> think that in practice users will be watching overlapping subtrees that
> much.
>

Why overlapping?
My concern is not so much from DoS attacks.
My concern is more from innocent users adding unacceptable
accumulated overhead.

Think of a filesystem mounted at /home/ with 100K directories at
/home/user$N, every user gets its own idmapped mount from
systemd-homed and may or may not choose to run a listener to
get events generated under its own home dir (which is an idmapped
mount). Even if we limit one sb mask per user, we can still have 100K
marks list in sb.

For this reason I think we need to limit the number of marks per sb.
The simple way is a global config like max_queued_events, but I think
we can do better than that.

> The question is whether we can get that fast. Probably not because that
> would have to attach subtree watches to directory inodes or otherwise
> filter out unrelated fanotify groups in O(1). But the complexity of
> O(number of groups receiving events + depth of dir where event is happening)
> is probably achievable - we'd walk the tree up and have roots of watched
> subtrees marked. What do you think?
>

I am for that. I already posted a POC along those lines [1].
I was just not sure how to limit the potential accumulated overhead.

[1] https://github.com/amir73il/linux/commits/fanotify_subtree_mark

> Also there is a somewhat related question what is the semantics of subtree
> watches in presence of mounts - do subtree watches "see through" mount
> points? Probably not but then with bind mounts this can be sometimes
> inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and
> I'm watching subtree of /var, I would not get events for what's in
> /var/tmp... Which is logical if you spell it out like this but applications
> often don't care how the mount hierarchy looks like, they just care about
> locally visible directory structure.

Those are hard questions.
I think that userns/mountns developers needed to address them a while ago
and I think there are some helpers that help with checking visibility of paths.

>
> > <hand waving>
> > I think what we need here (thinking out loud) is to account the sb marks
> > to the user that mounted the filesystem or to the user mapped to admin using
> > idmapped mount, maybe to both(?), probably using a separate ucount entry
> > (e.g. max_fanotify_filesystem_marks).
>
> I'm somewhat lost here. Are these two users different? We have /home/foo
> which is a mounted filesystem. AFAIU it will be mounted in a special user
> namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping
> attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now
> we have a user - let's call it 'foo-usr' that has enough capabilities
> (whatever they are) in 'foo-ns' to place fanotify subtree marks in
> /home/foo. So these marks are naturally accounted towards 'foo-usr'. To
> whom else you'd like to also account these marks and why?
>

I would like the system admin to be able to limit 100 sb marks on /home
(filtered or not) because that impacts the send_to_group iteration.
I would also like systemd to be able to grant a smaller quota of filtered
sb marks per user when creating and mapping the idmapped mounts
at /home/foo$N

I *think* we can achieve that, by accounting the sb marks to uid 0
(who mounted /home) in ucounts entry "fanotify_sb_marks".
If /home would have been a FS_USERNS_MOUNT mounted inside
some userns, then all its sb marks would be accounted to uid 0 of
that userns.

I have no ideas if this all adds up.
My head explodes even from trying to express these rules :-/

Thanks,
Amir.
Christian Brauner March 24, 2021, 4:28 p.m. UTC | #24
On Wed, Mar 24, 2021 at 05:05:45PM +0200, Amir Goldstein wrote:
> On Wed, Mar 24, 2021 at 4:32 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote:
> > > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted
> > > > > inside userns and works fine, with two wrinkles I needed to iron:
> > > > >
> > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has
> > > > >     zero f_fsid (easy to fix)
> > > > > 2. open_by_handle_at() is not userns aware (can relax for
> > > > >     FS_USERNS_MOUNT fs)
> > > > >
> > > > > Pushed these two fixes to branch fanotify_userns.
> > > >
> > > > Pushed another fix to mnt refcount bug in WIP and another commit to
> > > > add the last piece that could make fanotify usable for systemd-homed
> > > > setup - a filesystem watch filtered by mnt_userns (not tested yet).
> > > >
> > >
> > > Now I used mount-idmapped (from xfstest) to test that last piece.
> > > Found a minor bug and pushed a fix.
> > >
> > > It is working as expected, that is filtering only the events generated via
> > > the idmapped mount. However, because the listener I tested is capable in
> > > the mapped userns and not in the sb userns, the listener cannot
> > > open_ny_handle_at(), so the result is not as useful as one might hope.
> >
> > This is another dumb question probably but in general, are you saying
> > that someone watching a mount or directory and does _not_ want file
> > descriptors from fanotify to be returned has no other way of getting to
> > the path they want to open other than by using open_by_handle_at()?
> >
> 
> Well there is another way.
> It is demonstrated in my demo with intoifywatch --fanotify --recursive.
> It involved userspace iterating a subtree of interest to create fid->path
> map.

Ok, so this seems to be

inotifytools_filename_from_fid()
-> if (fanotify_mark_type != FAN_MARK_FILESYSTEM)
           watch_from_fid()
   -> read_path_from(/proc/self/fd/w->dirfd)

> 
> The fanotify recursive watch is similar but not exactly the same as the
> old intoify recursive watch, because with inotify recursive watch you
> can miss events.
> 
> With fanotify recursive watch, the listener (if capable) can setup a
> filesystem mark so events will not be missed. They will be recorded
> by fid with an unknown path and the path information can be found later
> by the crawler and updated in the map before the final report.
> 
> Events on fid that were not found by the crawler need not be reported.
> That's essentially a subtree watch for the poor implemented in userspace.

This is already a good improvement.
Honestly, having FAN_MARK_INODE workable unprivileged is already pretty
great. In addition having FAN_MARK_MOUNT workable with idmapped mounts
will likely get us what most users care about, afaict that is the POC
in:
https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1

(I'm reading the source correctly that FAN_MARK_MOUNT works with
FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is
set? I'm asking because my manpage - probably too old - seems to imply
that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might
just be stumbling over the phrasing.)

I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s
capable requirement. That's imho the cleanest semantics for this, i.e.
I'd drop:
https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763
This is neither an urgent use-case nor am I feeling very comfortable
with it.

> 
> I did not implement the combination --fanotify --global --recursive in my
> demo, because it did not make sense with the current permission model
> (listener that can setup a fs mark can always resolve fids to path), but it
> would be quite trivial to add.
> 
> 
> > >
> > > I guess we will also need to make open_by_handle_at() idmapped aware
> > > and use a variant of vfs_dentry_acceptable() that validates that the opened
> > > path is legitimately accessible via the idmapped mount.
> >
> > So as a first step, I think there's a legitimate case to be made for
> > open_by_handle_at() to be made useable inside user namespaces. That's a
> > change worth to be made independent of fanotify. For example, nowadays
> > cgroups have a 64 bit identifier that can be used with open_by_handle_at
> > to map a cgrp id to a path and back:
> > https://lkml.org/lkml/2020/12/2/1126
> > Right now this can't be used in user namespaces because of this
> > restriction but it is genuinely useful to have this feature available
> > since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping
> > is very convenient.
> 
> FS_USERNS_MOUNT is a simple case and I think it is safe.
> There is already a patch for that on my fanotify_userns branch.

Great!
Christian
Amir Goldstein March 24, 2021, 5:07 p.m. UTC | #25
> > Well there is another way.
> > It is demonstrated in my demo with intoifywatch --fanotify --recursive.
> > It involved userspace iterating a subtree of interest to create fid->path
> > map.
>
> Ok, so this seems to be
>
> inotifytools_filename_from_fid()
> -> if (fanotify_mark_type != FAN_MARK_FILESYSTEM)
>            watch_from_fid()
>    -> read_path_from(/proc/self/fd/w->dirfd)
>

Yes.

> >
> > The fanotify recursive watch is similar but not exactly the same as the
> > old intoify recursive watch, because with inotify recursive watch you
> > can miss events.
> >
> > With fanotify recursive watch, the listener (if capable) can setup a
> > filesystem mark so events will not be missed. They will be recorded
> > by fid with an unknown path and the path information can be found later
> > by the crawler and updated in the map before the final report.
> >
> > Events on fid that were not found by the crawler need not be reported.
> > That's essentially a subtree watch for the poor implemented in userspace.
>
> This is already a good improvement.
> Honestly, having FAN_MARK_INODE workable unprivileged is already pretty

I'm not so sure why you say that, because unprivileged FAN_MARK_INODE
watches are pretty close in functionality to inotify.
There are only subtle differences.

> great. In addition having FAN_MARK_MOUNT workable with idmapped mounts
> will likely get us what most users care about, afaict that is the POC
> in:
> https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1
>

Hmm, the problem is the limited set of events you can get from
FAN_MARK_MOUNT which does not include FAN_CREATE.

> (I'm reading the source correctly that FAN_MARK_MOUNT works with
> FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is
> set? I'm asking because my manpage - probably too old - seems to imply
> that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might
> just be stumbling over the phrasing.)
>

commit d71c9b4a5c6fbc7164007b52dba1de410d018292
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Mon Apr 20 21:42:56 2020 +0300

    fanotify_mark.2: Clarification about FAN_MARK_MOUNT and FAN_REPORT_FID

    It is not true that FAN_MARK_MOUNT cannot be used with a group
    that was initialized with flag FAN_REPORT_FID.
 ...

IOW, no FAN_CREATE, FAN_DELETE, FAN_MOVE

The technical reason for that is Al's objection to pass the mnt context
into vfs helpers (and then fsnotify hooks).

> I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s
> capable requirement. That's imho the cleanest semantics for this, i.e.
> I'd drop:
> https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763
> This is neither an urgent use-case nor am I feeling very comfortable
> with it.
>

The purpose of this commit is to provide FAN_CREATE, FAN_DELETE
FAN_MOVE events filtered by (an idmapped) mount.
I don't like it so much myself, but I have not had any better idea how to
achieve that goal so far.

There is another way though.
We can create a new set of hooks outside the vfs helpers that do have
the mnt context.

I have already created such a set for another POC [1].
In this POC I introduced three new events FS_MODIFY_INTENT,
FS_NAME_INTENT, FS_MOVE_INTENT, which I had no plans of
exposing to fanotify. Nor did I need the granularity of CREATE,
DELETE, RENAME event types (all collapsed into NAME_INTENT).

But if we hit a dead end, we can resort to this strategy.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify
Christian Brauner March 25, 2021, 11:12 a.m. UTC | #26
On Wed, Mar 24, 2021 at 07:07:17PM +0200, Amir Goldstein wrote:
> > > Well there is another way.
> > > It is demonstrated in my demo with intoifywatch --fanotify --recursive.
> > > It involved userspace iterating a subtree of interest to create fid->path
> > > map.
> >
> > Ok, so this seems to be
> >
> > inotifytools_filename_from_fid()
> > -> if (fanotify_mark_type != FAN_MARK_FILESYSTEM)
> >            watch_from_fid()
> >    -> read_path_from(/proc/self/fd/w->dirfd)
> >
> 
> Yes.
> 
> > >
> > > The fanotify recursive watch is similar but not exactly the same as the
> > > old intoify recursive watch, because with inotify recursive watch you
> > > can miss events.
> > >
> > > With fanotify recursive watch, the listener (if capable) can setup a
> > > filesystem mark so events will not be missed. They will be recorded
> > > by fid with an unknown path and the path information can be found later
> > > by the crawler and updated in the map before the final report.
> > >
> > > Events on fid that were not found by the crawler need not be reported.
> > > That's essentially a subtree watch for the poor implemented in userspace.
> >
> > This is already a good improvement.
> > Honestly, having FAN_MARK_INODE workable unprivileged is already pretty
> 
> I'm not so sure why you say that, because unprivileged FAN_MARK_INODE
> watches are pretty close in functionality to inotify.
> There are only subtle differences.

Simply because until now we couldn't use fanotify in any way because of
the capable() restriction.

> 
> > great. In addition having FAN_MARK_MOUNT workable with idmapped mounts
> > will likely get us what most users care about, afaict that is the POC
> > in:
> > https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1
> >
> 
> Hmm, the problem is the limited set of events you can get from
> FAN_MARK_MOUNT which does not include FAN_CREATE.

Yes, that's what I gathered from perusing the code.

> 
> > (I'm reading the source correctly that FAN_MARK_MOUNT works with
> > FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is
> > set? I'm asking because my manpage - probably too old - seems to imply
> > that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might
> > just be stumbling over the phrasing.)
> >
> 
> commit d71c9b4a5c6fbc7164007b52dba1de410d018292
> Author: Amir Goldstein <amir73il@gmail.com>
> Date:   Mon Apr 20 21:42:56 2020 +0300
> 
>     fanotify_mark.2: Clarification about FAN_MARK_MOUNT and FAN_REPORT_FID
> 
>     It is not true that FAN_MARK_MOUNT cannot be used with a group
>     that was initialized with flag FAN_REPORT_FID.

Btw, I don't see FAN_MARK_INODE in the man2 in the upstream repository.
I know it's essentially the 0 or default value but it would still be
worthwhile to mention it in the manpage.

>  ...
> 
> IOW, no FAN_CREATE, FAN_DELETE, FAN_MOVE
> 
> The technical reason for that is Al's objection to pass the mnt context
> into vfs helpers (and then fsnotify hooks).

Ah yes, I remember that.

> 
> > I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s
> > capable requirement. That's imho the cleanest semantics for this, i.e.
> > I'd drop:
> > https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763
> > This is neither an urgent use-case nor am I feeling very comfortable
> > with it.
> >
> 
> The purpose of this commit is to provide FAN_CREATE, FAN_DELETE
> FAN_MOVE events filtered by (an idmapped) mount.

I see.

I wanted to write a few words about the use-case you mention and the
need for subtree watches for this particular case:

"A common use case for of a filesystem subtree is a bind mount, not on
the root of the filesystem, such as the bind mounts used for
containers."

Which presumably means you want to point fanotify to the rootfs mount of
the container and have it watch everything under that rootfs including
any submounts in there. While this certainly might be useful I'm not
sure it's a very common use-case or really necessary to support.

Assuming a full system-like container setup like runC, systemd-nspawn,
LXD, and similar runtimes do it we end up with a few additonal mounts.
They are usually performed in the containers mount+user namespace.
The standard ones, i.e. the ones most container runtimes setup are:
- sysfs
- cgroupfs
- procfs
- mqueue
- devpts
- tmpfs on /dev (as a substitute for devtmpfs not being namespaced)
  - tmpfs on /dev/shm
  - bind mounts of a few host device files into /dev:
    - /dev/fuse
    - /dev/net/tun
    - /dev/full
    - /dev/null
    - /dev/random
    - /dev/tty
    - /dev/urandom
    - /dev/zero
    - /dev/console
I think most of these mounts aren't very interesting to monitor. It's in
general not very common to monitor full pseudo fileystems such as proc,
sysfs, or devpts afaik.
Notably, systemd - both outside and inside containers - will use some
inotify watches but it's always on specific directories and never across
mount borders.

In addition to the default mounts above - I've mentioned this before -
the container manager might choose to inject mounts into a running
container (to share data or whatever). There are different strategies on
how to do this.
I1. The first strategy is to inject the mount into the container in such
    a way that the container has full control over it, i.e. it can
    unmount and remount (with the restriction that some flags might
    become locked when moving the mount across user namespaces).
I2. The second strategy is to inject the mount in such a way that the
    container doesn't have control over it, i.e. the container can't
    umount or remount.

Most container runtimes will support I1. but systemd-nspawn uses I2.
There might be use-cases where the container manager would like to watch
those mounts too. But then the container manager will just call
fanotify_mark() and add that mount to the list of watched mounts when
injecting it.
I get that there are other use-cases that make subtree watches very
interesting but I don't think the container use-case is a particularly
pressing one.

> I don't like it so much myself, but I have not had any better idea how to
> achieve that goal so far.

The limitations of FAN_MARK_MOUNT as I now understand them are indeed
unpleasant. If we could get FAN_MARK_MOUNT with the same event support
as FAN_MARK_INODE that would be great.
I think the delegation model that makes sense to me is to allow
FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of
course ns_capable() in the userns they called fanotify_init() in. That
feels ok and supportable.
But I don't think anything beyond that like the sb filter patch that you
showed is the right approach.

Christian
Jan Kara March 25, 2021, 1:49 p.m. UTC | #27
On Wed 24-03-21 17:50:52, Amir Goldstein wrote:
> > > So I have implemented this idea on fanotify_userns branch and the cost
> > > per "filtered" sb mark is quite low - its a pretty cheap check in
> > > send_to_group()
> > > But still, if an unbound number of users can add to the sb mark list it is
> > > not going to end well.
> >
> > Thinking out loud: So what is the cost going to be for the side generating
> > events? Ideally it would of O(number of fanotify groups receiving event).
> > We cannot get better than that and if the constants aren't too big I think
> > this is acceptable overhead. Sure this can mean total work of (number of
> > users) * (max number of subtree marks per user) for queueing notification
> > event but I don't think it is practical for a DoS attack and I also don't
> > think that in practice users will be watching overlapping subtrees that
> > much.
> >
> 
> Why overlapping?
> My concern is not so much from DoS attacks.
> My concern is more from innocent users adding unacceptable
> accumulated overhead.
> 
> Think of a filesystem mounted at /home/ with 100K directories at
> /home/user$N, every user gets its own idmapped mount from
> systemd-homed and may or may not choose to run a listener to
> get events generated under its own home dir (which is an idmapped
> mount). Even if we limit one sb mask per user, we can still have 100K
> marks list in sb.

I see but then you'd have to have 100K users using the same filesystem on
the server at the same time? Which doesn't look likely to me? I'd presume
the home dir is watched only if the user is actually running something on
that machine... So I'm not sure how realistic this example is. But yes,
maybe we need some more efficient algorithm for selecting which subtree
watch is actually relevant for an event.

> For this reason I think we need to limit the number of marks per sb.
> The simple way is a global config like max_queued_events, but I think
> we can do better than that.

Adding a global limit on number of sb marks is OK but still I'd like the
system to scale reasonably with say tens to hundreds watches...

> > The question is whether we can get that fast. Probably not because that
> > would have to attach subtree watches to directory inodes or otherwise
> > filter out unrelated fanotify groups in O(1). But the complexity of
> > O(number of groups receiving events + depth of dir where event is happening)
> > is probably achievable - we'd walk the tree up and have roots of watched
> > subtrees marked. What do you think?
> 
> I am for that. I already posted a POC along those lines [1].
> I was just not sure how to limit the potential accumulated overhead.
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_subtree_mark

Yes, but AFAICT your solution was O((number of subtree marks on sb) * depth
of dir) while I'm speaking about O(number of *groups* on sb + depth of
dir). Which is significantly less and will require more careful setup to
achieve such complexity (like placing special marks in lists of watches for
directories that are roots of watched subtrees and checking such lists when
walking up the tree).

> > Also there is a somewhat related question what is the semantics of subtree
> > watches in presence of mounts - do subtree watches "see through" mount
> > points? Probably not but then with bind mounts this can be sometimes
> > inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and
> > I'm watching subtree of /var, I would not get events for what's in
> > /var/tmp... Which is logical if you spell it out like this but applications
> > often don't care how the mount hierarchy looks like, they just care about
> > locally visible directory structure.
> 
> Those are hard questions.  I think that userns/mountns developers needed
> to address them a while ago and I think there are some helpers that help
> with checking visibility of paths.
> 
> > > <hand waving>
> > > I think what we need here (thinking out loud) is to account the sb marks
> > > to the user that mounted the filesystem or to the user mapped to admin using
> > > idmapped mount, maybe to both(?), probably using a separate ucount entry
> > > (e.g. max_fanotify_filesystem_marks).
> >
> > I'm somewhat lost here. Are these two users different? We have /home/foo
> > which is a mounted filesystem. AFAIU it will be mounted in a special user
> > namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping
> > attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now
> > we have a user - let's call it 'foo-usr' that has enough capabilities
> > (whatever they are) in 'foo-ns' to place fanotify subtree marks in
> > /home/foo. So these marks are naturally accounted towards 'foo-usr'. To
> > whom else you'd like to also account these marks and why?
> >
> 
> I would like the system admin to be able to limit 100 sb marks on /home
> (filtered or not) because that impacts the send_to_group iteration.

OK, so per-sb limitation of sb mark number...

> I would also like systemd to be able to grant a smaller quota of filtered
> sb marks per user when creating and mapping the idmapped mounts
> at /home/foo$N

... and a ucount to go with it?

> I *think* we can achieve that, by accounting the sb marks to uid 0
> (who mounted /home) in ucounts entry "fanotify_sb_marks".

But a superblock can be mounted in multiple places, in multiple user
namespaces, potentially by different users (think of nested containers)? So
if we want a per-sb limit on sb marks, I think that accounting those per
user won't really achieve that?

> If /home would have been a FS_USERNS_MOUNT mounted inside
> some userns, then all its sb marks would be accounted to uid 0 of
> that userns.

								Honza
Amir Goldstein March 25, 2021, 3:05 p.m. UTC | #28
> > I would like the system admin to be able to limit 100 sb marks on /home
> > (filtered or not) because that impacts the send_to_group iteration.
>
> OK, so per-sb limitation of sb mark number...
>
> > I would also like systemd to be able to grant a smaller quota of filtered
> > sb marks per user when creating and mapping the idmapped mounts
> > at /home/foo$N
>
> ... and a ucount to go with it?
>
> > I *think* we can achieve that, by accounting the sb marks to uid 0
> > (who mounted /home) in ucounts entry "fanotify_sb_marks".
>
> But a superblock can be mounted in multiple places, in multiple user
> namespaces, potentially by different users (think of nested containers)? So
> if we want a per-sb limit on sb marks, I think that accounting those per
> user won't really achieve that?
>

I agree. It won't.
We can start with the global max_fanotify_sb_marks.
I do not have an idea how to make that workable using ucounts.

Thanks,
Amir.
Amir Goldstein March 25, 2021, 3:31 p.m. UTC | #29
> I get that there are other use-cases that make subtree watches very
> interesting but I don't think the container use-case is a particularly
> pressing one.
>

That's what I thought.

Containers are usually "contained" by a mount and possibly by userns,
so it makes more sense and it would be more efficient to filter by those
contexts.

> > I don't like it so much myself, but I have not had any better idea how to
> > achieve that goal so far.
>
> The limitations of FAN_MARK_MOUNT as I now understand them are indeed
> unpleasant. If we could get FAN_MARK_MOUNT with the same event support
> as FAN_MARK_INODE that would be great.
> I think the delegation model that makes sense to me is to allow
> FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of
> course ns_capable() in the userns they called fanotify_init() in. That
> feels ok and supportable.

I present to you a demo [1][2] of FAN_MARK_MOUNT on idmapped mount that:

1. Can subscribe and receive FAN_LINK (new) events
2. Is capable of open_by_handle() if fid is under mount root

FAN_LINK (temp name) is an event that I wanted to add anyway [3] and
AFAIK it's the only event that you really need in order to detect when a dir
was created for the use case of injecting a bind mount into a container.

The kernel branch [1] intentionally excludes the controversial patch that
added support for userns filtered sb marks.

Therefore, trying to run the demo script as is on an idmapped mount
inside userns will auto-detect UID 0, try to setup an sb mark and fail.

Instead, the demo script should be run as follows to combine a
mount mark and recursive inode marks:

./test_demo.sh <idmapped-mount-path> 1

For example:
~# ./test_demo.sh /vdf 1
+ WD=/vdf
+ ID=1
...
+ inotifywatch --fanotify --recursive -w -e link --timeout -2 /vdf
Establishing watches...
...
+ mkdir -p a/dir0 a/dir1 a/dir2/subdir2
+ touch a/dir2/file2
...
[fid=ad91a2b8.81a99d43.3000081;name='dir2'] /vdf/a/dir2
[fid=ad91a2b8.81a99d43.8a;name='.'] /vdf/a/dir2/.
[fid=ad91a2b8.81a99d43.10000a6;name='.'] /vdf/a/dir2/subdir2/.
[fid=ad91a2b8.81a99d43.8a;name='file2'] /vdf/a/dir2/file2
...
total  modify  ..................................  create  link
delete  filename
1      0       0       0       0       0        0       1       0
0       /vdf/a/dir2
1      0       0       0       0       0        0       0       1
0       /vdf/a/dir2/.
1      0       0       0       0       0        0       0       1
0       /vdf/a/dir2/subdir2/.
1      0       0       0       0       0        0       0       1
0       /vdf/a/dir2/file2

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_link
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_link
[3] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhEsbfA5+sW4XPnUKgCkXtwoDA-BR3iRO34Nx5c4y7Nug@mail.gmail.com/
Amir Goldstein March 28, 2021, 2:58 p.m. UTC | #30
> > The limitations of FAN_MARK_MOUNT as I now understand them are indeed
> > unpleasant. If we could get FAN_MARK_MOUNT with the same event support
> > as FAN_MARK_INODE that would be great.
> > I think the delegation model that makes sense to me is to allow
> > FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of
> > course ns_capable() in the userns they called fanotify_init() in. That
> > feels ok and supportable.
>
> I present to you a demo [1][2] of FAN_MARK_MOUNT on idmapped mount that:
>
> 1. Can subscribe and receive FAN_LINK (new) events
> 2. Is capable of open_by_handle() if fid is under mount root
>
> FAN_LINK (temp name) is an event that I wanted to add anyway [3] and
> AFAIK it's the only event that you really need in order to detect when a dir
> was created for the use case of injecting a bind mount into a container.

Scratch that part about the new event.
I found a way to make FAN_CREATE available for FAN_MARK_MOUNT.
Will post an RFC patch.
Same demo instructions. Different branches [1][2]:

>
> The kernel branch [1] intentionally excludes the controversial patch that
> added support for userns filtered sb marks.
>
> Therefore, trying to run the demo script as is on an idmapped mount
> inside userns will auto-detect UID 0, try to setup an sb mark and fail.
>
> Instead, the demo script should be run as follows to combine a
> mount mark and recursive inode marks:
>
> ./test_demo.sh <idmapped-mount-path> 1
>
> For example:

~# ./test_demo.sh /vdf 1
+ WD=/vdf
+ ID=1
...
+ inotifywatch --fanotify --recursive -w --timeout -2 /vdf
Establishing watches...
...
+ mkdir -p a/dir0 a/dir1 a/dir2/A/B/C/
+ touch a/dir2/A/B/C/file2
...
[fid=94847cf7.d74a50ab.30000c2;name='dir2'] /mnt/a/dir2
Adding recursive watches on directory '/mnt/a/dir2/'...
[fid=94847cf7.d74a50ab.87;name='A'] /mnt/a/dir2/A
Adding recursive watches on directory '/mnt/a/dir2/A/'...
[fid=94847cf7.d74a50ab.1000087;name='B'] /mnt/a/dir2/A/B
Adding recursive watches on directory '/mnt/a/dir2/A/B/'...
[fid=94847cf7.d74a50ab.20073e5;name='C'] /mnt/a/dir2/A/B/C
Adding recursive watches on directory '/mnt/a/dir2/A/B/C/'...
[fid=94847cf7.d74a50ab.30000c9;name='file2'] /mnt/a/dir2/A/B/C/file2

Hope that helps.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_userns
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns