Message ID | 20210304112921.3996419-1-amir73il@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | unprivileged fanotify listener | expand |
On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > Jan, > > These patches try to implement a minimal set and least controversial > functionality that we can allow for unprivileged users as a starting > point. > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > patches using the unprivileged listener LTP tests written by Matthew > and another LTP tests I wrote to test the sysfs tunable limits [1]. Thanks. I've added both patches to my tree. Honza > > Thanks, > Amir. > > Changes since v1: > - Dropped marks per group limit in favor of max per user > - Rename sysfs files from 'listener' to 'group' terminology > - Dropped internal group flag FANOTIFY_UNPRIV > - Limit unprivileged listener to FAN_REPORT_FID family > - Report event->pid depending on reader capabilities > > [1] https://github.com/amir73il/ltp/commits/fanotify_unpriv > > Amir Goldstein (2): > fanotify: configurable limits via sysfs > fanotify: support limited functionality for unprivileged users > > fs/notify/fanotify/fanotify.c | 16 ++- > fs/notify/fanotify/fanotify_user.c | 152 ++++++++++++++++++++++++----- > fs/notify/fdinfo.c | 3 +- > fs/notify/group.c | 1 - > fs/notify/mark.c | 4 - > include/linux/fanotify.h | 36 ++++++- > include/linux/fsnotify_backend.h | 6 +- > include/linux/sched/user.h | 3 - > include/linux/user_namespace.h | 4 + > kernel/sysctl.c | 12 ++- > kernel/ucount.c | 4 + > 11 files changed, 194 insertions(+), 47 deletions(-) > > -- > 2.30.0 >
On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > Jan, > > > > These patches try to implement a minimal set and least controversial > > functionality that we can allow for unprivileged users as a starting > > point. > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > patches using the unprivileged listener LTP tests written by Matthew > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > Thanks. I've added both patches to my tree. Great! I'll go post the LTP tests and work on the man page updates. BTW, I noticed that you pushed the aggregating for_next branch, but not the fsnotify topic branch. Is this intentional? I am asking because I am usually basing my development branches off of your fsnotify branch, but I can base them on the unpushed branch. Heads up. I am playing with extra privileges we may be able to allow an ns_capable user. For example, watching a FS_USERNS_MOUNT filesystem that the user itself has mounted inside userns. Another feature I am investigating is how to utilize the new idmapped mounts to get a subtree watch functionality. This requires attaching a userns to the group on fanotify_init(). <hand waving> If the group's userns are the same or below the idmapped mount userns, then all the objects accessed via that idmapped mount are accessible to the group's userns admin. We can use that fact to filter events very early based on their mnt_userns and the group's userns, which should be cheaper than any subtree permission checks. <\hand waving> Thanks, Amir.
On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > Jan, > > > > > > These patches try to implement a minimal set and least controversial > > > functionality that we can allow for unprivileged users as a starting > > > point. > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > patches using the unprivileged listener LTP tests written by Matthew > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > Thanks. I've added both patches to my tree. > > Great! > I'll go post the LTP tests and work on the man page updates. > > BTW, I noticed that you pushed the aggregating for_next branch, > but not the fsnotify topic branch. > > Is this intentional? Not really, pushed now. Thanks for reminder. > I am asking because I am usually basing my development branches > off of your fsnotify branch, but I can base them on the unpushed branch. > > Heads up. I am playing with extra privileges we may be able to > allow an ns_capable user. > For example, watching a FS_USERNS_MOUNT filesystem that the user > itself has mounted inside userns. > > Another feature I am investigating is how to utilize the new idmapped > mounts to get a subtree watch functionality. This requires attaching a > userns to the group on fanotify_init(). > > <hand waving> > If the group's userns are the same or below the idmapped mount userns, > then all the objects accessed via that idmapped mount are accessible > to the group's userns admin. We can use that fact to filter events very > early based on their mnt_userns and the group's userns, which should be > cheaper than any subtree permission checks. > <\hand waving> Yeah, I agree this should work. Just it seems to me the userbase for this functionality will be (at least currently) rather limited. While full subtree watches would be IMO interesting to much more users. Honza
On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > > Jan, > > > > > > > > These patches try to implement a minimal set and least controversial > > > > functionality that we can allow for unprivileged users as a starting > > > > point. > > > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > > patches using the unprivileged listener LTP tests written by Matthew > > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > > > Thanks. I've added both patches to my tree. > > > > Great! > > I'll go post the LTP tests and work on the man page updates. > > > > BTW, I noticed that you pushed the aggregating for_next branch, > > but not the fsnotify topic branch. > > > > Is this intentional? > > Not really, pushed now. Thanks for reminder. > > > I am asking because I am usually basing my development branches > > off of your fsnotify branch, but I can base them on the unpushed branch. > > > > Heads up. I am playing with extra privileges we may be able to > > allow an ns_capable user. > > For example, watching a FS_USERNS_MOUNT filesystem that the user > > itself has mounted inside userns. > > > > Another feature I am investigating is how to utilize the new idmapped > > mounts to get a subtree watch functionality. This requires attaching a > > userns to the group on fanotify_init(). > > > > <hand waving> > > If the group's userns are the same or below the idmapped mount userns, > > then all the objects accessed via that idmapped mount are accessible > > to the group's userns admin. We can use that fact to filter events very > > early based on their mnt_userns and the group's userns, which should be > > cheaper than any subtree permission checks. > > <\hand waving> > > Yeah, I agree this should work. Just it seems to me the userbase for this > functionality will be (at least currently) rather limited. While full That may change when systemd home dirs feature starts to use idmapped mounts. Being able to watch the user's entire home directory is a big win already. > subtree watches would be IMO interesting to much more users. Agreed. I was looking into that as well, using the example of nfsd_acceptable() to implement the subtree permission check. The problem here is that even if unprivileged users cannot compromise security, they can still cause significant CPU overhead either queueing events or filtering events and that is something I haven't been able to figure out a way to escape from. BUT, if you allow userns admin to setup subtree watches (a.k.a filtered filesystem marks) on a userns filesystem/idmapped mount, now users can watch subtrees in their home directories and other processes will pay CPU penalty only for file access in the users home directories. That might be acceptable. Thanks, Amir.
On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote: > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > > > Jan, > > > > > > > > > > These patches try to implement a minimal set and least controversial > > > > > functionality that we can allow for unprivileged users as a starting > > > > > point. > > > > > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > > > patches using the unprivileged listener LTP tests written by Matthew > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > > > > > Thanks. I've added both patches to my tree. > > > > > > Great! > > > I'll go post the LTP tests and work on the man page updates. > > > > > > BTW, I noticed that you pushed the aggregating for_next branch, > > > but not the fsnotify topic branch. > > > > > > Is this intentional? > > > > Not really, pushed now. Thanks for reminder. > > > > > I am asking because I am usually basing my development branches > > > off of your fsnotify branch, but I can base them on the unpushed branch. > > > > > > Heads up. I am playing with extra privileges we may be able to > > > allow an ns_capable user. > > > For example, watching a FS_USERNS_MOUNT filesystem that the user > > > itself has mounted inside userns. > > > > > > Another feature I am investigating is how to utilize the new idmapped > > > mounts to get a subtree watch functionality. This requires attaching a > > > userns to the group on fanotify_init(). > > > > > > <hand waving> > > > If the group's userns are the same or below the idmapped mount userns, > > > then all the objects accessed via that idmapped mount are accessible > > > to the group's userns admin. We can use that fact to filter events very > > > early based on their mnt_userns and the group's userns, which should be > > > cheaper than any subtree permission checks. > > > <\hand waving> > > > > Yeah, I agree this should work. Just it seems to me the userbase for this > > functionality will be (at least currently) rather limited. While full > > That may change when systemd home dirs feature starts to use > idmapped mounts. > Being able to watch the user's entire home directory is a big win > already. Hey Amir, Hey Jan, I think so too. > > > subtree watches would be IMO interesting to much more users. > > Agreed. We have a use-case for subtree watches: One feature for containers we have is that users can e.g. tell us that they want the container manager to hotplug an arbitrary unix or block device into the container whenever the relevant device shows up on the system. For example they could instruct the container manager to plugin some new driver device when it shows up in /dev. That works nicely because of uevents. But users quite often also instruct us to plugin a path once it shows up in some directory in the filesystem hierarchy and unplug it once it is removed. Right now we're mainting an inotify-based hand-rolled recursive watch to make this work so we detect that add and remove event. I would be wildly excited if we could get rid of some of that complexity by using subtree watches. The container manager on the host will be unaffected by this feature since it will usually have root privileges and manage unprivileged containers. The unprivileged (userns use-case specifically here) subtree watches will be necessary and really good to have to make this work for container workloads and nested containers, i.e. where the container manager itselfs runs in a container and starts new containres. Since the subtree feature would be interesting for systemd itself and since our container manager (ChromeOS etc.) runs systemd inside unprivileged containers on a large scale it would be good if subtree watches could work in userns too. > > I was looking into that as well, using the example of nfsd_acceptable() > to implement the subtree permission check. > > The problem here is that even if unprivileged users cannot compromise > security, they can still cause significant CPU overhead either queueing > events or filtering events and that is something I haven't been able to > figure out a way to escape from. > > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered > filesystem marks) on a userns filesystem/idmapped mount, now users I think that sounds reasonable. If the mount really is idmapped, it might be interesting to consider checking for privilege in the mnt_userns in addition to the regular permission checks that fanotify performs. My (equally handwavy) thinking is that this might allow for a nice feature where the creator of the mount (e.g. systemd) can block the creation of subtree watches by attaching a mnt_userns to the mnt that the user has no privilege in. (Just a thought.). Christian
On Wed, Mar 17, 2021 at 7:45 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote: > > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > > > > Jan, > > > > > > > > > > > > These patches try to implement a minimal set and least controversial > > > > > > functionality that we can allow for unprivileged users as a starting > > > > > > point. > > > > > > > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > > > > patches using the unprivileged listener LTP tests written by Matthew > > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > > > > > > > Thanks. I've added both patches to my tree. > > > > > > > > Great! > > > > I'll go post the LTP tests and work on the man page updates. > > > > > > > > BTW, I noticed that you pushed the aggregating for_next branch, > > > > but not the fsnotify topic branch. > > > > > > > > Is this intentional? > > > > > > Not really, pushed now. Thanks for reminder. > > > > > > > I am asking because I am usually basing my development branches > > > > off of your fsnotify branch, but I can base them on the unpushed branch. > > > > > > > > Heads up. I am playing with extra privileges we may be able to > > > > allow an ns_capable user. > > > > For example, watching a FS_USERNS_MOUNT filesystem that the user > > > > itself has mounted inside userns. > > > > > > > > Another feature I am investigating is how to utilize the new idmapped > > > > mounts to get a subtree watch functionality. This requires attaching a > > > > userns to the group on fanotify_init(). > > > > > > > > <hand waving> > > > > If the group's userns are the same or below the idmapped mount userns, > > > > then all the objects accessed via that idmapped mount are accessible > > > > to the group's userns admin. We can use that fact to filter events very > > > > early based on their mnt_userns and the group's userns, which should be > > > > cheaper than any subtree permission checks. > > > > <\hand waving> > > > > > > Yeah, I agree this should work. Just it seems to me the userbase for this > > > functionality will be (at least currently) rather limited. While full > > > > That may change when systemd home dirs feature starts to use > > idmapped mounts. > > Being able to watch the user's entire home directory is a big win > > already. > > Hey Amir, > Hey Jan, > > I think so too. > > > > > > subtree watches would be IMO interesting to much more users. > > > > Agreed. > > We have a use-case for subtree watches: One feature for containers we > have is that users can e.g. tell us that they want the container manager > to hotplug an arbitrary unix or block device into the container whenever > the relevant device shows up on the system. For example they could > instruct the container manager to plugin some new driver device when it > shows up in /dev. That works nicely because of uevents. But users quite > often also instruct us to plugin a path once it shows up in some > directory in the filesystem hierarchy and unplug it once it is removed. > Right now we're mainting an inotify-based hand-rolled recursive watch to > make this work so we detect that add and remove event. I would be wildly > excited if we could get rid of some of that complexity by using subtree > watches. The container manager on the host will be unaffected by this > feature since it will usually have root privileges and manage > unprivileged containers. > The unprivileged (userns use-case specifically here) subtree watches > will be necessary and really good to have to make this work for > container workloads and nested containers, i.e. where the container > manager itselfs runs in a container and starts new containres. Since the > subtree feature would be interesting for systemd itself and since our > container manager (ChromeOS etc.) runs systemd inside unprivileged > containers on a large scale it would be good if subtree watches could > work in userns too. > I don't understand the subtree watch use case. You will have to walk me through it. What exactly is the container manager trying to detect? That a subdir of a specific name/path was created/deleted? It doesn't sound like a recursive watch is needed for that. What am I missing? As for nested container managers (and systemd), my thinking is that if all the mounts that manager is watching for serving its containers are idmapped to that manager's userns (is that a viable option?), then there shouldn't be a problem to setup userns filtered watches in order to be notified on all the events that happen via those idmapped mounts and filtering by "subtree" is not needed. I am clearly far from understanding the big picture. > > > > I was looking into that as well, using the example of nfsd_acceptable() > > to implement the subtree permission check. > > > > The problem here is that even if unprivileged users cannot compromise > > security, they can still cause significant CPU overhead either queueing > > events or filtering events and that is something I haven't been able to > > figure out a way to escape from. > > > > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered > > filesystem marks) on a userns filesystem/idmapped mount, now users > > I think that sounds reasonable. > If the mount really is idmapped, it might be interesting to consider > checking for privilege in the mnt_userns in addition to the regular > permission checks that fanotify performs. My (equally handwavy) thinking > is that this might allow for a nice feature where the creator of the > mount (e.g. systemd) can block the creation of subtree watches by > attaching a mnt_userns to the mnt that the user has no privilege in. > (Just a thought.). > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup fanotify watches. In linux-next, unprivileged user can already setup inode watches (i.e. like inotify). So I am not sure what you are referring to by "block the creation of subtree watches". If systemd were to idmap my home dir to mnt_userns where my user has CAP_SYS_ADMIN, then allowing my user to setup a watch for all events on that mount should not be too hard. If you think that is useful and you want to play with this feature I can provide a WIP branch soon. Thanks, Amir.
On Wed, Mar 17, 2021 at 09:14:06PM +0200, Amir Goldstein wrote: > On Wed, Mar 17, 2021 at 7:45 PM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > > > On Wed, Mar 17, 2021 at 02:19:57PM +0200, Amir Goldstein wrote: > > > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > > > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > > > > > Jan, > > > > > > > > > > > > > > These patches try to implement a minimal set and least controversial > > > > > > > functionality that we can allow for unprivileged users as a starting > > > > > > > point. > > > > > > > > > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > > > > > patches using the unprivileged listener LTP tests written by Matthew > > > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > > > > > > > > > Thanks. I've added both patches to my tree. > > > > > > > > > > Great! > > > > > I'll go post the LTP tests and work on the man page updates. > > > > > > > > > > BTW, I noticed that you pushed the aggregating for_next branch, > > > > > but not the fsnotify topic branch. > > > > > > > > > > Is this intentional? > > > > > > > > Not really, pushed now. Thanks for reminder. > > > > > > > > > I am asking because I am usually basing my development branches > > > > > off of your fsnotify branch, but I can base them on the unpushed branch. > > > > > > > > > > Heads up. I am playing with extra privileges we may be able to > > > > > allow an ns_capable user. > > > > > For example, watching a FS_USERNS_MOUNT filesystem that the user > > > > > itself has mounted inside userns. > > > > > > > > > > Another feature I am investigating is how to utilize the new idmapped > > > > > mounts to get a subtree watch functionality. This requires attaching a > > > > > userns to the group on fanotify_init(). > > > > > > > > > > <hand waving> > > > > > If the group's userns are the same or below the idmapped mount userns, > > > > > then all the objects accessed via that idmapped mount are accessible > > > > > to the group's userns admin. We can use that fact to filter events very > > > > > early based on their mnt_userns and the group's userns, which should be > > > > > cheaper than any subtree permission checks. > > > > > <\hand waving> > > > > > > > > Yeah, I agree this should work. Just it seems to me the userbase for this > > > > functionality will be (at least currently) rather limited. While full > > > > > > That may change when systemd home dirs feature starts to use > > > idmapped mounts. > > > Being able to watch the user's entire home directory is a big win > > > already. > > > > Hey Amir, > > Hey Jan, > > > > I think so too. > > > > > > > > > subtree watches would be IMO interesting to much more users. > > > > > > Agreed. > > > > We have a use-case for subtree watches: One feature for containers we > > have is that users can e.g. tell us that they want the container manager > > to hotplug an arbitrary unix or block device into the container whenever > > the relevant device shows up on the system. For example they could > > instruct the container manager to plugin some new driver device when it > > shows up in /dev. That works nicely because of uevents. But users quite > > often also instruct us to plugin a path once it shows up in some > > directory in the filesystem hierarchy and unplug it once it is removed. > > Right now we're mainting an inotify-based hand-rolled recursive watch to > > make this work so we detect that add and remove event. I would be wildly > > excited if we could get rid of some of that complexity by using subtree > > watches. The container manager on the host will be unaffected by this > > feature since it will usually have root privileges and manage > > unprivileged containers. > > The unprivileged (userns use-case specifically here) subtree watches > > will be necessary and really good to have to make this work for > > container workloads and nested containers, i.e. where the container > > manager itselfs runs in a container and starts new containres. Since the > > subtree feature would be interesting for systemd itself and since our > > container manager (ChromeOS etc.) runs systemd inside unprivileged > > containers on a large scale it would be good if subtree watches could > > work in userns too. > > > > I don't understand the subtree watch use case. > You will have to walk me through it. > > What exactly is the container manager trying to detect? > That a subdir of a specific name/path was created/deleted? > It doesn't sound like a recursive watch is needed for that. > What am I missing? Sorry if I was unclear. For example, a user may tell the container manager to hotplug /home/jdoe/some/path/ into the container. Users are free to tell the container manager that that path doesn't need exist. At that poing the container manager will start to mirror the first part of the path that does exist. And as soon as the full path has been created the container manager will hotplug that path as a new mount into the container. Similarly it will remove that mount from the container as soon as the path is deleted from the host. So say the user tells the container manager to inject /home/jdoe/some/path into the container but only /home/jdoe currently exists then the container manager will recursively watch: /home/jdoe waiting for the full path to be created. This is all a bit nasty since we need to ensure that we notice all events. For example, the user could create /home/jdoe/some but then right after that /home/jdoe/some could be removed again. With the inotify listener we need to constantly add (and remove iirc) watch fds and ensure that we never miss an event and that's brittle. I'd rather have something that allows me to mirror /home/jdoe recursively directly. But maybe I'm misunderstanding fanotify and it can't really help us but I thought that subtree watches might. One of the reason't I didn't use fanotiy when we implemented this was that it couldn't be used inside of user namespaces, i.e. CAP_SYS_ADMIN in the initial userns was required. We always make very sure that users can properly nest containers and have almost all the same features available that they have with non-nested containers. And since fanotify currently requires CAP_SYS_ADMIN in the init userns it means a container manager running inside a container wanting to hotplug paths for nested containers can't use fanotify. (Btw, this is part of the code I wrote to implement this logic via inotify a long time ago https://github.com/lxc/lxd/blob/f12f03a4ba4645892ef6cc167c24da49d1217b02/lxd/device/device_utils_inotify.go [I'm sorry you have to see this in case you click on it...]) > > As for nested container managers (and systemd), my thinking is > that if all the mounts that manager is watching for serving its containers > are idmapped to that manager's userns (is that a viable option?), then Yes, it is possible. We do now support AT_RECURSIVE with all mount attributes including idmapping mounts. > there shouldn't be a problem to setup userns filtered watches in order to > be notified on all the events that happen via those idmapped mounts > and filtering by "subtree" is not needed. > I am clearly far from understanding the big picture. I think I need to refamiliarize myself with what "subtree" watches do. Maybe I misunderstood what they do. I'll take a look. > > > > > > > I was looking into that as well, using the example of nfsd_acceptable() > > > to implement the subtree permission check. > > > > > > The problem here is that even if unprivileged users cannot compromise > > > security, they can still cause significant CPU overhead either queueing > > > events or filtering events and that is something I haven't been able to > > > figure out a way to escape from. > > > > > > BUT, if you allow userns admin to setup subtree watches (a.k.a filtered > > > filesystem marks) on a userns filesystem/idmapped mount, now users > > > > I think that sounds reasonable. > > If the mount really is idmapped, it might be interesting to consider > > checking for privilege in the mnt_userns in addition to the regular > > permission checks that fanotify performs. My (equally handwavy) thinking > > is that this might allow for a nice feature where the creator of the > > mount (e.g. systemd) can block the creation of subtree watches by > > attaching a mnt_userns to the mnt that the user has no privilege in. > > (Just a thought.). > > > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup > fanotify watches. > In linux-next, unprivileged user can already setup inode watches > (i.e. like inotify). Just to clarify: you mean "unprivileged" as in non-root users in init_user_ns and therefore also users in non-init userns. That's what inotify allows you. This would probably allows us to use fanotify instead of the hand-rolled recursive notify watching we currently do and that I linked to above. > > So I am not sure what you are referring to by "block the creation of > subtree watches". > > If systemd were to idmap my home dir to mnt_userns where my user > has CAP_SYS_ADMIN, then allowing my user to setup a watch for > all events on that mount should not be too hard. Right, that was essentially what my comment was about. > If you think that is useful and you want to play with this feature I can > provide a WIP branch soon. I would like to first play with the support for unprivileged fanotify but sure, it does sound useful! Christian
On Wed 17-03-21 14:19:57, Amir Goldstein wrote: > On Wed, Mar 17, 2021 at 1:42 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 17-03-21 13:01:35, Amir Goldstein wrote: > > > On Tue, Mar 16, 2021 at 5:55 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Thu 04-03-21 13:29:19, Amir Goldstein wrote: > > > > > Jan, > > > > > > > > > > These patches try to implement a minimal set and least controversial > > > > > functionality that we can allow for unprivileged users as a starting > > > > > point. > > > > > > > > > > The patches were tested on top of v5.12-rc1 and the fanotify_merge > > > > > patches using the unprivileged listener LTP tests written by Matthew > > > > > and another LTP tests I wrote to test the sysfs tunable limits [1]. > > > > > > > > Thanks. I've added both patches to my tree. > > > > > > Great! > > > I'll go post the LTP tests and work on the man page updates. > > > > > > BTW, I noticed that you pushed the aggregating for_next branch, > > > but not the fsnotify topic branch. > > > > > > Is this intentional? > > > > Not really, pushed now. Thanks for reminder. > > > > > I am asking because I am usually basing my development branches > > > off of your fsnotify branch, but I can base them on the unpushed branch. > > > > > > Heads up. I am playing with extra privileges we may be able to > > > allow an ns_capable user. > > > For example, watching a FS_USERNS_MOUNT filesystem that the user > > > itself has mounted inside userns. > > > > > > Another feature I am investigating is how to utilize the new idmapped > > > mounts to get a subtree watch functionality. This requires attaching a > > > userns to the group on fanotify_init(). > > > > > > <hand waving> > > > If the group's userns are the same or below the idmapped mount userns, > > > then all the objects accessed via that idmapped mount are accessible > > > to the group's userns admin. We can use that fact to filter events very > > > early based on their mnt_userns and the group's userns, which should be > > > cheaper than any subtree permission checks. > > > <\hand waving> > > > > Yeah, I agree this should work. Just it seems to me the userbase for this > > functionality will be (at least currently) rather limited. While full > > That may change when systemd home dirs feature starts to use idmapped > mounts. Being able to watch the user's entire home directory is a big > win already. Do you mean that home directory would be an extra mount with userns in which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that mount? > > subtree watches would be IMO interesting to much more users. > > Agreed. > > I was looking into that as well, using the example of nfsd_acceptable() > to implement the subtree permission check. > > The problem here is that even if unprivileged users cannot compromise > security, they can still cause significant CPU overhead either queueing > events or filtering events and that is something I haven't been able to > figure out a way to escape from. WRT queueing overhead, given a user can place ~1M of directory watches, he can cause noticable total overhead for queueing events anyway. Furthermore the queue size is limited so unless the user spends time consuming events as well, the load won't last long. But I agree we need to be careful not to introduce too big latencies to operations generating events. So I think if we could quickly detect whether a generated event has a good chance of being relevant for some subtree watch of a group and queue it in that case and worry about permission checks only once events are received and thus receiver pays the cost of expensive checks, that might be fine as well. Honza
[...] I understand the use case. > I'd rather have something that allows me to mirror > > /home/jdoe > > recursively directly. But maybe I'm misunderstanding fanotify and it > can't really help us but I thought that subtree watches might. > There are no subtree watches. They are still a holy grale for fanotify... There are filesystem and mnt watches and the latter support far fewer events (only events for operations that carry the path argument). With filesystem watches, you can get events for all mkdirs and you can figure out the created path, but you'd have to do all the filtering in userspace. What I am trying to create is "filtered" filesystem watches and the filter needs to be efficient enough so the watcher will not incur too big of a penalty on all the operations in the filesystem. Thanks to your mnt_userns changes, implementing a filter to intercept (say) mkdir calles on a specific mnt_userns should be quite simple, but filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in userspace. This narrows the problem to the nested container manager that will only need to filter events which happened via mounts under its control. [...] > > there shouldn't be a problem to setup userns filtered watches in order to > > be notified on all the events that happen via those idmapped mounts > > and filtering by "subtree" is not needed. > > I am clearly far from understanding the big picture. > > I think I need to refamiliarize myself with what "subtree" watches do. > Maybe I misunderstood what they do. I'll take a look. > You will not find them :-) [...] > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup > > fanotify watches. > > In linux-next, unprivileged user can already setup inode watches > > (i.e. like inotify). > > Just to clarify: you mean "unprivileged" as in non-root users in > init_user_ns and therefore also users in non-init userns. That's what Correct. > inotify allows you. This would probably allows us to use fanotify > instead of the hand-rolled recursive notify watching we currently do and > that I linked to above. > The code that sits in linux-next can give you pretty much a drop-in replacement of inotify and nothing more. See example code: https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > If you think that is useful and you want to play with this feature I can > > provide a WIP branch soon. > > I would like to first play with the support for unprivileged fanotify > but sure, it does sound useful! Just so you have an idea what I am talking about, this is a very early POC branch: https://github.com/amir73il/linux/commits/fanotify_userns It will not be very useful to you yet I think. Userns admin can watch all events on a tmpfs/fuse mounted inside its userns. Userns admin can watch open/read/write/close events on an idmapped mount mapped to its userns. But I think the more useful feature would be to watch all events on an idmapped mount mapped to its userns. Thanks, Amir.
> > That may change when systemd home dirs feature starts to use idmapped > > mounts. Being able to watch the user's entire home directory is a big > > win already. > > Do you mean that home directory would be an extra mount with userns in > which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that > mount? > That is what I meant. My understanding of the systemd-homed use case for idmapped mounts is that the user has CAP_SYS_ADMIN is the mapped userns, but I may be wrong. > > > subtree watches would be IMO interesting to much more users. > > > > Agreed. > > > > I was looking into that as well, using the example of nfsd_acceptable() > > to implement the subtree permission check. > > > > The problem here is that even if unprivileged users cannot compromise > > security, they can still cause significant CPU overhead either queueing > > events or filtering events and that is something I haven't been able to > > figure out a way to escape from. > > WRT queueing overhead, given a user can place ~1M of directory watches, he > can cause noticable total overhead for queueing events anyway. Furthermore I suppose so. But a user placing 1M dir watches at least adds this overhead knowingly. Adding a overhead on the entire filesystem when just wanting to watch a small subtree doesn't sound ideal. Especially in very nested setups. So yes, we need to be careful. > the queue size is limited so unless the user spends time consuming events > as well, the load won't last long. But I agree we need to be careful not to > introduce too big latencies to operations generating events. So I think if > we could quickly detect whether a generated event has a good chance of > being relevant for some subtree watch of a group and queue it in that case > and worry about permission checks only once events are received and thus > receiver pays the cost of expensive checks, that might be fine as well. > So far the only idea I had for "quickly detect" which I cannot find flaws in is to filter by mnt_userms, but its power is limited. Thanks, Amir.
On Thu, Mar 18, 2021 at 07:07:00PM +0200, Amir Goldstein wrote: > > > That may change when systemd home dirs feature starts to use idmapped > > > mounts. Being able to watch the user's entire home directory is a big > > > win already. > > > > Do you mean that home directory would be an extra mount with userns in > > which the user has CAP_SYS_ADMIN so he'd be able to watch subtrees on that > > mount? > > > > That is what I meant. > My understanding of the systemd-homed use case for idmapped mounts is > that the user has CAP_SYS_ADMIN is the mapped userns, but I may be wrong. systemd can simply create a new userns with the uid/gid of the target user effectively delegating it (That's independent of actually writing a uid gid mapping for the userns which will be done with privileges.) and then attach it to that mount for the user. Mine and Lennart's idea there so far has been that the creation would likely be done by the user's session at login time brauner 1346 0.0 0.0 20956 8512 ? Ss Mar03 0:03 /lib/systemd/systemd --user and systemd as root would then take care of writing the mapping to the userns and then attaching it to the mount. (I'll see Lennart in the next few days and see what works best and once we're ready start a discussion somwhere on a public list, I would suggest.) (If systemd doesn't want a user to be able to monitor a mnt it can simply create a userns with a different uid/gid but with the relevant mapping. This was what my earlier point was about "blocking a user from creating a subtree watch".) Christian
On Thu, Mar 18, 2021 at 06:48:11PM +0200, Amir Goldstein wrote: > [...] > > I understand the use case. > > > I'd rather have something that allows me to mirror > > > > /home/jdoe > > > > recursively directly. But maybe I'm misunderstanding fanotify and it > > can't really help us but I thought that subtree watches might. > > > > There are no subtree watches. They are still a holy grale for fanotify... > There are filesystem and mnt watches and the latter support far fewer > events (only events for operations that carry the path argument). > > With filesystem watches, you can get events for all mkdirs and you can > figure out the created path, but you'd have to do all the filtering in > userspace. > > What I am trying to create is "filtered" filesystem watches and the filter needs > to be efficient enough so the watcher will not incur too big of a penalty > on all the operations in the filesystem. > > Thanks to your mnt_userns changes, implementing a filter to intercept > (say) mkdir calles on a specific mnt_userns should be quite simple, but > filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in > userspace. > > This narrows the problem to the nested container manager that will only > need to filter events which happened via mounts under its control. > > [...] > > > > there shouldn't be a problem to setup userns filtered watches in order to > > > be notified on all the events that happen via those idmapped mounts > > > and filtering by "subtree" is not needed. > > > I am clearly far from understanding the big picture. > > > > I think I need to refamiliarize myself with what "subtree" watches do. > > Maybe I misunderstood what they do. I'll take a look. > > > > You will not find them :-) Heh. :) > > [...] > > > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup > > > fanotify watches. > > > In linux-next, unprivileged user can already setup inode watches > > > (i.e. like inotify). > > > > Just to clarify: you mean "unprivileged" as in non-root users in > > init_user_ns and therefore also users in non-init userns. That's what > > Correct. > > > inotify allows you. This would probably allows us to use fanotify > > instead of the hand-rolled recursive notify watching we currently do and > > that I linked to above. > > > > The code that sits in linux-next can give you pretty much a drop-in > replacement of inotify and nothing more. See example code: > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid This is really great. Thank you for doing that work this will help quite a lot of use-cases and make things way simpler. I created a TODO to port our path-hotplug to this once this feature lands. > > > > If you think that is useful and you want to play with this feature I can > > > provide a WIP branch soon. > > > > I would like to first play with the support for unprivileged fanotify > > but sure, it does sound useful! > > Just so you have an idea what I am talking about, this is a very early > POC branch: > https://github.com/amir73il/linux/commits/fanotify_userns Thanks! I'll try to pull this and take a look next week. I hope that's ok. Christian
On Fri, Mar 19, 2021 at 3:40 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Thu, Mar 18, 2021 at 06:48:11PM +0200, Amir Goldstein wrote: > > [...] > > > > I understand the use case. > > > > > I'd rather have something that allows me to mirror > > > > > > /home/jdoe > > > > > > recursively directly. But maybe I'm misunderstanding fanotify and it > > > can't really help us but I thought that subtree watches might. > > > > > > > There are no subtree watches. They are still a holy grale for fanotify... > > There are filesystem and mnt watches and the latter support far fewer > > events (only events for operations that carry the path argument). > > > > With filesystem watches, you can get events for all mkdirs and you can > > figure out the created path, but you'd have to do all the filtering in > > userspace. > > > > What I am trying to create is "filtered" filesystem watches and the filter needs > > to be efficient enough so the watcher will not incur too big of a penalty > > on all the operations in the filesystem. > > > > Thanks to your mnt_userns changes, implementing a filter to intercept > > (say) mkdir calles on a specific mnt_userns should be quite simple, but > > filtering by "path" (i.e. /home/jdoe/some/path) will still need to happen in > > userspace. > > > > This narrows the problem to the nested container manager that will only > > need to filter events which happened via mounts under its control. > > > > [...] > > > > > > there shouldn't be a problem to setup userns filtered watches in order to > > > > be notified on all the events that happen via those idmapped mounts > > > > and filtering by "subtree" is not needed. > > > > I am clearly far from understanding the big picture. > > > > > > I think I need to refamiliarize myself with what "subtree" watches do. > > > Maybe I misunderstood what they do. I'll take a look. > > > > > > > You will not find them :-) > > Heh. :) > > > > > [...] > > > > > > Currently, (upstream) only init_userns CAP_SYS_ADMIN can setup > > > > fanotify watches. > > > > In linux-next, unprivileged user can already setup inode watches > > > > (i.e. like inotify). > > > > > > Just to clarify: you mean "unprivileged" as in non-root users in > > > init_user_ns and therefore also users in non-init userns. That's what > > > > Correct. > > > > > inotify allows you. This would probably allows us to use fanotify > > > instead of the hand-rolled recursive notify watching we currently do and > > > that I linked to above. > > > > > > > The code that sits in linux-next can give you pretty much a drop-in > > replacement of inotify and nothing more. See example code: > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > This is really great. Thank you for doing that work this will help quite > a lot of use-cases and make things way simpler. I created a TODO to port > our path-hotplug to this once this feature lands. > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel and there were some build issues, so rebased my branch on upstream inotify-tools to fix those build issues. I was not aware that the inotify-tools project is alive, I never intended to upstream this demo code and never created a github pull request but rebasing on upstream brought in some CI scripts, when I pushed the branch to my github it triggered some tests that reported build failures on Ubuntu 16.04 and 18.04. Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase branch 'fanotify_name_fid'. You can try whichever works for you. You can look at the test script src/test_demo.sh for usage example. Or just cd into a writable directory and run the script to see the demo. The demo determines whether to use a recursive watch or "global" watch by the uid of the user. > > > > > > If you think that is useful and you want to play with this feature I can > > > > provide a WIP branch soon. > > > > > > I would like to first play with the support for unprivileged fanotify > > > but sure, it does sound useful! > > > > Just so you have an idea what I am talking about, this is a very early > > POC branch: > > https://github.com/amir73il/linux/commits/fanotify_userns > > Thanks! I'll try to pull this and take a look next week. I hope that's > ok. > Fine. I'm curious to know what it does. Did not get to test it with userns yet :) Thanks, Amir.
> > > The code that sits in linux-next can give you pretty much a drop-in > > > replacement of inotify and nothing more. See example code: > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > > > This is really great. Thank you for doing that work this will help quite > > a lot of use-cases and make things way simpler. I created a TODO to port > > our path-hotplug to this once this feature lands. > > > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel > and there were some build issues, so rebased my branch on upstream > inotify-tools to fix those build issues. > > I was not aware that the inotify-tools project is alive, I never intended > to upstream this demo code and never created a github pull request > but rebasing on upstream brought in some CI scripts, when I pushed the > branch to my github it triggered some tests that reported build failures on > Ubuntu 16.04 and 18.04. > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase > branch 'fanotify_name_fid'. You can try whichever works for you. > > You can look at the test script src/test_demo.sh for usage example. > Or just cd into a writable directory and run the script to see the demo. > The demo determines whether to use a recursive watch or "global" > watch by the uid of the user. > > > > > > > > > If you think that is useful and you want to play with this feature I can > > > > > provide a WIP branch soon. > > > > > > > > I would like to first play with the support for unprivileged fanotify > > > > but sure, it does sound useful! > > > > > > Just so you have an idea what I am talking about, this is a very early > > > POC branch: > > > https://github.com/amir73il/linux/commits/fanotify_userns > > > > Thanks! I'll try to pull this and take a look next week. I hope that's > > ok. > > > > Fine. I'm curious to know what it does. > Did not get to test it with userns yet :) Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted inside userns and works fine, with two wrinkles I needed to iron: 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has zero f_fsid (easy to fix) 2. open_by_handle_at() is not userns aware (can relax for FS_USERNS_MOUNT fs) Pushed these two fixes to branch fanotify_userns. Thanks, Amir.
On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > The code that sits in linux-next can give you pretty much a drop-in > > > > replacement of inotify and nothing more. See example code: > > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > > > > > This is really great. Thank you for doing that work this will help quite > > > a lot of use-cases and make things way simpler. I created a TODO to port > > > our path-hotplug to this once this feature lands. > > > > > > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel > > and there were some build issues, so rebased my branch on upstream > > inotify-tools to fix those build issues. > > > > I was not aware that the inotify-tools project is alive, I never intended > > to upstream this demo code and never created a github pull request > > but rebasing on upstream brought in some CI scripts, when I pushed the > > branch to my github it triggered some tests that reported build failures on > > Ubuntu 16.04 and 18.04. > > > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase > > branch 'fanotify_name_fid'. You can try whichever works for you. FYI, fixed the CI build errors on fanotify_name_fid branch. > > > > You can look at the test script src/test_demo.sh for usage example. > > Or just cd into a writable directory and run the script to see the demo. > > The demo determines whether to use a recursive watch or "global" > > watch by the uid of the user. > > > > > > > > > > > > If you think that is useful and you want to play with this feature I can > > > > > > provide a WIP branch soon. > > > > > > > > > > I would like to first play with the support for unprivileged fanotify > > > > > but sure, it does sound useful! > > > > > > > > Just so you have an idea what I am talking about, this is a very early > > > > POC branch: > > > > https://github.com/amir73il/linux/commits/fanotify_userns > > > > > > Thanks! I'll try to pull this and take a look next week. I hope that's > > > ok. > > > > > > > Fine. I'm curious to know what it does. > > Did not get to test it with userns yet :) > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > inside userns and works fine, with two wrinkles I needed to iron: > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > zero f_fsid (easy to fix) > 2. open_by_handle_at() is not userns aware (can relax for > FS_USERNS_MOUNT fs) > > Pushed these two fixes to branch fanotify_userns. Pushed another fix to mnt refcount bug in WIP and another commit to add the last piece that could make fanotify usable for systemd-homed setup - a filesystem watch filtered by mnt_userns (not tested yet). One thing I am struggling with is the language to describe user ns and idmapped mounts related logic. I have a feeling that I am getting the vocabulary all wrong. See my commit message text below. Editorial corrections would be appreciated. Thanks, Amir. --- fanotify: support sb mark filtered by idmapped mount For a group created inside userns, adding a mount mark is allowed if the mount is idmapped inside the group's userns and adding a filesystem mark is allowed if the filesystem was mounted inside the group's userns. Allow also adding a filesystem mark on a filesystem that is mounted in the group's userns via an idmapped mount. In that case, the mark is created with an internal flag, which indicates that events should be sent to the group only if they happened via an idmapped mount inside the group's userns. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
On Mon, Mar 22, 2021 at 02:44:20PM +0200, Amir Goldstein wrote: > On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > The code that sits in linux-next can give you pretty much a drop-in > > > > > replacement of inotify and nothing more. See example code: > > > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > > > > > > > This is really great. Thank you for doing that work this will help quite > > > > a lot of use-cases and make things way simpler. I created a TODO to port > > > > our path-hotplug to this once this feature lands. > > > > > > > > > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel > > > and there were some build issues, so rebased my branch on upstream > > > inotify-tools to fix those build issues. > > > > > > I was not aware that the inotify-tools project is alive, I never intended > > > to upstream this demo code and never created a github pull request > > > but rebasing on upstream brought in some CI scripts, when I pushed the > > > branch to my github it triggered some tests that reported build failures on > > > Ubuntu 16.04 and 18.04. > > > > > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase > > > branch 'fanotify_name_fid'. You can try whichever works for you. > > FYI, fixed the CI build errors on fanotify_name_fid branch. > > > > > > > You can look at the test script src/test_demo.sh for usage example. > > > Or just cd into a writable directory and run the script to see the demo. > > > The demo determines whether to use a recursive watch or "global" > > > watch by the uid of the user. > > > > > > > > > > > > > > > If you think that is useful and you want to play with this feature I can > > > > > > > provide a WIP branch soon. > > > > > > > > > > > > I would like to first play with the support for unprivileged fanotify > > > > > > but sure, it does sound useful! > > > > > > > > > > Just so you have an idea what I am talking about, this is a very early > > > > > POC branch: > > > > > https://github.com/amir73il/linux/commits/fanotify_userns > > > > > > > > Thanks! I'll try to pull this and take a look next week. I hope that's > > > > ok. > > > > > > > > > > Fine. I'm curious to know what it does. > > > Did not get to test it with userns yet :) > > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > inside userns and works fine, with two wrinkles I needed to iron: > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > zero f_fsid (easy to fix) > > 2. open_by_handle_at() is not userns aware (can relax for > > FS_USERNS_MOUNT fs) > > > > Pushed these two fixes to branch fanotify_userns. > > Pushed another fix to mnt refcount bug in WIP and another commit to > add the last piece that could make fanotify usable for systemd-homed > setup - a filesystem watch filtered by mnt_userns (not tested yet). Sounds interesting. So I'm looking and commenting on that branch a little. One general question, when fanotify FANOTIFY_PERM_EVENTS is set fanotify will return a file descriptor (for relevant events) referring to the file/directory that e.g. got created. And there are no permissions checks other than the capable(CAP_SYS_ADMIN) check when the fanotify instance is created, right? > > One thing I am struggling with is the language to describe user ns > and idmapped mounts related logic. I have a feeling that I am getting > the vocabulary all wrong. See my commit message text below. The lingo seems legit. :) I need some time thinking about the concepts to convince myself that this is safe. But I like the direction as this is going to be really useful. Christian
On Mon, Mar 22, 2021 at 6:28 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Mon, Mar 22, 2021 at 02:44:20PM +0200, Amir Goldstein wrote: > > On Sat, Mar 20, 2021 at 2:57 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > The code that sits in linux-next can give you pretty much a drop-in > > > > > > replacement of inotify and nothing more. See example code: > > > > > > https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid > > > > > > > > > > This is really great. Thank you for doing that work this will help quite > > > > > a lot of use-cases and make things way simpler. I created a TODO to port > > > > > our path-hotplug to this once this feature lands. > > > > > > > > > > > > > FWIW, I just tried to build this branch on Ubuntu 20.04.2 with LTS kernel > > > > and there were some build issues, so rebased my branch on upstream > > > > inotify-tools to fix those build issues. > > > > > > > > I was not aware that the inotify-tools project is alive, I never intended > > > > to upstream this demo code and never created a github pull request > > > > but rebasing on upstream brought in some CI scripts, when I pushed the > > > > branch to my github it triggered some tests that reported build failures on > > > > Ubuntu 16.04 and 18.04. > > > > > > > > Anyway, there is a pre-rebase branch 'fanotify_name' and the post rebase > > > > branch 'fanotify_name_fid'. You can try whichever works for you. > > > > FYI, fixed the CI build errors on fanotify_name_fid branch. > > > > > > > > > > You can look at the test script src/test_demo.sh for usage example. > > > > Or just cd into a writable directory and run the script to see the demo. > > > > The demo determines whether to use a recursive watch or "global" > > > > watch by the uid of the user. > > > > > > > > > > > > > > > > > > If you think that is useful and you want to play with this feature I can > > > > > > > > provide a WIP branch soon. > > > > > > > > > > > > > > I would like to first play with the support for unprivileged fanotify > > > > > > > but sure, it does sound useful! > > > > > > > > > > > > Just so you have an idea what I am talking about, this is a very early > > > > > > POC branch: > > > > > > https://github.com/amir73il/linux/commits/fanotify_userns > > > > > > > > > > Thanks! I'll try to pull this and take a look next week. I hope that's > > > > > ok. > > > > > > > > > > > > > Fine. I'm curious to know what it does. > > > > Did not get to test it with userns yet :) > > > > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > > inside userns and works fine, with two wrinkles I needed to iron: > > > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > > zero f_fsid (easy to fix) > > > 2. open_by_handle_at() is not userns aware (can relax for > > > FS_USERNS_MOUNT fs) > > > > > > Pushed these two fixes to branch fanotify_userns. > > > > Pushed another fix to mnt refcount bug in WIP and another commit to > > add the last piece that could make fanotify usable for systemd-homed > > setup - a filesystem watch filtered by mnt_userns (not tested yet). > > Sounds interesting. > > So I'm looking and commenting on that branch a little. > One general question, when fanotify FANOTIFY_PERM_EVENTS is set fanotify > will return a file descriptor (for relevant events) referring to the > file/directory that e.g. got created. And there are no permissions > checks other than the capable(CAP_SYS_ADMIN) check when the fanotify > instance is created, right? > Correct. fanotify_init() enforces that in a few maybe not so obvious steps: 1. Either CAP_SYS_ADMIN or fid_mode (no file descriptor in event): if (!capable(CAP_SYS_ADMIN)) { /* * An unprivileged user can setup an fanotify group with * limited functionality - an unprivileged group is limited to * notification events with file handles and it cannot use * unlimited queue/marks. */ if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode) return -EPERM; } 2. fid_mode is not supported for high priority classes: if (fid_mode && class != FAN_CLASS_NOTIF) return -EINVAL; 3. Permission events are only allowed for high priority classes: /* * group->priority == FS_PRIO_0 == FAN_CLASS_NOTIF. These are not * allowed to set permissions events. */ ret = -EINVAL; if (mask & FANOTIFY_PERM_EVENTS && group->priority == FS_PRIO_0) You may want to look at the summary of all the limitations on unprivileged listener here: https://github.com/amir73il/man-pages/blob/fanotify_unpriv/man2/fanotify_init.2#L400 Thanks, Amir.
> > > The problem here is that even if unprivileged users cannot compromise > > > security, they can still cause significant CPU overhead either queueing > > > events or filtering events and that is something I haven't been able to > > > figure out a way to escape from. > > > > WRT queueing overhead, given a user can place ~1M of directory watches, he > > can cause noticable total overhead for queueing events anyway. Furthermore > > I suppose so. But a user placing 1M dir watches at least adds this overhead > knowingly. Adding a overhead on the entire filesystem when just wanting to > watch a small subtree doesn't sound ideal. Especially in very nested setups. > So yes, we need to be careful. > I was thinking about this some more and I think the answer is in your example. User can only place 1M dir watches if ucount marks limits permits it. So whatever we allow to do with subtree or userns filtered marks should also be limited by ucounts. > > the queue size is limited so unless the user spends time consuming events > > as well, the load won't last long. But I agree we need to be careful not to > > introduce too big latencies to operations generating events. So I think if > > we could quickly detect whether a generated event has a good chance of > > being relevant for some subtree watch of a group and queue it in that case > > and worry about permission checks only once events are received and thus > > receiver pays the cost of expensive checks, that might be fine as well. > > > > So far the only idea I had for "quickly detect" which I cannot find flaws in > is to filter by mnt_userms, but its power is limited. > So I have implemented this idea on fanotify_userns branch and the cost per "filtered" sb mark is quite low - its a pretty cheap check in send_to_group() But still, if an unbound number of users can add to the sb mark list it is not going to end well. <hand waving> I think what we need here (thinking out loud) is to account the sb marks to the user that mounted the filesystem or to the user mapped to admin using idmapped mount, maybe to both(?), probably using a separate ucount entry (e.g. max_fanotify_filesystem_marks). We can set this limit by default to a small number (128?) to limit the sb list iteration per filesystem event and container manager / systemd can further limit this resource when creating idmapped mounts, which would otherwise allow the mapped user to add "filtered" (*) sb marks. </hand waving> Thanks, Amir. (*) "filtered" can refer to both the userns filter I proposed and going forward also maybe to subtree filter
On Mon 22-03-21 20:38:32, Amir Goldstein wrote: > > > > The problem here is that even if unprivileged users cannot compromise > > > > security, they can still cause significant CPU overhead either queueing > > > > events or filtering events and that is something I haven't been able to > > > > figure out a way to escape from. > > > > > > WRT queueing overhead, given a user can place ~1M of directory watches, he > > > can cause noticable total overhead for queueing events anyway. Furthermore > > > > I suppose so. But a user placing 1M dir watches at least adds this overhead > > knowingly. Adding a overhead on the entire filesystem when just wanting to > > watch a small subtree doesn't sound ideal. Especially in very nested setups. > > So yes, we need to be careful. > > > > I was thinking about this some more and I think the answer is in your example. > User can only place 1M dir watches if ucount marks limits permits it. > > So whatever we allow to do with subtree or userns filtered marks should > also be limited by ucounts. Yes, agreed. > > > the queue size is limited so unless the user spends time consuming events > > > as well, the load won't last long. But I agree we need to be careful not to > > > introduce too big latencies to operations generating events. So I think if > > > we could quickly detect whether a generated event has a good chance of > > > being relevant for some subtree watch of a group and queue it in that case > > > and worry about permission checks only once events are received and thus > > > receiver pays the cost of expensive checks, that might be fine as well. > > > > > > > So far the only idea I had for "quickly detect" which I cannot find flaws in > > is to filter by mnt_userms, but its power is limited. > > So I have implemented this idea on fanotify_userns branch and the cost > per "filtered" sb mark is quite low - its a pretty cheap check in > send_to_group() > But still, if an unbound number of users can add to the sb mark list it is > not going to end well. Thinking out loud: So what is the cost going to be for the side generating events? Ideally it would of O(number of fanotify groups receiving event). We cannot get better than that and if the constants aren't too big I think this is acceptable overhead. Sure this can mean total work of (number of users) * (max number of subtree marks per user) for queueing notification event but I don't think it is practical for a DoS attack and I also don't think that in practice users will be watching overlapping subtrees that much. The question is whether we can get that fast. Probably not because that would have to attach subtree watches to directory inodes or otherwise filter out unrelated fanotify groups in O(1). But the complexity of O(number of groups receiving events + depth of dir where event is happening) is probably achievable - we'd walk the tree up and have roots of watched subtrees marked. What do you think? Also there is a somewhat related question what is the semantics of subtree watches in presence of mounts - do subtree watches "see through" mount points? Probably not but then with bind mounts this can be sometimes inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and I'm watching subtree of /var, I would not get events for what's in /var/tmp... Which is logical if you spell it out like this but applications often don't care how the mount hierarchy looks like, they just care about locally visible directory structure. > <hand waving> > I think what we need here (thinking out loud) is to account the sb marks > to the user that mounted the filesystem or to the user mapped to admin using > idmapped mount, maybe to both(?), probably using a separate ucount entry > (e.g. max_fanotify_filesystem_marks). I'm somewhat lost here. Are these two users different? We have /home/foo which is a mounted filesystem. AFAIU it will be mounted in a special user namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now we have a user - let's call it 'foo-usr' that has enough capabilities (whatever they are) in 'foo-ns' to place fanotify subtree marks in /home/foo. So these marks are naturally accounted towards 'foo-usr'. To whom else you'd like to also account these marks and why? > We can set this limit by default to a small number (128?) to limit the sb list > iteration per filesystem event and container manager / systemd can further > limit this resource when creating idmapped mounts, which would otherwise > allow the mapped user to add "filtered" (*) sb marks. > </hand waving> > > (*) "filtered" can refer to both the userns filter I proposed and going forward > also maybe to subtree filter Honza
> > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > inside userns and works fine, with two wrinkles I needed to iron: > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > zero f_fsid (easy to fix) > > 2. open_by_handle_at() is not userns aware (can relax for > > FS_USERNS_MOUNT fs) > > > > Pushed these two fixes to branch fanotify_userns. > > Pushed another fix to mnt refcount bug in WIP and another commit to > add the last piece that could make fanotify usable for systemd-homed > setup - a filesystem watch filtered by mnt_userns (not tested yet). > Now I used mount-idmapped (from xfstest) to test that last piece. Found a minor bug and pushed a fix. It is working as expected, that is filtering only the events generated via the idmapped mount. However, because the listener I tested is capable in the mapped userns and not in the sb userns, the listener cannot open_ny_handle_at(), so the result is not as useful as one might hope. I guess we will also need to make open_by_handle_at() idmapped aware and use a variant of vfs_dentry_acceptable() that validates that the opened path is legitimately accessible via the idmapped mount. I think I will leave this complexity to you should you think the userns filtered watch is something worth the effort. Thanks, Amir.
On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote: > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > > inside userns and works fine, with two wrinkles I needed to iron: > > > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > > zero f_fsid (easy to fix) > > > 2. open_by_handle_at() is not userns aware (can relax for > > > FS_USERNS_MOUNT fs) > > > > > > Pushed these two fixes to branch fanotify_userns. > > > > Pushed another fix to mnt refcount bug in WIP and another commit to > > add the last piece that could make fanotify usable for systemd-homed > > setup - a filesystem watch filtered by mnt_userns (not tested yet). > > > > Now I used mount-idmapped (from xfstest) to test that last piece. > Found a minor bug and pushed a fix. > > It is working as expected, that is filtering only the events generated via > the idmapped mount. However, because the listener I tested is capable in > the mapped userns and not in the sb userns, the listener cannot > open_ny_handle_at(), so the result is not as useful as one might hope. This is another dumb question probably but in general, are you saying that someone watching a mount or directory and does _not_ want file descriptors from fanotify to be returned has no other way of getting to the path they want to open other than by using open_by_handle_at()? > > I guess we will also need to make open_by_handle_at() idmapped aware > and use a variant of vfs_dentry_acceptable() that validates that the opened > path is legitimately accessible via the idmapped mount. So as a first step, I think there's a legitimate case to be made for open_by_handle_at() to be made useable inside user namespaces. That's a change worth to be made independent of fanotify. For example, nowadays cgroups have a 64 bit identifier that can be used with open_by_handle_at to map a cgrp id to a path and back: https://lkml.org/lkml/2020/12/2/1126 Right now this can't be used in user namespaces because of this restriction but it is genuinely useful to have this feature available since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping is very convenient. Without looking at the code I'm not super sure how name_to_handle_at() and open_by_handle_at() behave in the face of mount namespaces so that would need looking into to. But it would be a genuinely useful change, I think. > > I think I will leave this complexity to you should you think the userns filtered > watch is something worth the effort. Fair enough! Thanks! Christian
On Wed, Mar 24, 2021 at 4:32 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote: > > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > > > inside userns and works fine, with two wrinkles I needed to iron: > > > > > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > > > zero f_fsid (easy to fix) > > > > 2. open_by_handle_at() is not userns aware (can relax for > > > > FS_USERNS_MOUNT fs) > > > > > > > > Pushed these two fixes to branch fanotify_userns. > > > > > > Pushed another fix to mnt refcount bug in WIP and another commit to > > > add the last piece that could make fanotify usable for systemd-homed > > > setup - a filesystem watch filtered by mnt_userns (not tested yet). > > > > > > > Now I used mount-idmapped (from xfstest) to test that last piece. > > Found a minor bug and pushed a fix. > > > > It is working as expected, that is filtering only the events generated via > > the idmapped mount. However, because the listener I tested is capable in > > the mapped userns and not in the sb userns, the listener cannot > > open_ny_handle_at(), so the result is not as useful as one might hope. > > This is another dumb question probably but in general, are you saying > that someone watching a mount or directory and does _not_ want file > descriptors from fanotify to be returned has no other way of getting to > the path they want to open other than by using open_by_handle_at()? > Well there is another way. It is demonstrated in my demo with intoifywatch --fanotify --recursive. It involved userspace iterating a subtree of interest to create fid->path map. The fanotify recursive watch is similar but not exactly the same as the old intoify recursive watch, because with inotify recursive watch you can miss events. With fanotify recursive watch, the listener (if capable) can setup a filesystem mark so events will not be missed. They will be recorded by fid with an unknown path and the path information can be found later by the crawler and updated in the map before the final report. Events on fid that were not found by the crawler need not be reported. That's essentially a subtree watch for the poor implemented in userspace. I did not implement the combination --fanotify --global --recursive in my demo, because it did not make sense with the current permission model (listener that can setup a fs mark can always resolve fids to path), but it would be quite trivial to add. > > > > I guess we will also need to make open_by_handle_at() idmapped aware > > and use a variant of vfs_dentry_acceptable() that validates that the opened > > path is legitimately accessible via the idmapped mount. > > So as a first step, I think there's a legitimate case to be made for > open_by_handle_at() to be made useable inside user namespaces. That's a > change worth to be made independent of fanotify. For example, nowadays > cgroups have a 64 bit identifier that can be used with open_by_handle_at > to map a cgrp id to a path and back: > https://lkml.org/lkml/2020/12/2/1126 > Right now this can't be used in user namespaces because of this > restriction but it is genuinely useful to have this feature available > since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping > is very convenient. FS_USERNS_MOUNT is a simple case and I think it is safe. There is already a patch for that on my fanotify_userns branch. > Without looking at the code I'm not super sure how name_to_handle_at() > and open_by_handle_at() behave in the face of mount namespaces so that > would need looking into to. But it would be a genuinely useful change, I > think. > name_to_handle_at()/open_by_handle_at() should be indifferent to mount ns, because the former returns mount_id which is globally unique (I think) and the latter takes a mount_fd argument, which means the caller needs to prove access to the mount prior to getting an open fd in that mount. Thanks, Amir.
> > So I have implemented this idea on fanotify_userns branch and the cost > > per "filtered" sb mark is quite low - its a pretty cheap check in > > send_to_group() > > But still, if an unbound number of users can add to the sb mark list it is > > not going to end well. > > Thinking out loud: So what is the cost going to be for the side generating > events? Ideally it would of O(number of fanotify groups receiving event). > We cannot get better than that and if the constants aren't too big I think > this is acceptable overhead. Sure this can mean total work of (number of > users) * (max number of subtree marks per user) for queueing notification > event but I don't think it is practical for a DoS attack and I also don't > think that in practice users will be watching overlapping subtrees that > much. > Why overlapping? My concern is not so much from DoS attacks. My concern is more from innocent users adding unacceptable accumulated overhead. Think of a filesystem mounted at /home/ with 100K directories at /home/user$N, every user gets its own idmapped mount from systemd-homed and may or may not choose to run a listener to get events generated under its own home dir (which is an idmapped mount). Even if we limit one sb mask per user, we can still have 100K marks list in sb. For this reason I think we need to limit the number of marks per sb. The simple way is a global config like max_queued_events, but I think we can do better than that. > The question is whether we can get that fast. Probably not because that > would have to attach subtree watches to directory inodes or otherwise > filter out unrelated fanotify groups in O(1). But the complexity of > O(number of groups receiving events + depth of dir where event is happening) > is probably achievable - we'd walk the tree up and have roots of watched > subtrees marked. What do you think? > I am for that. I already posted a POC along those lines [1]. I was just not sure how to limit the potential accumulated overhead. [1] https://github.com/amir73il/linux/commits/fanotify_subtree_mark > Also there is a somewhat related question what is the semantics of subtree > watches in presence of mounts - do subtree watches "see through" mount > points? Probably not but then with bind mounts this can be sometimes > inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and > I'm watching subtree of /var, I would not get events for what's in > /var/tmp... Which is logical if you spell it out like this but applications > often don't care how the mount hierarchy looks like, they just care about > locally visible directory structure. Those are hard questions. I think that userns/mountns developers needed to address them a while ago and I think there are some helpers that help with checking visibility of paths. > > > <hand waving> > > I think what we need here (thinking out loud) is to account the sb marks > > to the user that mounted the filesystem or to the user mapped to admin using > > idmapped mount, maybe to both(?), probably using a separate ucount entry > > (e.g. max_fanotify_filesystem_marks). > > I'm somewhat lost here. Are these two users different? We have /home/foo > which is a mounted filesystem. AFAIU it will be mounted in a special user > namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping > attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now > we have a user - let's call it 'foo-usr' that has enough capabilities > (whatever they are) in 'foo-ns' to place fanotify subtree marks in > /home/foo. So these marks are naturally accounted towards 'foo-usr'. To > whom else you'd like to also account these marks and why? > I would like the system admin to be able to limit 100 sb marks on /home (filtered or not) because that impacts the send_to_group iteration. I would also like systemd to be able to grant a smaller quota of filtered sb marks per user when creating and mapping the idmapped mounts at /home/foo$N I *think* we can achieve that, by accounting the sb marks to uid 0 (who mounted /home) in ucounts entry "fanotify_sb_marks". If /home would have been a FS_USERNS_MOUNT mounted inside some userns, then all its sb marks would be accounted to uid 0 of that userns. I have no ideas if this all adds up. My head explodes even from trying to express these rules :-/ Thanks, Amir.
On Wed, Mar 24, 2021 at 05:05:45PM +0200, Amir Goldstein wrote: > On Wed, Mar 24, 2021 at 4:32 PM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > > > On Wed, Mar 24, 2021 at 03:57:12PM +0200, Amir Goldstein wrote: > > > > > Now tested FAN_MARK_FILESYSTEM watch on tmpfs mounted > > > > > inside userns and works fine, with two wrinkles I needed to iron: > > > > > > > > > > 1. FAN_REPORT_FID not supported on tmpfs because tmpfs has > > > > > zero f_fsid (easy to fix) > > > > > 2. open_by_handle_at() is not userns aware (can relax for > > > > > FS_USERNS_MOUNT fs) > > > > > > > > > > Pushed these two fixes to branch fanotify_userns. > > > > > > > > Pushed another fix to mnt refcount bug in WIP and another commit to > > > > add the last piece that could make fanotify usable for systemd-homed > > > > setup - a filesystem watch filtered by mnt_userns (not tested yet). > > > > > > > > > > Now I used mount-idmapped (from xfstest) to test that last piece. > > > Found a minor bug and pushed a fix. > > > > > > It is working as expected, that is filtering only the events generated via > > > the idmapped mount. However, because the listener I tested is capable in > > > the mapped userns and not in the sb userns, the listener cannot > > > open_ny_handle_at(), so the result is not as useful as one might hope. > > > > This is another dumb question probably but in general, are you saying > > that someone watching a mount or directory and does _not_ want file > > descriptors from fanotify to be returned has no other way of getting to > > the path they want to open other than by using open_by_handle_at()? > > > > Well there is another way. > It is demonstrated in my demo with intoifywatch --fanotify --recursive. > It involved userspace iterating a subtree of interest to create fid->path > map. Ok, so this seems to be inotifytools_filename_from_fid() -> if (fanotify_mark_type != FAN_MARK_FILESYSTEM) watch_from_fid() -> read_path_from(/proc/self/fd/w->dirfd) > > The fanotify recursive watch is similar but not exactly the same as the > old intoify recursive watch, because with inotify recursive watch you > can miss events. > > With fanotify recursive watch, the listener (if capable) can setup a > filesystem mark so events will not be missed. They will be recorded > by fid with an unknown path and the path information can be found later > by the crawler and updated in the map before the final report. > > Events on fid that were not found by the crawler need not be reported. > That's essentially a subtree watch for the poor implemented in userspace. This is already a good improvement. Honestly, having FAN_MARK_INODE workable unprivileged is already pretty great. In addition having FAN_MARK_MOUNT workable with idmapped mounts will likely get us what most users care about, afaict that is the POC in: https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1 (I'm reading the source correctly that FAN_MARK_MOUNT works with FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is set? I'm asking because my manpage - probably too old - seems to imply that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might just be stumbling over the phrasing.) I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s capable requirement. That's imho the cleanest semantics for this, i.e. I'd drop: https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763 This is neither an urgent use-case nor am I feeling very comfortable with it. > > I did not implement the combination --fanotify --global --recursive in my > demo, because it did not make sense with the current permission model > (listener that can setup a fs mark can always resolve fids to path), but it > would be quite trivial to add. > > > > > > > > I guess we will also need to make open_by_handle_at() idmapped aware > > > and use a variant of vfs_dentry_acceptable() that validates that the opened > > > path is legitimately accessible via the idmapped mount. > > > > So as a first step, I think there's a legitimate case to be made for > > open_by_handle_at() to be made useable inside user namespaces. That's a > > change worth to be made independent of fanotify. For example, nowadays > > cgroups have a 64 bit identifier that can be used with open_by_handle_at > > to map a cgrp id to a path and back: > > https://lkml.org/lkml/2020/12/2/1126 > > Right now this can't be used in user namespaces because of this > > restriction but it is genuinely useful to have this feature available > > since cgroups are FS_USERNS_MOUNT and that identifier <-> path mapping > > is very convenient. > > FS_USERNS_MOUNT is a simple case and I think it is safe. > There is already a patch for that on my fanotify_userns branch. Great! Christian
> > Well there is another way. > > It is demonstrated in my demo with intoifywatch --fanotify --recursive. > > It involved userspace iterating a subtree of interest to create fid->path > > map. > > Ok, so this seems to be > > inotifytools_filename_from_fid() > -> if (fanotify_mark_type != FAN_MARK_FILESYSTEM) > watch_from_fid() > -> read_path_from(/proc/self/fd/w->dirfd) > Yes. > > > > The fanotify recursive watch is similar but not exactly the same as the > > old intoify recursive watch, because with inotify recursive watch you > > can miss events. > > > > With fanotify recursive watch, the listener (if capable) can setup a > > filesystem mark so events will not be missed. They will be recorded > > by fid with an unknown path and the path information can be found later > > by the crawler and updated in the map before the final report. > > > > Events on fid that were not found by the crawler need not be reported. > > That's essentially a subtree watch for the poor implemented in userspace. > > This is already a good improvement. > Honestly, having FAN_MARK_INODE workable unprivileged is already pretty I'm not so sure why you say that, because unprivileged FAN_MARK_INODE watches are pretty close in functionality to inotify. There are only subtle differences. > great. In addition having FAN_MARK_MOUNT workable with idmapped mounts > will likely get us what most users care about, afaict that is the POC > in: > https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1 > Hmm, the problem is the limited set of events you can get from FAN_MARK_MOUNT which does not include FAN_CREATE. > (I'm reading the source correctly that FAN_MARK_MOUNT works with > FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is > set? I'm asking because my manpage - probably too old - seems to imply > that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might > just be stumbling over the phrasing.) > commit d71c9b4a5c6fbc7164007b52dba1de410d018292 Author: Amir Goldstein <amir73il@gmail.com> Date: Mon Apr 20 21:42:56 2020 +0300 fanotify_mark.2: Clarification about FAN_MARK_MOUNT and FAN_REPORT_FID It is not true that FAN_MARK_MOUNT cannot be used with a group that was initialized with flag FAN_REPORT_FID. ... IOW, no FAN_CREATE, FAN_DELETE, FAN_MOVE The technical reason for that is Al's objection to pass the mnt context into vfs helpers (and then fsnotify hooks). > I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s > capable requirement. That's imho the cleanest semantics for this, i.e. > I'd drop: > https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763 > This is neither an urgent use-case nor am I feeling very comfortable > with it. > The purpose of this commit is to provide FAN_CREATE, FAN_DELETE FAN_MOVE events filtered by (an idmapped) mount. I don't like it so much myself, but I have not had any better idea how to achieve that goal so far. There is another way though. We can create a new set of hooks outside the vfs helpers that do have the mnt context. I have already created such a set for another POC [1]. In this POC I introduced three new events FS_MODIFY_INTENT, FS_NAME_INTENT, FS_MOVE_INTENT, which I had no plans of exposing to fanotify. Nor did I need the granularity of CREATE, DELETE, RENAME event types (all collapsed into NAME_INTENT). But if we hit a dead end, we can resort to this strategy. Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify
On Wed, Mar 24, 2021 at 07:07:17PM +0200, Amir Goldstein wrote: > > > Well there is another way. > > > It is demonstrated in my demo with intoifywatch --fanotify --recursive. > > > It involved userspace iterating a subtree of interest to create fid->path > > > map. > > > > Ok, so this seems to be > > > > inotifytools_filename_from_fid() > > -> if (fanotify_mark_type != FAN_MARK_FILESYSTEM) > > watch_from_fid() > > -> read_path_from(/proc/self/fd/w->dirfd) > > > > Yes. > > > > > > > The fanotify recursive watch is similar but not exactly the same as the > > > old intoify recursive watch, because with inotify recursive watch you > > > can miss events. > > > > > > With fanotify recursive watch, the listener (if capable) can setup a > > > filesystem mark so events will not be missed. They will be recorded > > > by fid with an unknown path and the path information can be found later > > > by the crawler and updated in the map before the final report. > > > > > > Events on fid that were not found by the crawler need not be reported. > > > That's essentially a subtree watch for the poor implemented in userspace. > > > > This is already a good improvement. > > Honestly, having FAN_MARK_INODE workable unprivileged is already pretty > > I'm not so sure why you say that, because unprivileged FAN_MARK_INODE > watches are pretty close in functionality to inotify. > There are only subtle differences. Simply because until now we couldn't use fanotify in any way because of the capable() restriction. > > > great. In addition having FAN_MARK_MOUNT workable with idmapped mounts > > will likely get us what most users care about, afaict that is the POC > > in: > > https://github.com/amir73il/linux/commit/f0d5d462c5baeb82a658944c6df80704434f09a1 > > > > Hmm, the problem is the limited set of events you can get from > FAN_MARK_MOUNT which does not include FAN_CREATE. Yes, that's what I gathered from perusing the code. > > > (I'm reading the source correctly that FAN_MARK_MOUNT works with > > FAN_REPORT_FID as long as no inode event set in FANOTIFY_INODE_EVENTS is > > set? I'm asking because my manpage - probably too old - seems to imply > > that FAN_REPORT_FID can't be used with FAN_MARK_MOUNT although I might > > just be stumbling over the phrasing.) > > > > commit d71c9b4a5c6fbc7164007b52dba1de410d018292 > Author: Amir Goldstein <amir73il@gmail.com> > Date: Mon Apr 20 21:42:56 2020 +0300 > > fanotify_mark.2: Clarification about FAN_MARK_MOUNT and FAN_REPORT_FID > > It is not true that FAN_MARK_MOUNT cannot be used with a group > that was initialized with flag FAN_REPORT_FID. Btw, I don't see FAN_MARK_INODE in the man2 in the upstream repository. I know it's essentially the 0 or default value but it would still be worthwhile to mention it in the manpage. > ... > > IOW, no FAN_CREATE, FAN_DELETE, FAN_MOVE > > The technical reason for that is Al's objection to pass the mnt context > into vfs helpers (and then fsnotify hooks). Ah yes, I remember that. > > > I think FAN_MARK_FILESYSTEM should simply stay under the s_userns_s > > capable requirement. That's imho the cleanest semantics for this, i.e. > > I'd drop: > > https://github.com/amir73il/linux/commit/bd20e273f3c3a650805b3da32e493f01cc2a4763 > > This is neither an urgent use-case nor am I feeling very comfortable > > with it. > > > > The purpose of this commit is to provide FAN_CREATE, FAN_DELETE > FAN_MOVE events filtered by (an idmapped) mount. I see. I wanted to write a few words about the use-case you mention and the need for subtree watches for this particular case: "A common use case for of a filesystem subtree is a bind mount, not on the root of the filesystem, such as the bind mounts used for containers." Which presumably means you want to point fanotify to the rootfs mount of the container and have it watch everything under that rootfs including any submounts in there. While this certainly might be useful I'm not sure it's a very common use-case or really necessary to support. Assuming a full system-like container setup like runC, systemd-nspawn, LXD, and similar runtimes do it we end up with a few additonal mounts. They are usually performed in the containers mount+user namespace. The standard ones, i.e. the ones most container runtimes setup are: - sysfs - cgroupfs - procfs - mqueue - devpts - tmpfs on /dev (as a substitute for devtmpfs not being namespaced) - tmpfs on /dev/shm - bind mounts of a few host device files into /dev: - /dev/fuse - /dev/net/tun - /dev/full - /dev/null - /dev/random - /dev/tty - /dev/urandom - /dev/zero - /dev/console I think most of these mounts aren't very interesting to monitor. It's in general not very common to monitor full pseudo fileystems such as proc, sysfs, or devpts afaik. Notably, systemd - both outside and inside containers - will use some inotify watches but it's always on specific directories and never across mount borders. In addition to the default mounts above - I've mentioned this before - the container manager might choose to inject mounts into a running container (to share data or whatever). There are different strategies on how to do this. I1. The first strategy is to inject the mount into the container in such a way that the container has full control over it, i.e. it can unmount and remount (with the restriction that some flags might become locked when moving the mount across user namespaces). I2. The second strategy is to inject the mount in such a way that the container doesn't have control over it, i.e. the container can't umount or remount. Most container runtimes will support I1. but systemd-nspawn uses I2. There might be use-cases where the container manager would like to watch those mounts too. But then the container manager will just call fanotify_mark() and add that mount to the list of watched mounts when injecting it. I get that there are other use-cases that make subtree watches very interesting but I don't think the container use-case is a particularly pressing one. > I don't like it so much myself, but I have not had any better idea how to > achieve that goal so far. The limitations of FAN_MARK_MOUNT as I now understand them are indeed unpleasant. If we could get FAN_MARK_MOUNT with the same event support as FAN_MARK_INODE that would be great. I think the delegation model that makes sense to me is to allow FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of course ns_capable() in the userns they called fanotify_init() in. That feels ok and supportable. But I don't think anything beyond that like the sb filter patch that you showed is the right approach. Christian
On Wed 24-03-21 17:50:52, Amir Goldstein wrote: > > > So I have implemented this idea on fanotify_userns branch and the cost > > > per "filtered" sb mark is quite low - its a pretty cheap check in > > > send_to_group() > > > But still, if an unbound number of users can add to the sb mark list it is > > > not going to end well. > > > > Thinking out loud: So what is the cost going to be for the side generating > > events? Ideally it would of O(number of fanotify groups receiving event). > > We cannot get better than that and if the constants aren't too big I think > > this is acceptable overhead. Sure this can mean total work of (number of > > users) * (max number of subtree marks per user) for queueing notification > > event but I don't think it is practical for a DoS attack and I also don't > > think that in practice users will be watching overlapping subtrees that > > much. > > > > Why overlapping? > My concern is not so much from DoS attacks. > My concern is more from innocent users adding unacceptable > accumulated overhead. > > Think of a filesystem mounted at /home/ with 100K directories at > /home/user$N, every user gets its own idmapped mount from > systemd-homed and may or may not choose to run a listener to > get events generated under its own home dir (which is an idmapped > mount). Even if we limit one sb mask per user, we can still have 100K > marks list in sb. I see but then you'd have to have 100K users using the same filesystem on the server at the same time? Which doesn't look likely to me? I'd presume the home dir is watched only if the user is actually running something on that machine... So I'm not sure how realistic this example is. But yes, maybe we need some more efficient algorithm for selecting which subtree watch is actually relevant for an event. > For this reason I think we need to limit the number of marks per sb. > The simple way is a global config like max_queued_events, but I think > we can do better than that. Adding a global limit on number of sb marks is OK but still I'd like the system to scale reasonably with say tens to hundreds watches... > > The question is whether we can get that fast. Probably not because that > > would have to attach subtree watches to directory inodes or otherwise > > filter out unrelated fanotify groups in O(1). But the complexity of > > O(number of groups receiving events + depth of dir where event is happening) > > is probably achievable - we'd walk the tree up and have roots of watched > > subtrees marked. What do you think? > > I am for that. I already posted a POC along those lines [1]. > I was just not sure how to limit the potential accumulated overhead. > > [1] https://github.com/amir73il/linux/commits/fanotify_subtree_mark Yes, but AFAICT your solution was O((number of subtree marks on sb) * depth of dir) while I'm speaking about O(number of *groups* on sb + depth of dir). Which is significantly less and will require more careful setup to achieve such complexity (like placing special marks in lists of watches for directories that are roots of watched subtrees and checking such lists when walking up the tree). > > Also there is a somewhat related question what is the semantics of subtree > > watches in presence of mounts - do subtree watches "see through" mount > > points? Probably not but then with bind mounts this can be sometimes > > inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and > > I'm watching subtree of /var, I would not get events for what's in > > /var/tmp... Which is logical if you spell it out like this but applications > > often don't care how the mount hierarchy looks like, they just care about > > locally visible directory structure. > > Those are hard questions. I think that userns/mountns developers needed > to address them a while ago and I think there are some helpers that help > with checking visibility of paths. > > > > <hand waving> > > > I think what we need here (thinking out loud) is to account the sb marks > > > to the user that mounted the filesystem or to the user mapped to admin using > > > idmapped mount, maybe to both(?), probably using a separate ucount entry > > > (e.g. max_fanotify_filesystem_marks). > > > > I'm somewhat lost here. Are these two users different? We have /home/foo > > which is a mounted filesystem. AFAIU it will be mounted in a special user > > namespace for user 'foo' - let's call is 'foo-ns'. /home/foo has idmapping > > attached so system [ug]ids and non-trivially mapped to on-disk [ug]ids. Now > > we have a user - let's call it 'foo-usr' that has enough capabilities > > (whatever they are) in 'foo-ns' to place fanotify subtree marks in > > /home/foo. So these marks are naturally accounted towards 'foo-usr'. To > > whom else you'd like to also account these marks and why? > > > > I would like the system admin to be able to limit 100 sb marks on /home > (filtered or not) because that impacts the send_to_group iteration. OK, so per-sb limitation of sb mark number... > I would also like systemd to be able to grant a smaller quota of filtered > sb marks per user when creating and mapping the idmapped mounts > at /home/foo$N ... and a ucount to go with it? > I *think* we can achieve that, by accounting the sb marks to uid 0 > (who mounted /home) in ucounts entry "fanotify_sb_marks". But a superblock can be mounted in multiple places, in multiple user namespaces, potentially by different users (think of nested containers)? So if we want a per-sb limit on sb marks, I think that accounting those per user won't really achieve that? > If /home would have been a FS_USERNS_MOUNT mounted inside > some userns, then all its sb marks would be accounted to uid 0 of > that userns. Honza
> > I would like the system admin to be able to limit 100 sb marks on /home > > (filtered or not) because that impacts the send_to_group iteration. > > OK, so per-sb limitation of sb mark number... > > > I would also like systemd to be able to grant a smaller quota of filtered > > sb marks per user when creating and mapping the idmapped mounts > > at /home/foo$N > > ... and a ucount to go with it? > > > I *think* we can achieve that, by accounting the sb marks to uid 0 > > (who mounted /home) in ucounts entry "fanotify_sb_marks". > > But a superblock can be mounted in multiple places, in multiple user > namespaces, potentially by different users (think of nested containers)? So > if we want a per-sb limit on sb marks, I think that accounting those per > user won't really achieve that? > I agree. It won't. We can start with the global max_fanotify_sb_marks. I do not have an idea how to make that workable using ucounts. Thanks, Amir.
> I get that there are other use-cases that make subtree watches very > interesting but I don't think the container use-case is a particularly > pressing one. > That's what I thought. Containers are usually "contained" by a mount and possibly by userns, so it makes more sense and it would be more efficient to filter by those contexts. > > I don't like it so much myself, but I have not had any better idea how to > > achieve that goal so far. > > The limitations of FAN_MARK_MOUNT as I now understand them are indeed > unpleasant. If we could get FAN_MARK_MOUNT with the same event support > as FAN_MARK_INODE that would be great. > I think the delegation model that makes sense to me is to allow > FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of > course ns_capable() in the userns they called fanotify_init() in. That > feels ok and supportable. I present to you a demo [1][2] of FAN_MARK_MOUNT on idmapped mount that: 1. Can subscribe and receive FAN_LINK (new) events 2. Is capable of open_by_handle() if fid is under mount root FAN_LINK (temp name) is an event that I wanted to add anyway [3] and AFAIK it's the only event that you really need in order to detect when a dir was created for the use case of injecting a bind mount into a container. The kernel branch [1] intentionally excludes the controversial patch that added support for userns filtered sb marks. Therefore, trying to run the demo script as is on an idmapped mount inside userns will auto-detect UID 0, try to setup an sb mark and fail. Instead, the demo script should be run as follows to combine a mount mark and recursive inode marks: ./test_demo.sh <idmapped-mount-path> 1 For example: ~# ./test_demo.sh /vdf 1 + WD=/vdf + ID=1 ... + inotifywatch --fanotify --recursive -w -e link --timeout -2 /vdf Establishing watches... ... + mkdir -p a/dir0 a/dir1 a/dir2/subdir2 + touch a/dir2/file2 ... [fid=ad91a2b8.81a99d43.3000081;name='dir2'] /vdf/a/dir2 [fid=ad91a2b8.81a99d43.8a;name='.'] /vdf/a/dir2/. [fid=ad91a2b8.81a99d43.10000a6;name='.'] /vdf/a/dir2/subdir2/. [fid=ad91a2b8.81a99d43.8a;name='file2'] /vdf/a/dir2/file2 ... total modify .................................. create link delete filename 1 0 0 0 0 0 0 1 0 0 /vdf/a/dir2 1 0 0 0 0 0 0 0 1 0 /vdf/a/dir2/. 1 0 0 0 0 0 0 0 1 0 /vdf/a/dir2/subdir2/. 1 0 0 0 0 0 0 0 1 0 /vdf/a/dir2/file2 Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fanotify_link [2] https://github.com/amir73il/inotify-tools/commits/fanotify_link [3] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhEsbfA5+sW4XPnUKgCkXtwoDA-BR3iRO34Nx5c4y7Nug@mail.gmail.com/
> > The limitations of FAN_MARK_MOUNT as I now understand them are indeed > > unpleasant. If we could get FAN_MARK_MOUNT with the same event support > > as FAN_MARK_INODE that would be great. > > I think the delegation model that makes sense to me is to allow > > FAN_MARK_MOUNT when the caller is ns_capable(mnt->mnt_userns) and of > > course ns_capable() in the userns they called fanotify_init() in. That > > feels ok and supportable. > > I present to you a demo [1][2] of FAN_MARK_MOUNT on idmapped mount that: > > 1. Can subscribe and receive FAN_LINK (new) events > 2. Is capable of open_by_handle() if fid is under mount root > > FAN_LINK (temp name) is an event that I wanted to add anyway [3] and > AFAIK it's the only event that you really need in order to detect when a dir > was created for the use case of injecting a bind mount into a container. Scratch that part about the new event. I found a way to make FAN_CREATE available for FAN_MARK_MOUNT. Will post an RFC patch. Same demo instructions. Different branches [1][2]: > > The kernel branch [1] intentionally excludes the controversial patch that > added support for userns filtered sb marks. > > Therefore, trying to run the demo script as is on an idmapped mount > inside userns will auto-detect UID 0, try to setup an sb mark and fail. > > Instead, the demo script should be run as follows to combine a > mount mark and recursive inode marks: > > ./test_demo.sh <idmapped-mount-path> 1 > > For example: ~# ./test_demo.sh /vdf 1 + WD=/vdf + ID=1 ... + inotifywatch --fanotify --recursive -w --timeout -2 /vdf Establishing watches... ... + mkdir -p a/dir0 a/dir1 a/dir2/A/B/C/ + touch a/dir2/A/B/C/file2 ... [fid=94847cf7.d74a50ab.30000c2;name='dir2'] /mnt/a/dir2 Adding recursive watches on directory '/mnt/a/dir2/'... [fid=94847cf7.d74a50ab.87;name='A'] /mnt/a/dir2/A Adding recursive watches on directory '/mnt/a/dir2/A/'... [fid=94847cf7.d74a50ab.1000087;name='B'] /mnt/a/dir2/A/B Adding recursive watches on directory '/mnt/a/dir2/A/B/'... [fid=94847cf7.d74a50ab.20073e5;name='C'] /mnt/a/dir2/A/B/C Adding recursive watches on directory '/mnt/a/dir2/A/B/C/'... [fid=94847cf7.d74a50ab.30000c9;name='file2'] /mnt/a/dir2/A/B/C/file2 Hope that helps. Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fanotify_userns [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns