Message ID | 20200604200413.587896-1-gladkov.alexey@gmail.com (mailing list archive) |
---|---|
Series | proc: use subset option to hide some top-level procfs entries |
Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> Greetings!
>
> Preface
> -------
> This patch set can be applied over:
>
> git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788

I am not going to seriously look at this for merging until after the
merge window closes.

Have you thought about the possibility of relaxing the permission checks
to mount proc such that we don't need to verify there is an existing
mount of proc? With just the subset pids I think this is feasible. It
might not be worth it at this point, but it is definitely worth asking
the question, as one of the benefits early proponents of the idea of a
subset of proc touted was that they would not be as restricted as they
are with today's proc.

I ask because this has a bearing on the other options you are playing
with.

Do we want to find a way to have the benefit of relaxed permission
checks while still including a few more files?

> Overview
> --------
> Directories and files can be created and deleted by dynamically loaded modules.
> Not all of these files are virtualized and safe inside the container.
>
> However, subset=pid is not enough because many containers want to have
> /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> files per procfs mountpoint.

Is it desirable to have meminfo and cpuinfo as they are today, or do
people want them to reflect the ``container'' context, so that
applications like the JVM don't allocate too many cpus, don't try
to consume too much memory, and don't run on nodes that cgroups
currently make unavailable?

Are there any users or planned users of this functionality yet?

I am concerned that you might be adding functionality that no one will
ever use, that will just add code to the kernel that no one cares about,
and that will then accumulate bugs. Having had to work through a few of
those cases to make each mount of proc have its own super block, I am
not a great fan of adding another one.

If the runc, lxc and other container runtime folks can productively use
such an option to do useful things, and they are sensible things to do, I
don't have any fundamental objection. But I do want to be certain this
is a feature that is going to be used.

Eric

> Introduced changes
> ------------------
> Allow to specify the names of files and directories in the subset= parameter and
> thereby make a whitelist of top-level permitted names.
>
>
> Alexey Gladkov (2):
>   proc: use subset option to hide some top-level procfs entries
>   docs: proc: update documentation about subset= parameter
>
>  Documentation/filesystems/proc.rst |  6 +++
>  fs/proc/base.c                     | 15 +++++-
>  fs/proc/generic.c                  | 75 +++++++++++++++++++++------
>  fs/proc/inode.c                    | 18 ++++---
>  fs/proc/internal.h                 | 12 +++++
>  fs/proc/root.c                     | 81 ++++++++++++++++++++++++------
>  include/linux/proc_fs.h            | 11 ++--
>  7 files changed, 175 insertions(+), 43 deletions(-)

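The allowlist idea in the cover letter can be modelled in a few lines. The following is only a userspace sketch in Python, not the kernel implementation: the function name `visible_entries` and its arguments are hypothetical, and it assumes that per-process directories plus the `self` and `thread-self` symlinks stay visible, as they do with the existing subset=pid option.

```python
def visible_entries(entries, allowed):
    """Model a per-mount allowlist of top-level procfs names.

    entries: names present at the top level of a procfs mount.
    allowed: extra names permitted for this mount (hypothetical input;
             the real option syntax is defined by the patch, not here).
    """
    allowed = set(allowed)
    always = {"self", "thread-self"}  # assumption: kept as with subset=pid
    visible = []
    for name in entries:
        # Numeric names are per-process directories and are never hidden.
        if name.isdigit() or name in always or name in allowed:
            visible.append(name)
    return visible


if __name__ == "__main__":
    top_level = ["1", "42", "self", "thread-self", "meminfo", "cpuinfo", "kcore"]
    print(visible_entries(top_level, ["meminfo", "cpuinfo"]))
```

With an empty allowlist the model degenerates to the subset=pid view; each extra name re-exposes exactly one top-level entry.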
On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>
> I am not going to seriously look at this for merging until after the
> merge window closes.
>
> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc? With just the subset pids I think this is feasible. It
> might not be worth it at this point, but it is definitely worth asking
> the question, as one of the benefits early proponents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.
>
> I ask because this has a bearing on the other options you are playing
> with.
>
> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files?
>
> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers want to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
>
> Is it desirable to have meminfo and cpuinfo as they are today, or do
> people want them to reflect the ``container'' context, so that
> applications like the JVM don't allocate too many cpus, don't try
> to consume too much memory, and don't run on nodes that cgroups
> currently make unavailable?
>
> Are there any users or planned users of this functionality yet?
>
> I am concerned that you might be adding functionality that no one will
> ever use, that will just add code to the kernel that no one cares about,
> and that will then accumulate bugs. Having had to work through a few of
> those cases to make each mount of proc have its own super block, I am
> not a great fan of adding another one.
>
> If the runc, lxc and other container runtime folks can productively use
> such an option to do useful things, and they are sensible things to do, I
> don't have any fundamental objection. But I do want to be certain this
> is a feature that is going to be used.

I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
I haven't had time to look at this patchset).

In any case, we are currently virtualizing:
/proc/cpuinfo
/proc/diskstats
/proc/loadavg
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
for each container with a tiny in-userspace filesystem LXCFS
( https://github.com/lxc/lxcfs )
and have been doing that for years.

Having meminfo and cpuinfo virtualized in procfs was something we have
been wanting for a long time, and there have been patches by other people
(from Siteground, I believe) to achieve this a few years back, but they
were disregarded.

I think meminfo and cpuinfo would already be great. And if we're
virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
/proc/stat. It would also be great to virtualize /proc/uptime. Right now
we're achieving this essentially by subtracting the time the init
process of the pid namespace has started since system boot time, minus
the time when the system started, to get the actual reaper age (it's a
bit more involved but that's the gist).

This is all on the topic list for this year's virtual containers
microconference at Plumbers and I would suggest we try to discuss the
various requirements for something like this there. (I'm about to send
the CFP out.)

Christian

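The uptime calculation Christian describes can be sketched roughly as below. This is not LXCFS code, only a minimal Python illustration of the gist: it assumes the container's age is the host uptime minus the start time of the namespace's init process, taken from field 22 (starttime, in clock ticks since boot) of that process's /proc/<pid>/stat. The function name and the default tick rate of 100 are assumptions; on a real system the tick rate would come from `os.sysconf("SC_CLK_TCK")`.

```python
def container_uptime(host_uptime_text, init_stat_text, ticks_per_second=100):
    """Approximate a container's uptime in seconds.

    host_uptime_text: contents of the host's /proc/uptime.
    init_stat_text:   contents of /proc/<pid>/stat for the container's init.
    """
    # /proc/uptime holds "<uptime seconds> <idle seconds>".
    host_uptime = float(host_uptime_text.split()[0])

    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # on the last ')' before tokenising the remaining fields.
    after_comm = init_stat_text.rsplit(")", 1)[1].split()

    # after_comm[0] is field 3 (state), so field 22 (starttime) is index 19.
    start_ticks = int(after_comm[19])

    return host_uptime - start_ticks / ticks_per_second


if __name__ == "__main__":
    stat_line = "1 (init) S " + " ".join(["0"] * 18) + " 50000 0 0"
    print(container_uptime("1000.50 3000.00", stat_line))
```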
On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>
> I am not going to seriously look at this for merging until after the
> merge window closes.

OK. I'll wait.

> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc? With just the subset pids I think this is feasible. It
> might not be worth it at this point, but it is definitely worth asking
> the question, as one of the benefits early proponents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.

I'm not sure I follow.

What do you mean by the possibility of relaxing the permission checks to
mount proc?

Do you suggest allowing a user to mount procfs with the hidepid=2,subset=pid
options? If so, then this is an interesting idea.

> I ask because this has a bearing on the other options you are playing
> with.

I cannot agree with this because I do not touch the other options.
hidepid and subset=pid have no relation to the visibility of regular
files. On the other hand, in procfs there is absolutely no way to restrict
access other than selinux.

> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files?

In fact, I see no problem with allowing the user to mount procfs with the
hidepid=2,subset=pid options.

We could add subset=self, which would allow not only the pids subset but
also the other symlinks that lead to self (/proc/net, /proc/mounts) and,
if we ever add virtualization to them, meminfo, cpuinfo etc.

> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers want to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
>
> Is it desirable to have meminfo and cpuinfo as they are today, or do
> people want them to reflect the ``container'' context, so that
> applications like the JVM don't allocate too many cpus, don't try
> to consume too much memory, and don't run on nodes that cgroups
> currently make unavailable?

Of course, it would be better if these files took into account the
limitations of cgroups or some kind of ``containerized'' context.

> Are there any users or planned users of this functionality yet?

I know that java uses meminfo for sure.

The purpose of this patch is to isolate the container from unwanted files
in procfs.

> I am concerned that you might be adding functionality that no one will
> ever use, that will just add code to the kernel that no one cares about,
> and that will then accumulate bugs. Having had to work through a few of
> those cases to make each mount of proc have its own super block, I am
> not a great fan of adding another one.
>
> If the runc, lxc and other container runtime folks can productively use
> such an option to do useful things, and they are sensible things to do, I
> don't have any fundamental objection. But I do want to be certain this
> is a feature that is going to be used.

Ok, here is an example of how docker or runc (actually almost all
golang-based container systems) tries to block access to something in
procfs:

$ docker run -it --rm busybox
# mount |grep /proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

For now I'm just trying to create a better way to restrict access in
procfs than this, since procfs is used in containers.

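To make the over-mounting pattern in the listing above easier to see, here is a small Python sketch that extracts every mount placed on top of a path below /proc from `mount`-style output. The function name `masked_proc_paths` is hypothetical, and the parser assumes the "<source> on <target> type <fstype> (<options>)" line shape shown in the listing, with no spaces in the target path.

```python
def masked_proc_paths(mount_output):
    """Return (target, fstype, read_only) for each mount below /proc."""
    masked = []
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: "<source> on <target> type <fstype> (<options>)"
        if len(parts) < 6 or parts[1] != "on" or parts[3] != "type":
            continue
        target, fstype = parts[2], parts[4]
        options = parts[5].strip("()").split(",")
        # Skip the procfs mount itself; keep only what is stacked on top of it.
        if target.startswith("/proc/"):
            masked.append((target, fstype, "ro" in options))
    return masked


if __name__ == "__main__":
    sample = (
        "proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
        "proc on /proc/sys type proc (ro,relatime)\n"
        "tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)\n"
    )
    for entry in masked_proc_paths(sample):
        print(entry)
```

Each returned tuple corresponds to one extra mount that the runtime has to create and maintain per container, which is the overhead a per-mount visibility option would aim to avoid.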
On Thu, Jun 04, 2020 at 11:32:20PM +0200, Christian Brauner wrote:
> > Is it desirable to have meminfo and cpuinfo as they are today, or do
> > people want them to reflect the ``container'' context, so that
> > applications like the JVM don't allocate too many cpus, don't try
> > to consume too much memory, and don't run on nodes that cgroups
> > currently make unavailable?
> >
> > Are there any users or planned users of this functionality yet?
> >
> > I am concerned that you might be adding functionality that no one will
> > ever use, that will just add code to the kernel that no one cares about,
> > and that will then accumulate bugs. Having had to work through a few of
> > those cases to make each mount of proc have its own super block, I am
> > not a great fan of adding another one.
> >
> > If the runc, lxc and other container runtime folks can productively use
> > such an option to do useful things, and they are sensible things to do, I
> > don't have any fundamental objection. But I do want to be certain this
> > is a feature that is going to be used.
>
> I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
> I haven't had time to look at this patchset).

No. Not yet :) I am just suggesting a way to restrict access, inside a
container, to procfs files about which you know nothing.

> In any case, we are currently virtualizing:
> /proc/cpuinfo
> /proc/diskstats
> /proc/loadavg
> /proc/meminfo
> /proc/stat
> /proc/swaps
> /proc/uptime
> for each container with a tiny in-userspace filesystem LXCFS
> ( https://github.com/lxc/lxcfs )
> and have been doing that for years.

I know about it. The reason for the appearance of such a solution is also
clear.

> Having meminfo and cpuinfo virtualized in procfs was something we have
> been wanting for a long time, and there have been patches by other people
> (from Siteground, I believe) to achieve this a few years back, but they
> were disregarded.
>
> I think meminfo and cpuinfo would already be great. And if we're
> virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
> /proc/stat. It would also be great to virtualize /proc/uptime. Right now
> we're achieving this essentially by subtracting the time the init
> process of the pid namespace has started since system boot time, minus
> the time when the system started, to get the actual reaper age (it's a
> bit more involved but that's the gist).
>
> This is all on the topic list for this year's virtual containers
> microconference at Plumbers and I would suggest we try to discuss the
> various requirements for something like this there. (I'm about to send
> the CFP out.)
>
> Christian
>

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
>> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>>
>> > Greetings!
>> >
>> > Preface
>> > -------
>> > This patch set can be applied over:
>> >
>> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>>
>> I am not going to seriously look at this for merging until after the
>> merge window closes.
>
> OK. I'll wait.

That will mean your patches can be based on -rc1.

>> Have you thought about the possibility of relaxing the permission checks
>> to mount proc such that we don't need to verify there is an existing
>> mount of proc? With just the subset pids I think this is feasible. It
>> might not be worth it at this point, but it is definitely worth asking
>> the question, as one of the benefits early proponents of the idea of a
>> subset of proc touted was that they would not be as restricted as they
>> are with today's proc.
>
> I'm not sure I follow.
>
> What do you mean by the possibility of relaxing the permission checks to
> mount proc?
>
> Do you suggest allowing a user to mount procfs with the hidepid=2,subset=pid
> options? If so, then this is an interesting idea.

The key part would be subset=pid. You would still need to be root in
your user namespace, and mount namespace. You would not need to have a
separate copy of proc with nothing hidden already mounted.

>> I ask because this has a bearing on the other options you are playing
>> with.
>
> I cannot agree with this because I do not touch the other options.
> hidepid and subset=pid have no relation to the visibility of regular
> files. On the other hand, in procfs there is absolutely no way to restrict
> access other than selinux.

Untrue. At a practical level the user namespace greatly restricts
access to proc because many of the non-process files are limited to
global root only.

>> Do we want to find a way to have the benefit of relaxed permission
>> checks while still including a few more files?
>
> In fact, I see no problem with allowing the user to mount procfs with the
> hidepid=2,subset=pid options.
>
> We could add subset=self, which would allow not only the pids subset but
> also the other symlinks that lead to self (/proc/net, /proc/mounts) and,
> if we ever add virtualization to them, meminfo, cpuinfo etc.
>
>> > Overview
>> > --------
>> > Directories and files can be created and deleted by dynamically loaded modules.
>> > Not all of these files are virtualized and safe inside the container.
>> >
>> > However, subset=pid is not enough because many containers want to have
>> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
>> > files per procfs mountpoint.
>>
>> Is it desirable to have meminfo and cpuinfo as they are today, or do
>> people want them to reflect the ``container'' context, so that
>> applications like the JVM don't allocate too many cpus, don't try
>> to consume too much memory, and don't run on nodes that cgroups
>> currently make unavailable?
>
> Of course, it would be better if these files took into account the
> limitations of cgroups or some kind of ``containerized'' context.
>
>> Are there any users or planned users of this functionality yet?
>
> I know that java uses meminfo for sure.
>
> The purpose of this patch is to isolate the container from unwanted files
> in procfs.

If what we want is the ability not to use the original but to have
a modified version of these files, we probably want empty files that
serve as mount points.

Or possibly a version of these files that takes into account
restrictions. In either event we need to do the research through real
programs and real kernel options to see what is our best option for
exporting the limitations that programs have, and to decide on the
long-term API for that.

If we research things and we decide the best way to let java know of
its limitations is to change /proc/meminfo, that needs to be a change
that always applies to meminfo and is not controlled by options.

>> I am concerned that you might be adding functionality that no one will
>> ever use, that will just add code to the kernel that no one cares about,
>> and that will then accumulate bugs. Having had to work through a few of
>> those cases to make each mount of proc have its own super block, I am
>> not a great fan of adding another one.
>>
>> If the runc, lxc and other container runtime folks can productively use
>> such an option to do useful things, and they are sensible things to do, I
>> don't have any fundamental objection. But I do want to be certain this
>> is a feature that is going to be used.
>
> Ok, here is an example of how docker or runc (actually almost all
> golang-based container systems) tries to block access to something in
> procfs:
>
> $ docker run -it --rm busybox
> # mount |grep /proc
> proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> proc on /proc/bus type proc (ro,relatime)
> proc on /proc/fs type proc (ro,relatime)
> proc on /proc/irq type proc (ro,relatime)
> proc on /proc/sys type proc (ro,relatime)
> proc on /proc/sysrq-trigger type proc (ro,relatime)
> tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
>
> For now I'm just trying to create a better way to restrict access in
> procfs than this, since procfs is used in containers.

Docker historically has been crap about having a sensible policy. The
problem is that Docker wanted to allow real root in a container and
somehow make it safe by blocking access to proc files and by dropping
capabilities.

Practically everything that Docker has done is much better and simpler
when done by restricting the processes to a user namespace, with a root
user whose uid is not the global root user.

Which is why I want us to make certain we are doing something that makes
sense, and is architecturally sound.

You have cleared the big hurdle and proc now has options that are
usable. I really appreciate that. I am not opposed to the general
direction you are going to find a way to make proc more usable. I just
want our next step to be solid.

Eric

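As a rough illustration of what "a version of these files that takes into account restrictions" would have to compute for memory, the Python sketch below combines the host's MemTotal with a cgroup limit. It is an assumption-laden model, not a proposal from the thread: it assumes a cgroup v2 `memory.max` value (either a byte count or the literal string "max") and a hypothetical helper name `effective_memory_bytes`.

```python
def effective_memory_bytes(meminfo_text, memory_max_text):
    """Return the memory a contained process could reasonably assume.

    meminfo_text:    contents of /proc/meminfo.
    memory_max_text: contents of the cgroup v2 memory.max file.
    """
    total_kb = None
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like "MemTotal:       16384000 kB".
            total_kb = int(line.split()[1])
            break
    if total_kb is None:
        raise ValueError("MemTotal not found in meminfo text")

    host_bytes = total_kb * 1024
    limit = memory_max_text.strip()
    if limit == "max":  # no cgroup limit configured
        return host_bytes
    return min(host_bytes, int(limit))


if __name__ == "__main__":
    meminfo = "MemTotal:       16384000 kB\nMemFree:         1024000 kB\n"
    print(effective_memory_bytes(meminfo, "536870912\n"))
```

Whether such a value belongs in /proc/meminfo itself or in a separate interface is exactly the open API question raised above.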
On Thu, Jun 04, 2020 at 11:17:38PM -0500, Eric W. Biederman wrote:
> >> I am not going to seriously look at this for merging until after the
> >> merge window closes.
> >
> > OK. I'll wait.
>
> That will mean your patches can be based on -rc1.

OK.

> > Do you suggest allowing a user to mount procfs with the hidepid=2,subset=pid
> > options? If so, then this is an interesting idea.
>
> The key part would be subset=pid. You would still need to be root in
> your user namespace, and mount namespace. You would not need to have a
> separate copy of proc with nothing hidden already mounted.

Can you tell me more about your idea? I thought I understood it, but it
seems my understanding is different.

I thought you were suggesting moving in the direction of allowing an
unprivileged user to mount procfs.

> > I cannot agree with this because I do not touch the other options.
> > hidepid and subset=pid have no relation to the visibility of regular
> > files. On the other hand, in procfs there is absolutely no way to restrict
> > access other than selinux.
>
> Untrue. At a practical level the user namespace greatly restricts
> access to proc because many of the non-process files are limited to
> global root only.

I am not worried about the files created in procfs by the kernel itself
because the permissions are set correctly and are checked correctly.

I worry about kernel modules, especially out-of-tree modules.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/gadget/function/rndis.c#n904

I certainly understand that 0660 is not 0666, but still.

> > I know that java uses meminfo for sure.
> >
> > The purpose of this patch is to isolate the container from unwanted files
> > in procfs.
>
> If what we want is the ability not to use the original but to have
> a modified version of these files, we probably want empty files that
> serve as mount points.
>
> Or possibly a version of these files that takes into account
> restrictions. In either event we need to do the research through real
> programs and real kernel options to see what is our best option for
> exporting the limitations that programs have, and to decide on the
> long-term API for that.

Yes, but that's a slightly different story. It would be great if all of
these files provided modified information. My patch is about those files
that we don't know about and which we don't want.

> If we research things and we decide the best way to let java know of
> its limitations is to change /proc/meminfo, that needs to be a change
> that always applies to meminfo and is not controlled by options.
>
> > For now I'm just trying to create a better way to restrict access in
> > procfs than this, since procfs is used in containers.
>
> Docker historically has been crap about having a sensible policy. The
> problem is that Docker wanted to allow real root in a container and
> somehow make it safe by blocking access to proc files and by dropping
> capabilities.
>
> Practically everything that Docker has done is much better and simpler
> when done by restricting the processes to a user namespace, with a root
> user whose uid is not the global root user.
>
> Which is why I want us to make certain we are doing something that makes
> sense, and is architecturally sound.

Ok. Then ignore this patchset.

> You have cleared the big hurdle and proc now has options that are
> usable. I really appreciate that. I am not opposed to the general
> direction you are going to find a way to make proc more usable. I just
> want our next step to be solid.
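The concern about module-created files with permissive modes can be checked from userspace. The Python sketch below lists regular files directly under a directory whose mode grants write access to group or others, so 0660 would be flagged as well as 0666. It is only an audit aid under stated assumptions: the function name `loosely_writable` is hypothetical, and it inspects a single directory level rather than walking the whole tree.

```python
import os
import stat


def loosely_writable(root):
    """List (name, octal mode) for regular files in root that are
    writable by group or others."""
    flagged = []
    with os.scandir(root) as entries:
        for entry in entries:
            # Do not follow symlinks such as /proc/self or /proc/mounts.
            mode = entry.stat(follow_symlinks=False).st_mode
            if stat.S_ISREG(mode) and mode & (stat.S_IWGRP | stat.S_IWOTH):
                flagged.append((entry.name, oct(stat.S_IMODE(mode))))
    return sorted(flagged)


if __name__ == "__main__":
    for name, mode in loosely_writable("/proc"):
        print(mode, name)
```

A mode check like this says nothing about what a write actually does, which is why per-mount hiding of unknown entries is being discussed as a separate mechanism.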