Message ID | 20181118111751.6142-1-christian@brauner.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | proc: allow killing processes via file descriptors | expand |
I had been led to believe that the proposal would be a comprehensive process API, not an ioctl basically equivalent to my previous patch. If you had a more comprehensive proposal, please just share it on LKML instead of limiting the discussion to those able to attend these various conferences. If there's some determined opposition to a general new process API, this opposition needs a fair and full airing, as not everyone can attend these conferences. On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > With this patch an open() call on /proc/<pid> will give userspace a handle > to struct pid of the process associated with /proc/<pid>. This allows to > maintain a stable handle on a process. > I have been discussing various approaches extensively during technical > conferences this year culminating in a long argument with Eric at Linux > Plumbers. The general consensus was that having a handle on a process > will be something that is very simple and easy to maintain ioctls are the opposite of "easy to maintain". Their file-descriptor-specific behavior makes it difficult to use the things safely. If you want to take this approach, please make a new system call. An ioctl is just a system call with a very strange spelling and unfortunate collision semantics. > with the > option of being extensible via a more advanced api if the need arises. The need *has* arisen; see my exithand patch. > I > believe that this patch is the most simple, dumb, and therefore > maintainable solution. > > The need for this has arisen in order to reliably kill a process without > running into issues of the pid being recycled as has been described in the > rejected patch [1]. That patch was not "rejected". It was tabled pending the more comprehensive process API proposal that was supposed to have emerged. This patch is just another variant of the sort of approach we discussed on that patch's thread here. As I mentioned on that thread, the right approach option is a new system call, not an ioctl. To fulfill the need described in that patchset a new > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > process via a file descriptor: > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > close(fd); > > Note, the stable handle will allow us to carefully extend this feature in > the future. We still need the ability to synchronously wait on a process's death, as in my patch set. I will be refreshing that patch set.
On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <dancol@google.com> wrote: > > I had been led to believe that the proposal would be a comprehensive > process API, not an ioctl basically equivalent to my previous patch. > If you had a more comprehensive proposal, please just share it on LKML > instead of limiting the discussion to those able to attend these > various conferences. If there's some determined opposition to a > general new process API, this opposition needs a fair and full airing, > as not everyone can attend these conferences. > > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > > With this patch an open() call on /proc/<pid> will give userspace a handle > > to struct pid of the process associated with /proc/<pid>. This allows to > > maintain a stable handle on a process. > > I have been discussing various approaches extensively during technical > > conferences this year culminating in a long argument with Eric at Linux > > Plumbers. The general consensus was that having a handle on a process > > will be something that is very simple and easy to maintain > > ioctls are the opposite of "easy to maintain". Their > file-descriptor-specific behavior makes it difficult to use the things > safely. If you want to take this approach, please make a new system > call. An ioctl is just a system call with a very strange spelling and > unfortunate collision semantics. > > > with the > > option of being extensible via a more advanced api if the need arises. > > The need *has* arisen; see my exithand patch. > > > I > > believe that this patch is the most simple, dumb, and therefore > > maintainable solution. > > > > The need for this has arisen in order to reliably kill a process without > > running into issues of the pid being recycled as has been described in the > > rejected patch [1]. > > That patch was not "rejected". It was tabled pending the more > comprehensive process API proposal that was supposed to have emerged. > This patch is just another variant of the sort of approach we > discussed on that patch's thread here. As I mentioned on that thread, > the right approach option is a new system call, not an ioctl. > > To fulfill the need described in that patchset a new > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > > process via a file descriptor: > > > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > > close(fd); > > > > Note, the stable handle will allow us to carefully extend this feature in > > the future. > > We still need the ability to synchronously wait on a process's death, > as in my patch set. I will be refreshing that patch set. I fully agree that a more comprehensive, less expensive API for managing processes would be nice. But I also think that this patch (using the directory fd and ioctl) is better from a security perspective than using a new file in /proc. I have an old patch to make proc directory fds pollable: https://lore.kernel.org/patchwork/patch/345098/ That patch plus the one in this thread might make a nice addition to the kernel even if we expect something much better to come along later. This patch that uses ioctl
On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: > I fully agree that a more comprehensive, less expensive API for > managing processes would be nice. But I also think that this patch > (using the directory fd and ioctl) is better from a security > perspective than using a new file in /proc. That's an assertion, not an argument. And I'm not opposed to an operation on the directory FD, now that it's clear Linus has banned "write(2)-as-a-command" APIs. I just insist that we implement the API with a system call instead of a less-reliable ioctl due to the inherent namespace collision issues in ioctl command names. > I have an old patch to make proc directory fds pollable: > > https://lore.kernel.org/patchwork/patch/345098/ > > That patch plus the one in this thread might make a nice addition to > the kernel even if we expect something much better to come along > later. I've always commented on that patch. You never addressed my technical objections. Why are you bringing up this patch again as if that discussion had never happened? To review, that patch has various race conditions, and even if it were technically correct, it'd be an abuse of directory objects (in what other circumstance do we poll directories?) and not logically generalizable to a model in which we expose process exit status via the exit-monitoring API.
On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > +static int proc_tgid_open(struct inode *inode, struct file *file) > +{ > + /* grab reference to struct pid */ > + file->private_data = get_pid(proc_pid(inode)); Why grab another reference to the struct pid when the inode already has one?
On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: > > On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: > > I fully agree that a more comprehensive, less expensive API for > > managing processes would be nice. But I also think that this patch > > (using the directory fd and ioctl) is better from a security > > perspective than using a new file in /proc. > > That's an assertion, not an argument. And I'm not opposed to an > operation on the directory FD, now that it's clear Linus has banned > "write(2)-as-a-command" APIs. I just insist that we implement the API > with a system call instead of a less-reliable ioctl due to the > inherent namespace collision issues in ioctl command names. Linus banned it because of bugs iike the ones in the patch. > > > I have an old patch to make proc directory fds pollable: > > > > https://lore.kernel.org/patchwork/patch/345098/ > > > > That patch plus the one in this thread might make a nice addition to > > the kernel even if we expect something much better to come along > > later. > > I've always commented on that patch. You never addressed my technical > objections. Why are you bringing up this patch again as if that > discussion had never happened? To review, that patch has various race > conditions I don't think I ever saw that review. > and even if it were technically correct, it'd be an abuse > of directory objects (in what other circumstance do we poll > directories?) and not logically generalizable to a model in which we > expose process exit status via the exit-monitoring API. I agree it's weird. It might be better to have /proc/PID/exit_status and make *that* pollable.
On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >> > I fully agree that a more comprehensive, less expensive API for >> > managing processes would be nice. But I also think that this patch >> > (using the directory fd and ioctl) is better from a security >> > perspective than using a new file in /proc. >> >> That's an assertion, not an argument. And I'm not opposed to an >> operation on the directory FD, now that it's clear Linus has banned >> "write(2)-as-a-command" APIs. I just insist that we implement the API >> with a system call instead of a less-reliable ioctl due to the >> inherent namespace collision issues in ioctl command names. > > Linus banned it because of bugs iike the ones in the patch. Maybe: he didn't provide a reason. What's your point? >> > I have an old patch to make proc directory fds pollable: >> > >> > https://lore.kernel.org/patchwork/patch/345098/ >> > >> > That patch plus the one in this thread might make a nice addition to >> > the kernel even if we expect something much better to come along >> > later. >> >> I've always commented on that patch. You never addressed my technical >> objections. Why are you bringing up this patch again as if that >> discussion had never happened? To review, that patch has various race >> conditions > > I don't think I ever saw that review. https://lore.kernel.org/lkml/CAKOZuevXXqwJepmLNUtrU=f8jtdgdKAzUAnAA2+KVcWoMxMyFg@mail.gmail.com/ >> and even if it were technically correct, it'd be an abuse >> of directory objects (in what other circumstance do we poll >> directories?) and not logically generalizable to a model in which we >> expose process exit status via the exit-monitoring API. > > I agree it's weird. It might be better to have /proc/PID/exit_status > and make *that* pollable. That's exactly what my patch provides. I feel like we're rehashing the previous discussion.
On 11/18/18 8:17 AM, Andy Lutomirski wrote: > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> I fully agree that a more comprehensive, less expensive API for >>> managing processes would be nice. But I also think that this patch >>> (using the directory fd and ioctl) is better from a security >>> perspective than using a new file in /proc. >> >> That's an assertion, not an argument. And I'm not opposed to an >> operation on the directory FD, now that it's clear Linus has banned >> "write(2)-as-a-command" APIs. I just insist that we implement the API >> with a system call instead of a less-reliable ioctl due to the >> inherent namespace collision issues in ioctl command names. > > Linus banned it because of bugs iike the ones in the patch. > >> >>> I have an old patch to make proc directory fds pollable: >>> >>> https://lore.kernel.org/patchwork/patch/345098/ >>> >>> That patch plus the one in this thread might make a nice addition to >>> the kernel even if we expect something much better to come along >>> later. >> >> I've always commented on that patch. You never addressed my technical >> objections. Why are you bringing up this patch again as if that >> discussion had never happened? To review, that patch has various race >> conditions > > I don't think I ever saw that review. > >> and even if it were technically correct, it'd be an abuse >> of directory objects (in what other circumstance do we poll >> directories?) and not logically generalizable to a model in which we >> expose process exit status via the exit-monitoring API. > > I agree it's weird. It might be better to have /proc/PID/exit_status > and make *that* pollable. > If there is a new exit_status file, it could even be more than 8 bits of exit status: See https://lore.kernel.org/lkml/alpine.LSU.2.20.1507091257010.9602@nerf40.vanv.qr/T/#u and http://austingroupbugs.net/view.php?id=594#c1317
On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <rdunlap@infradead.org> wrote: > On 11/18/18 8:17 AM, Andy Lutomirski wrote: >> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >>> >>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>> I fully agree that a more comprehensive, less expensive API for >>>> managing processes would be nice. But I also think that this patch >>>> (using the directory fd and ioctl) is better from a security >>>> perspective than using a new file in /proc. >>> >>> That's an assertion, not an argument. And I'm not opposed to an >>> operation on the directory FD, now that it's clear Linus has banned >>> "write(2)-as-a-command" APIs. I just insist that we implement the API >>> with a system call instead of a less-reliable ioctl due to the >>> inherent namespace collision issues in ioctl command names. >> >> Linus banned it because of bugs iike the ones in the patch. >> >>> >>>> I have an old patch to make proc directory fds pollable: >>>> >>>> https://lore.kernel.org/patchwork/patch/345098/ >>>> >>>> That patch plus the one in this thread might make a nice addition to >>>> the kernel even if we expect something much better to come along >>>> later. >>> >>> I've always commented on that patch. You never addressed my technical >>> objections. Why are you bringing up this patch again as if that >>> discussion had never happened? To review, that patch has various race >>> conditions >> >> I don't think I ever saw that review. >> >>> and even if it were technically correct, it'd be an abuse >>> of directory objects (in what other circumstance do we poll >>> directories?) and not logically generalizable to a model in which we >>> expose process exit status via the exit-monitoring API. >> >> I agree it's weird. It might be better to have /proc/PID/exit_status >> and make *that* pollable. >> > > If there is a new exit_status file, it could even be more than > 8 bits of exit status: > > See https://lore.kernel.org/lkml/alpine.LSU.2.20.1507091257010.9602@nerf40.vanv.qr/T/#u > and http://austingroupbugs.net/view.php?id=594#c1317 First of all, as I discussed in [1], we need to first figure out *who* should have access to the process exit information. FreeBSD appears to make it public without disaster, and if we make exit status public, a lot of problems just disappear. [1] https://lore.kernel.org/lkml/CAKOZueussVwABQaC+O9fW+MZayccvttKQZfWg0hh-cZ+1ykXig@mail.gmail.com/ Providing more than eight bits of status information is a separate discussion. I don't think it's necessary, since it breaks assumptions people make today without really adding much in the way of new capabilities. If we want to let processes die while leaving behind more information, let's let them leave behind an arbitrary-sized string or something instead of repeating the mistake of an integral exit status, but with more bits.
On Sun, Nov 18, 2018 at 8:49 AM Daniel Colascione <dancol@google.com> wrote: > > On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <rdunlap@infradead.org> wrote: > > On 11/18/18 8:17 AM, Andy Lutomirski wrote: > >> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: > >>> > >>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: > >>>> I fully agree that a more comprehensive, less expensive API for > >>>> managing processes would be nice. But I also think that this patch > >>>> (using the directory fd and ioctl) is better from a security > >>>> perspective than using a new file in /proc. > >>> > >>> That's an assertion, not an argument. And I'm not opposed to an > >>> operation on the directory FD, now that it's clear Linus has banned > >>> "write(2)-as-a-command" APIs. I just insist that we implement the API > >>> with a system call instead of a less-reliable ioctl due to the > >>> inherent namespace collision issues in ioctl command names. > >> > >> Linus banned it because of bugs iike the ones in the patch. > >> > >>> > >>>> I have an old patch to make proc directory fds pollable: > >>>> > >>>> https://lore.kernel.org/patchwork/patch/345098/ > >>>> > >>>> That patch plus the one in this thread might make a nice addition to > >>>> the kernel even if we expect something much better to come along > >>>> later. > >>> > >>> I've always commented on that patch. You never addressed my technical > >>> objections. Why are you bringing up this patch again as if that > >>> discussion had never happened? To review, that patch has various race > >>> conditions > >> > >> I don't think I ever saw that review. > >> > >>> and even if it were technically correct, it'd be an abuse > >>> of directory objects (in what other circumstance do we poll > >>> directories?) and not logically generalizable to a model in which we > >>> expose process exit status via the exit-monitoring API. > >> > >> I agree it's weird. It might be better to have /proc/PID/exit_status > >> and make *that* pollable. > >> > > > > If there is a new exit_status file, it could even be more than > > 8 bits of exit status: > > > > See https://lore.kernel.org/lkml/alpine.LSU.2.20.1507091257010.9602@nerf40.vanv.qr/T/#u > > and http://austingroupbugs.net/view.php?id=594#c1317 > > First of all, as I discussed in [1], we need to first figure out *who* > should have access to the process exit information. FreeBSD appears to > make it public without disaster, and if we make exit status public, a > lot of problems just disappear. I kind of want to go in the other direction of making a lot of process information (especially cmdline) less publicly accessible. In general, any kind of API where a process has an fd is tricky to do right on UNIXy systems because of SUID, SGID, and LSM transition rules. Windows has an easy time of it because it's always safe for a parent process to introspect the child. (Well, almost, because Windows gained their privilege elevation stuff. I'm not saying we shouldn't do it -- I'm just saying that it's nontrivial.
On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <dancol@google.com> wrote: > > On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: > > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: > >> > >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: > >> > I fully agree that a more comprehensive, less expensive API for > >> > managing processes would be nice. But I also think that this patch > >> > (using the directory fd and ioctl) is better from a security > >> > perspective than using a new file in /proc. > >> > >> That's an assertion, not an argument. And I'm not opposed to an > >> operation on the directory FD, now that it's clear Linus has banned > >> "write(2)-as-a-command" APIs. I just insist that we implement the API > >> with a system call instead of a less-reliable ioctl due to the > >> inherent namespace collision issues in ioctl command names. > > > > Linus banned it because of bugs iike the ones in the patch. > > Maybe: he didn't provide a reason. What's your point? My point is that an API that involves a file like /proc/PID/kill is very tricky to get right. Here are some considerations: 1. The .write handler for the file must not behave differently depending on current. So we can't check the caller's creds. (There are legacy things that are generally buggy that look at current's creds in write handlers. They're legacy, and they're almost universally at least a little bit buggy, and many have been exploitable.) 2. Even if we have ioctl() or a new syscall on /proc/PID/kill, we at least have to define whether *opening* kill checks any credentials. It probably shouldn't, since I don't see what the semantics should be. 3. The current Linux kill_pid stuff doesn't take a cred parameter, so it's not so easy to check f_cred even if we wanted to. So the simplest implementation using /proc/PID/kill would be for .open to do essentially nothing and for ioctl or a new syscall to check credentials as usual. But this seems to have no technical advantage over just using /proc/PID itself, and it's a good deal slower, as it requires an open/close cycle. Now if we had an ioctlat() API, maybe it would make sense. But we don't, and I think it would be a bit crazy to add one. --Andy
On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <dancol@google.com> wrote: >> >> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: >> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >> >> >> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >> >> > I fully agree that a more comprehensive, less expensive API for >> >> > managing processes would be nice. But I also think that this patch >> >> > (using the directory fd and ioctl) is better from a security >> >> > perspective than using a new file in /proc. >> >> >> >> That's an assertion, not an argument. And I'm not opposed to an >> >> operation on the directory FD, now that it's clear Linus has banned >> >> "write(2)-as-a-command" APIs. I just insist that we implement the API >> >> with a system call instead of a less-reliable ioctl due to the >> >> inherent namespace collision issues in ioctl command names. >> > >> > Linus banned it because of bugs iike the ones in the patch. >> >> Maybe: he didn't provide a reason. What's your point? > > My point is that an API that involves a file like /proc/PID/kill is > very tricky to get right. Here are some considerations: Moot. write(2) for this interface is off the table anyway. The right approach here is a system call that accepts a /proc/pid directory file descriptor, a signal number, and a signal information field (as in sigqueue(2)). > Now if we had an ioctlat() API, maybe it would make sense. But we > don't, and I think it would be a bit crazy to add one. A process is not a driver. Why won't this idea of using an ioctl for the kill-process-by-dfd thing just won't die? An ioctl has *zero* advantage in this context.
On Sun, Nov 18, 2018 at 9:09 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 8:49 AM Daniel Colascione <dancol@google.com> wrote: >> >> On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <rdunlap@infradead.org> wrote: >> > On 11/18/18 8:17 AM, Andy Lutomirski wrote: >> >> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >> >>> >> >>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >> >>>> I fully agree that a more comprehensive, less expensive API for >> >>>> managing processes would be nice. But I also think that this patch >> >>>> (using the directory fd and ioctl) is better from a security >> >>>> perspective than using a new file in /proc. >> >>> >> >>> That's an assertion, not an argument. And I'm not opposed to an >> >>> operation on the directory FD, now that it's clear Linus has banned >> >>> "write(2)-as-a-command" APIs. I just insist that we implement the API >> >>> with a system call instead of a less-reliable ioctl due to the >> >>> inherent namespace collision issues in ioctl command names. >> >> >> >> Linus banned it because of bugs iike the ones in the patch. >> >> >> >>> >> >>>> I have an old patch to make proc directory fds pollable: >> >>>> >> >>>> https://lore.kernel.org/patchwork/patch/345098/ >> >>>> >> >>>> That patch plus the one in this thread might make a nice addition to >> >>>> the kernel even if we expect something much better to come along >> >>>> later. >> >>> >> >>> I've always commented on that patch. You never addressed my technical >> >>> objections. Why are you bringing up this patch again as if that >> >>> discussion had never happened? To review, that patch has various race >> >>> conditions >> >> >> >> I don't think I ever saw that review. >> >> >> >>> and even if it were technically correct, it'd be an abuse >> >>> of directory objects (in what other circumstance do we poll >> >>> directories?) and not logically generalizable to a model in which we >> >>> expose process exit status via the exit-monitoring API. >> >> >> >> I agree it's weird. It might be better to have /proc/PID/exit_status >> >> and make *that* pollable. >> >> >> > >> > If there is a new exit_status file, it could even be more than >> > 8 bits of exit status: >> > >> > See https://lore.kernel.org/lkml/alpine.LSU.2.20.1507091257010.9602@nerf40.vanv.qr/T/#u >> > and http://austingroupbugs.net/view.php?id=594#c1317 >> >> First of all, as I discussed in [1], we need to first figure out *who* >> should have access to the process exit information. FreeBSD appears to >> make it public without disaster, and if we make exit status public, a >> lot of problems just disappear. > > I kind of want to go in the other direction of making a lot of process > information (especially cmdline) less publicly accessible. Okay. That has nothing to do with exit status. Please address the points related to the API we're discussing and that I raised in the other thread. Assuming we don't broaden exit status readability (which would make a lot of things simpler), the exit notification mechanism must work like this: if you can see a process in /proc, you should be able to wait on it. If you learn that process's exit status through some other means --- e.g., you're the process's parent, you can ptrace the process, you have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate of the process. Otherwise you just be able to learn that the process exited. > Windows has an easy time of it because Windows has an easier time of it because it doesn't use an ad-hoc ambient authority permission model. In Windows, if you can open a handle to do something, that handle lets you do the thing. Period. There's none of this "well, I opened this process FD, but since I opened it, the process called setuid, so now I can't get its exit status" nonsense. Privilege elevation is always accomplished via a separate call to CreateProcessWithToken, which creates a *new* process with the elevated privileges. An existing process can't suddenly and magically become this special thing that you can't inspect, but that has the same PID and identity as this other process that you used to be able to inspect. The model is just better, because permission is baked into the HANDLE. Now, that ship has sailed. We're stuck with setreuid and exec. But let's be clear about what's causing the complexity.
On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote: > On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <dancol@google.com> wrote: > > > > I had been led to believe that the proposal would be a comprehensive > > process API, not an ioctl basically equivalent to my previous patch. > > If you had a more comprehensive proposal, please just share it on LKML > > instead of limiting the discussion to those able to attend these > > various conferences. If there's some determined opposition to a > > general new process API, this opposition needs a fair and full airing, > > as not everyone can attend these conferences. > > > > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > > > With this patch an open() call on /proc/<pid> will give userspace a handle > > > to struct pid of the process associated with /proc/<pid>. This allows to > > > maintain a stable handle on a process. > > > I have been discussing various approaches extensively during technical > > > conferences this year culminating in a long argument with Eric at Linux > > > Plumbers. The general consensus was that having a handle on a process > > > will be something that is very simple and easy to maintain > > > > ioctls are the opposite of "easy to maintain". Their > > file-descriptor-specific behavior makes it difficult to use the things > > safely. If you want to take this approach, please make a new system > > call. An ioctl is just a system call with a very strange spelling and > > unfortunate collision semantics. > > > > > with the > > > option of being extensible via a more advanced api if the need arises. > > > > The need *has* arisen; see my exithand patch. > > > > > I > > > believe that this patch is the most simple, dumb, and therefore > > > maintainable solution. > > > > > > The need for this has arisen in order to reliably kill a process without > > > running into issues of the pid being recycled as has been described in the > > > rejected patch [1]. > > > > That patch was not "rejected". It was tabled pending the more > > comprehensive process API proposal that was supposed to have emerged. > > This patch is just another variant of the sort of approach we > > discussed on that patch's thread here. As I mentioned on that thread, > > the right approach option is a new system call, not an ioctl. > > > > To fulfill the need described in that patchset a new > > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > > > process via a file descriptor: > > > > > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > > > close(fd); > > > > > > Note, the stable handle will allow us to carefully extend this feature in > > > the future. > > > > We still need the ability to synchronously wait on a process's death, > > as in my patch set. I will be refreshing that patch set. > > I fully agree that a more comprehensive, less expensive API for > managing processes would be nice. But I also think that this patch > (using the directory fd and ioctl) is better from a security > perspective than using a new file in /proc. > > I have an old patch to make proc directory fds pollable: > > https://lore.kernel.org/patchwork/patch/345098/ > > That patch plus the one in this thread might make a nice addition to > the kernel even if we expect something much better to come along > later. I agree. Eric's point was to make the first implementation of this as simple as possible that's why this patch is intentionally almost trivial. And I like it for its simplicity. I had a more comprehensive API proposal of which open(/proc/<pid>) was a part. I didn't send out alongside this patch as Eric clearly prefered to only have the /proc/<pid> part. Here is the full proposal as I intended to originally send it out: The gist is to have file descriptors for processes which is obviously not a new idea. This has been done before in other OSes and it has been tried before in Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to make it very clear that I'm not laying claim to this being my or even a novel idea in any way. However, I want to diverge from previous approaches with my suggestion. (Though I can't be sure that there's not something similar in other OSes already.) One of the main motivations for having procfds is to have a race-free way of configuring, starting, polling, and killing a process. Basically, a process lifecycle api if you want to think about it that way. The api should also be easily extendable in the future to avoid running into the limitations we currently see with the clone*() syscall(s) again. One of the crucial points of the api is to *separate the configuration of a process through a procfd from actually creating the process*. This is a crucial property expressed in the open*() system calls. First, get a stable handle on an object then allow for ways to configure it. As such the procfd api shares the same insight with Al's and David's new mount api. (Fwiw, Andy also pointed out similarities with posix_spawn().) What I envisioned was to have the following syscalls (multiple name suggestions): 1. int process_open / proc_open / procopen 2. int process_config / proc_config / procconfig or ioctl()-based 3. int process_info / proc_info / procinfo or ioctl()-based 4. int process_manage / proc_manage / procmanage or ioctl()-based and the following procfs extension: int procfd = open("/proc/<pid>", O_DIRECTORY | O_CLOEXEC); Some of you will notice right away that we could replace 2-4 with ioctl()s. #### process_open() will return an fd that creates a process context. The fd returned by process_open() does neither refer to any existing process nor has the process actually been started yet. So non-configuration operations on it or trying to interact with it would fail with e.g. ESRCH/EINVAL. #### process_config() / ioctl() takes an fd returned by process_open() and can be used to configure a process context *before it is alive*. Some things that I would like to be able to do with this syscall are: - configure signals - set clone flags - write idmappings if the process runs in a new user namespace - configure what happens when all procfds referring to the process are gone - ... Just to get a very rough feel for this without detailing parameters right now: /* process should have own mountns */ process_config/ioctl(fd, PROC_SET_FLAG, CLONE_NEWNS, <potentially additional arguments>) /* process should get SIGKILL when all procfds are closed */ process_config/ioctl(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>) After the caller is done configuring the process there would be a final step: process_config/ioctl(fd, PROC_CREATE, 0, <potentially additional arguments>) which would create the process and (either as return value or through a parameter) return the pid of the newly created process. These fds should be pollable (though this is maybe out of scope for a first implementation). In combination with the split between getting an fd for a process context and starting the process would this would then allow for nice things such as adding an fd gotten via process_open() to an epoll() instance where other processes can poll the fd to e.g. (given appropriate privileges) get an event when process_config/ioctl()(fd, PROC_CREATE, *, <potentially additional arguments>) has actually started the process or it exited. #### int process_info / ioctl() allows to retrieve information about a process (e.g. signals, namespaces, or even information available through getrusage()). This would be a more performant and race-free way then parsing through various files in /proc. I remember quite some people asking for a variation of this. #### process_manage / ioctl() allows to interact/manage a process through a procfd. Specifically, one would be able to send signals to the process, retrieve the exit status from it etc. Here is an example to get a feel for it: /* send SIGTERM to process / process_manage/ioctl(fd, PROC_SIGNAL, SIGTERM, <potentially additional arguments>) /* block until the process has exited and retrieve exit information via * <potentially additional arguments>. * One could also make it possible to specify a timeout here. */ process_manage/ioctl(fd, PROC_WAIT, 0, <potentially additional arguments>) #### /proc/<pid> allow to get a procfd for an existing process. This adds a new file "handle" to /proc that serves as a way to get a procfd for a process. I hope that's enough information for now without too much detail. I think that /proc/<pid> is probably the easiest to target and that I prototyped. [1]: https://lkml.org/lkml/2018/10/30/118 [2]: https://lkml.kernel.org/r/cover.1426180120.git.josh@joshtriplett.org [3]: https://lore.kernel.org/lkml/2279556.Wl6mCVq5Zi@tjmaciei-mobl4
On Sun, Nov 18, 2018 at 9:24 AM Daniel Colascione <dancol@google.com> wrote: > Assuming we don't broaden exit status readability (which would make a > lot of things simpler), the exit notification mechanism must work like > this: if you can see a process in /proc, you should be able to wait on > it. If you learn that process's exit status through some other means > --- e.g., you're the process's parent, you can ptrace the process, you > have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate > of the process. Otherwise you just be able to learn that the process > exited. Sounds reasonable to me. Except for the obvious turd that, if you open /proc/PID/whatever, and the process calls execve(), then the resulting semantics are awkward at best. > > > Windows has an easy time of it because > > Windows has an easier time of it because it doesn't use an ad-hoc > ambient authority permission model. In Windows, if you can open a > handle to do something, that handle lets you do the thing. Period. > There's none of this "well, I opened this process FD, but since I > opened it, the process called setuid, so now I can't get its exit > status" nonsense. Privilege elevation is always accomplished via a > separate call to CreateProcessWithToken, which creates a *new* process > with the elevated privileges. An existing process can't suddenly and > magically become this special thing that you can't inspect, but that > has the same PID and identity as this other process that you used to > be able to inspect. The model is just better, because permission is > baked into the HANDLE. Now, that ship has sailed. We're stuck with > setreuid and exec. But let's be clear about what's causing the > complexity. I'm not entirely sure that ship has sailed. In the kernel, we already have a bit of a distinction between a pid (and tid, etc -- I'm referring to struct pid) and a task. If we make a new process-management API, we could put a distinction like this into the API. As a straw-man proposal (highly incomplete and probably wrong, but maybe it gets the idea across): Have a way to get an fd that refers to a "running program". (I'm calling it that to distinguish it from "task" and "pid", both of which already mean something.) You'd be able to open such an fd given a pid, and your permissions would be checked at that time. R access means you can read the running program's memory and otherwise introspect it. W means you can modify it's memory and otherwise mess with it. X means you can send it signals. We might need more bits to really do this right. Now here's the kicker: if the "running program" calls execve(), it goes away. The fd gets some sort of notification that this happened and there's an API to get a handle to the new running program *if the caller has the appropriate permissions*. setresuid() has no effect here -- if you have W access to the process and the process calls setresuid(), you still have W access. To make this fully useful, we'd probably want to elaborate it with a race-free way to track all descendents and, if needed, kill them all, subject to permissions. This API ought to be extensible to replace ptrace() eventually. Does this seem like a reasonable direction to go in?
Daniel Colascione <dancol@google.com> writes: > On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <luto@kernel.org> wrote: >> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <dancol@google.com> wrote: >>> >>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >>> >> >>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> >> > I fully agree that a more comprehensive, less expensive API for >>> >> > managing processes would be nice. But I also think that this patch >>> >> > (using the directory fd and ioctl) is better from a security >>> >> > perspective than using a new file in /proc. >>> >> >>> >> That's an assertion, not an argument. And I'm not opposed to an >>> >> operation on the directory FD, now that it's clear Linus has banned >>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API >>> >> with a system call instead of a less-reliable ioctl due to the >>> >> inherent namespace collision issues in ioctl command names. >>> > >>> > Linus banned it because of bugs iike the ones in the patch. >>> >>> Maybe: he didn't provide a reason. What's your point? >> >> My point is that an API that involves a file like /proc/PID/kill is >> very tricky to get right. Here are some considerations: > > Moot. write(2) for this interface is off the table anyway. The right > approach here is a system call that accepts a /proc/pid directory file > descriptor, a signal number, and a signal information field (as in > sigqueue(2)). If we did not have the permission check challenges and could perform the permission checks in open, write(2) would be on the table. Performing write(2) would only be concrend about data. Unfortunately we have setresuid and exec which make that infeasible for the kill operations. >> Now if we had an ioctlat() API, maybe it would make sense. But we >> don't, and I think it would be a bit crazy to add one. > > A process is not a driver. Why won't this idea of using an ioctl for > the kill-process-by-dfd thing just won't die? An ioctl has *zero* > advantage in this context. An ioctl has an advantage in implementation complexity. An ioctl is very much easier to wire up that a system call. I don't know if that outweighs ioctls disadvantages in long term maintainability. Eric
On Sun, Nov 18, 2018 at 9:42 AM Christian Brauner <christian@brauner.io> wrote: > > On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote: > > On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <dancol@google.com> wrote: > > > > > > I had been led to believe that the proposal would be a comprehensive > > > process API, not an ioctl basically equivalent to my previous patch. > > > If you had a more comprehensive proposal, please just share it on LKML > > > instead of limiting the discussion to those able to attend these > > > various conferences. If there's some determined opposition to a > > > general new process API, this opposition needs a fair and full airing, > > > as not everyone can attend these conferences. > > > > > > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > > > > With this patch an open() call on /proc/<pid> will give userspace a handle > > > > to struct pid of the process associated with /proc/<pid>. This allows to > > > > maintain a stable handle on a process. > > > > I have been discussing various approaches extensively during technical > > > > conferences this year culminating in a long argument with Eric at Linux > > > > Plumbers. The general consensus was that having a handle on a process > > > > will be something that is very simple and easy to maintain > > > > > > ioctls are the opposite of "easy to maintain". Their > > > file-descriptor-specific behavior makes it difficult to use the things > > > safely. If you want to take this approach, please make a new system > > > call. An ioctl is just a system call with a very strange spelling and > > > unfortunate collision semantics. > > > > > > > with the > > > > option of being extensible via a more advanced api if the need arises. > > > > > > The need *has* arisen; see my exithand patch. > > > > > > > I > > > > believe that this patch is the most simple, dumb, and therefore > > > > maintainable solution. > > > > > > > > The need for this has arisen in order to reliably kill a process without > > > > running into issues of the pid being recycled as has been described in the > > > > rejected patch [1]. > > > > > > That patch was not "rejected". It was tabled pending the more > > > comprehensive process API proposal that was supposed to have emerged. > > > This patch is just another variant of the sort of approach we > > > discussed on that patch's thread here. As I mentioned on that thread, > > > the right approach option is a new system call, not an ioctl. > > > > > > To fulfill the need described in that patchset a new > > > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > > > > process via a file descriptor: > > > > > > > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > > > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > > > > close(fd); > > > > > > > > Note, the stable handle will allow us to carefully extend this feature in > > > > the future. > > > > > > We still need the ability to synchronously wait on a process's death, > > > as in my patch set. I will be refreshing that patch set. > > > > I fully agree that a more comprehensive, less expensive API for > > managing processes would be nice. But I also think that this patch > > (using the directory fd and ioctl) is better from a security > > perspective than using a new file in /proc. > > > > I have an old patch to make proc directory fds pollable: > > > > https://lore.kernel.org/patchwork/patch/345098/ > > > > That patch plus the one in this thread might make a nice addition to > > the kernel even if we expect something much better to come along > > later. > > I agree. Eric's point was to make the first implementation of this as > simple as possible that's why this patch is intentionally almost > trivial. And I like it for its simplicity. > > I had a more comprehensive API proposal of which open(/proc/<pid>) was a > part. I didn't send out alongside this patch as Eric clearly prefered to > only have the /proc/<pid> part. Here is the full proposal as I intended > to originally send it out: > > The gist is to have file descriptors for processes which is obviously not a new > idea. This has been done before in other OSes and it has been tried before in > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to > make it very clear that I'm not laying claim to this being my or even a novel > idea in any way. However, I want to diverge from previous approaches with my > suggestion. (Though I can't be sure that there's not something similar in other > OSes already.) > > One of the main motivations for having procfds is to have a race-free way of > configuring, starting, polling, and killing a process. Basically, a process > lifecycle api if you want to think about it that way. The api should also be > easily extendable in the future to avoid running into the limitations we > currently see with the clone*() syscall(s) again. > > One of the crucial points of the api is to *separate the configuration > of a process through a procfd from actually creating the process*. > This is a crucial property expressed in the open*() system calls. First, get a > stable handle on an object then allow for ways to configure it. As such the > procfd api shares the same insight with Al's and David's new mount api. > (Fwiw, Andy also pointed out similarities with posix_spawn().) > What I envisioned was to have the following syscalls (multiple name suggestions): > > 1. int process_open / proc_open / procopen > 2. int process_config / proc_config / procconfig or ioctl()-based > 3. int process_info / proc_info / procinfo or ioctl()-based > 4. int process_manage / proc_manage / procmanage or ioctl()-based Emails crossed :( For process management, I generally like this, although we might do better if we make execve() effectively invalidate the handle. Then we avoid a bunch of nasty permission issues. For process *creation*, we have the problem that libc authors feel that they can't safely use fds at all. There was a proposal for "high fds" a long time back to solve that. We might finally need to do something like that.
On Sun, Nov 18, 2018 at 9:43 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Daniel Colascione <dancol@google.com> writes: > > > On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <luto@kernel.org> wrote: > >> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <dancol@google.com> wrote: > >>> > >>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: > >>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: > >>> >> > >>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: > >>> >> > I fully agree that a more comprehensive, less expensive API for > >>> >> > managing processes would be nice. But I also think that this patch > >>> >> > (using the directory fd and ioctl) is better from a security > >>> >> > perspective than using a new file in /proc. > >>> >> > >>> >> That's an assertion, not an argument. And I'm not opposed to an > >>> >> operation on the directory FD, now that it's clear Linus has banned > >>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API > >>> >> with a system call instead of a less-reliable ioctl due to the > >>> >> inherent namespace collision issues in ioctl command names. > >>> > > >>> > Linus banned it because of bugs iike the ones in the patch. > >>> > >>> Maybe: he didn't provide a reason. What's your point? > >> > >> My point is that an API that involves a file like /proc/PID/kill is > >> very tricky to get right. Here are some considerations: > > > > Moot. write(2) for this interface is off the table anyway. The right > > approach here is a system call that accepts a /proc/pid directory file > > descriptor, a signal number, and a signal information field (as in > > sigqueue(2)). > > If we did not have the permission check challenges and could perform > the permission checks in open, write(2) would be on the table. > Performing write(2) would only be concrend about data. > > Unfortunately we have setresuid and exec which make that infeasible > for the kill operations. setresuid() should be irrelevant. If you had permission to kill a process and the process calls setresuid(), you should still have permission to kill it. For execve(), we could make execve() invalidate the fd. (See other email.)
On Sun, Nov 18, 2018 at 9:42 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 9:24 AM Daniel Colascione <dancol@google.com> wrote: >> Assuming we don't broaden exit status readability (which would make a >> lot of things simpler), the exit notification mechanism must work like >> this: if you can see a process in /proc, you should be able to wait on >> it. If you learn that process's exit status through some other means >> --- e.g., you're the process's parent, you can ptrace the process, you >> have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate >> of the process. Otherwise you just be able to learn that the process >> exited. > > Sounds reasonable to me. Except for the obvious turd that, if you > open /proc/PID/whatever, and the process calls execve(), then the > resulting semantics are awkward at best. A process calling execve does not give up its logical identity. Lots of programs exec themselves, e.g., to reload configuration. >> > Windows has an easy time of it because >> >> Windows has an easier time of it because it doesn't use an ad-hoc >> ambient authority permission model. In Windows, if you can open a >> handle to do something, that handle lets you do the thing. Period. >> There's none of this "well, I opened this process FD, but since I >> opened it, the process called setuid, so now I can't get its exit >> status" nonsense. Privilege elevation is always accomplished via a >> separate call to CreateProcessWithToken, which creates a *new* process >> with the elevated privileges. An existing process can't suddenly and >> magically become this special thing that you can't inspect, but that >> has the same PID and identity as this other process that you used to >> be able to inspect. The model is just better, because permission is >> baked into the HANDLE. Now, that ship has sailed. We're stuck with >> setreuid and exec. But let's be clear about what's causing the >> complexity. > > I'm not entirely sure that ship has sailed. In the kernel, we already > have a bit of a distinction between a pid (and tid, etc -- I'm > referring to struct pid) and a task. If we make a new > process-management API, we could put a distinction like this into the > API. It would be a disaster to have different APIs give callers a different idea of process identity over its lifetime. The precedent is well-established that execve and setreuid do not change a process's identity. Invaliding some identifiers but not others in response to supposedly-internal things a process might do under rare circumstances is creating a bug machine.. > setresuid() has no effect > here -- if you have W access to the process and the process calls > setresuid(), you still have W access. Now you've created a situation in which an operation that security policy previously blocked now becomes possible, invaliding previous designs based on the old security invariant. That's the definition of introducing a security hole.
On Sun, Nov 18, 2018 at 9:43 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Daniel Colascione <dancol@google.com> writes: > >> On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <dancol@google.com> wrote: >>>> >>>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <dancol@google.com> wrote: >>>> >> >>>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>> >> > I fully agree that a more comprehensive, less expensive API for >>>> >> > managing processes would be nice. But I also think that this patch >>>> >> > (using the directory fd and ioctl) is better from a security >>>> >> > perspective than using a new file in /proc. >>>> >> >>>> >> That's an assertion, not an argument. And I'm not opposed to an >>>> >> operation on the directory FD, now that it's clear Linus has banned >>>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API >>>> >> with a system call instead of a less-reliable ioctl due to the >>>> >> inherent namespace collision issues in ioctl command names. >>>> > >>>> > Linus banned it because of bugs iike the ones in the patch. >>>> >>>> Maybe: he didn't provide a reason. What's your point? >>> >>> My point is that an API that involves a file like /proc/PID/kill is >>> very tricky to get right. Here are some considerations: >> >> Moot. write(2) for this interface is off the table anyway. The right >> approach here is a system call that accepts a /proc/pid directory file >> descriptor, a signal number, and a signal information field (as in >> sigqueue(2)). > > If we did not have the permission check challenges and could perform > the permission checks in open, write(2) would be on the table. > Performing write(2) would only be concrend about data. > > Unfortunately we have setresuid and exec which make that infeasible > for the kill operations. > >>> Now if we had an ioctlat() API, maybe it would make sense. But we >>> don't, and I think it would be a bit crazy to add one. >> >> A process is not a driver. Why won't this idea of using an ioctl for >> the kill-process-by-dfd thing just won't die? An ioctl has *zero* >> advantage in this context. > > An ioctl has an advantage in implementation complexity. An ioctl is > very much easier to wire up that a system call. > > I don't know if that outweighs ioctls disadvantages in long term > maintainability. It's not just maintainability. It's safety. We want to expose the new kill interface to userspace via some kill(1) extension, probably. So you should be able to write something like `cd /proc/12345 && kill --by-handle .`. How does kill --by-handle know that it's safe to perform the kill-by-proc-dfd operation on the file descriptor that it opens? If the kill operation is an ioctl, you could pass it /proc/self/fd/whatever of a completely different type; kill would call ioctl on whatever FD it got, and potentially do a completely random thing instead of killing a process. In the same situation, a new system call would fail reliably. Yes, kill could check that the device numbers of the file it opened matched proc's somehow, but that's annoying and error-prone and nobody's going to bother in practice. A new system call, by contrast, fails safe. I really don't want to give up safety and fail-safe behavior forever just because it's annoying, today, to wire up a new system call. (The new table-driven system call stuff, if it ever lands, would make things easier.)
On Sun, Nov 18, 2018 at 9:41 AM, Christian Brauner <christian@brauner.io> wrote: > On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote: >> On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <dancol@google.com> wrote: >> > >> > I had been led to believe that the proposal would be a comprehensive >> > process API, not an ioctl basically equivalent to my previous patch. >> > If you had a more comprehensive proposal, please just share it on LKML >> > instead of limiting the discussion to those able to attend these >> > various conferences. If there's some determined opposition to a >> > general new process API, this opposition needs a fair and full airing, >> > as not everyone can attend these conferences. >> > >> > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: >> > > With this patch an open() call on /proc/<pid> will give userspace a handle >> > > to struct pid of the process associated with /proc/<pid>. This allows to >> > > maintain a stable handle on a process. >> > > I have been discussing various approaches extensively during technical >> > > conferences this year culminating in a long argument with Eric at Linux >> > > Plumbers. The general consensus was that having a handle on a process >> > > will be something that is very simple and easy to maintain >> > >> > ioctls are the opposite of "easy to maintain". Their >> > file-descriptor-specific behavior makes it difficult to use the things >> > safely. If you want to take this approach, please make a new system >> > call. An ioctl is just a system call with a very strange spelling and >> > unfortunate collision semantics. >> > >> > > with the >> > > option of being extensible via a more advanced api if the need arises. >> > >> > The need *has* arisen; see my exithand patch. >> > >> > > I >> > > believe that this patch is the most simple, dumb, and therefore >> > > maintainable solution. >> > > >> > > The need for this has arisen in order to reliably kill a process without >> > > running into issues of the pid being recycled as has been described in the >> > > rejected patch [1]. >> > >> > That patch was not "rejected". It was tabled pending the more >> > comprehensive process API proposal that was supposed to have emerged. >> > This patch is just another variant of the sort of approach we >> > discussed on that patch's thread here. As I mentioned on that thread, >> > the right approach option is a new system call, not an ioctl. >> > >> > To fulfill the need described in that patchset a new >> > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a >> > > process via a file descriptor: >> > > >> > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); >> > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); >> > > close(fd); >> > > >> > > Note, the stable handle will allow us to carefully extend this feature in >> > > the future. >> > >> > We still need the ability to synchronously wait on a process's death, >> > as in my patch set. I will be refreshing that patch set. >> >> I fully agree that a more comprehensive, less expensive API for >> managing processes would be nice. But I also think that this patch >> (using the directory fd and ioctl) is better from a security >> perspective than using a new file in /proc. >> >> I have an old patch to make proc directory fds pollable: >> >> https://lore.kernel.org/patchwork/patch/345098/ >> >> That patch plus the one in this thread might make a nice addition to >> the kernel even if we expect something much better to come along >> later. > > I agree. Eric's point was to make the first implementation of this as > simple as possible that's why this patch is intentionally almost > trivial. And I like it for its simplicity. > > I had a more comprehensive API proposal of which open(/proc/<pid>) was a > part. I didn't send out alongside this patch as Eric clearly prefered to > only have the /proc/<pid> part. Here is the full proposal as I intended > to originally send it out: Thanks. > The gist is to have file descriptors for processes which is obviously not a new > idea. This has been done before in other OSes and it has been tried before in > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to > make it very clear that I'm not laying claim to this being my or even a novel > idea in any way. However, I want to diverge from previous approaches with my > suggestion. (Though I can't be sure that there's not something similar in other > OSes already.) Windows works basically as you describe. You can create a process is suspended state, configure it however you want, then let it run. CreateProcess (and even moreso, NtCreateProcess) also provide a rich (and *extensible*) interface for pre-creation process configuration. >> One of the main motivations for having procfds is to have a race-free way of > configuring, starting, polling, and killing a process. Basically, a process > lifecycle api if you want to think about it that way. The api should also be > easily extendable in the future to avoid running into the limitations we > currently see with the clone*() syscall(s) again. > > One of the crucial points of the api is to *separate the configuration > of a process through a procfd from actually creating the process*. > This is a crucial property expressed in the open*() system calls. First, get a > stable handle on an object then allow for ways to configure it. As such the > procfd api shares the same insight with Al's and David's new mount api. > (Fwiw, Andy also pointed out similarities with posix_spawn().) > What I envisioned was to have the following syscalls (multiple name suggestions): > > 1. int process_open / proc_open / procopen > 2. int process_config / proc_config / procconfig or ioctl()-based > 3. int process_info / proc_info / procinfo or ioctl()-based > 4. int process_manage / proc_manage / procmanage or ioctl()-based The API you've proposed seems fine to me, although I'd either 1) consolidate operations further into one system call, or 2) separate the different management operations into more and different system calls that can be audited independently. The grouping you've proposed seems to have the worst aspects of API splitting and API multiplexing. But I have no objection to it in spirit. That said, while I do want to fix process configuration and startup generally, I want to fix specific holes in the existing API surface first. The two patches I've already sent do that, and this work shouldn't wait on an ideal larger process-API overhaul that may or may not arrive. Based on previous history, I suspect that an API of the scope you're proposing would take years to overcome all LKML objections and land. I don't want to wait that long when we can make smaller fixes that would not conflict with the general architecture. The original patch on this thread is half of the right fix. While I think we should use a system call instead of an ioctl, and while I have some specific implementation critiques (which I described in a different message), it's the right general sort of thing. We should merge it. Next, I want to merge my exithand proposal, or something like it. It's likewise a simple change that, in a minimal way, addresses a longstanding API deficiency. I'm very strongly against the POLLERR-on-directory variant of the idea.
On Sun, Nov 18, 2018 at 10:07 AM Daniel Colascione <dancol@google.com> wrote: > Next, I want to merge my exithand proposal, or something like it. It's > likewise a simple change that, in a minimal way, addresses a > longstanding API deficiency. I'm very strongly against the > POLLERR-on-directory variant of the idea. Can you explain why you don't like POLLERR-on-a-directory? I'm not saying that POLLERR-on-a-directory is fantastic, but I'd like to understand what your objection is.
On Sun, Nov 18, 2018 at 9:51 AM Daniel Colascione <dancol@google.com> wrote: > > > I'm not entirely sure that ship has sailed. In the kernel, we already > > have a bit of a distinction between a pid (and tid, etc -- I'm > > referring to struct pid) and a task. If we make a new > > process-management API, we could put a distinction like this into the > > API. > > It would be a disaster to have different APIs give callers a different > idea of process identity over its lifetime. The precedent is > well-established that execve and setreuid do not change a process's > identity. Invaliding some identifiers but not others in response to > supposedly-internal things a process might do under rare circumstances > is creating a bug machine.. Here's my point: if we're really going to make a new API to manipulate processes by their fd, I think we should have at least a decent idea of how that API will get extended in the future. Right now, we have an extremely awkward situation where opening an fd in /proc requires certain capabilities or uids, and using those fds often also checks current's capabilities, and the target process may have changed its own security context, including gaining privilege via SUID, SGID, or LSM transition rules in the mean time. This has been a huge source of security bugs. It would be nice to have a model for future APIs that avoids these problems. And I didn't say in my proposal that a process's identity should fundamentally change when it calls execve(). I'm suggesting that certain operations that could cause a process to gain privilege or otherwise require greater permission to introspect (mainly execve) could be handled by invalidating the new process management fds. Sure, if init re-execs itself, it's still PID 1, but that doesn't necessarily mean that: fd = process_open_management_fd(1); [init reexecs] process_do_something(fd); needs to work. > > > setresuid() has no effect > > here -- if you have W access to the process and the process calls > > setresuid(), you still have W access. > > Now you've created a situation in which an operation that security > policy previously blocked now becomes possible, invaliding previous > designs based on the old security invariant. That's the definition of > introducing a security hole. I think you're overstating your case. To a pretty good approximation, setresuid() allows the caller to remove elements from the set {ruid, suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a process before it calls setresuid(), you might as well be able to ptrace() it after, since you could have just ptraced it and made it call setresuid() while still ptracing it. Similarly, it seems like it's probably safe to be able to open an fd that lets you watch the exit status of a process, have the process call setresuid(), and still see the exit status. Regardless of how you feel about these issues, if you're going to add an API by which you open an fd, wait for a process to exit, and read the exit status, you need to define the conditions under which you may open the fd and under which you may read the exit status once you have the fd. There are probably multiple valid answers, but the question still needs to be answered. My POLLERR hack, aside from being ugly, avoids this particular issue because it merely lets you wait for something you already could have observed using readdir().
On Sun, Nov 18, 2018 at 10:15 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 10:07 AM Daniel Colascione <dancol@google.com> wrote: >> Next, I want to merge my exithand proposal, or something like it. It's >> likewise a simple change that, in a minimal way, addresses a >> longstanding API deficiency. I'm very strongly against the >> POLLERR-on-directory variant of the idea. > > Can you explain why you don't like POLLERR-on-a-directory? I'm not > saying that POLLERR-on-a-directory is fantastic, but I'd like to > understand what your objection is. I've written my objections in more detail at [1]. Basically, 1) Nothing else today works by polling on directory file descriptors. 2) POLLERR is wrong because POLLERR indicates, well, an error, but since we want notifications upon either a transition to a zombie *or* actual death, and /proc/pid operations work perfectly well on zombie processes, there's no error to report, and reporting POLLERR would be a weird kind of lie. POLLHUP might be less awkward here. 3) POLLERR is a single bit of information. I want exit status as well, or at least a logical path from whatever we add to some kind of exit status reporting. You can get exit status from a zombie with openat on /proc/pid/stat, but what if the process fully dies by the time you get around to reading its exit status? Should we synthesize a /proc/pid/stat? It seems simpler to be able to just read(2) the exit information from some FD. 4) Event monitoring frameworks generally treat POLLERR as some kind of black sheep. Most people think in terms of bits indicating reading and writing. I want something that can cleanly integrate into existing wait mechanisms. 5) Poll events are *hints* that some other operation probably won't block if attempted. Using poll as the primary way of communicating a bit of information instead of an attempt-IO-now hint feels awkward. 6) A POLLERR interface can't be used by the shell without some kind of helper. What *advantage* does a POLLERR interface have? That you don't have to openat a separate file? That's a trivial operation in profs, especially compared to the machinery of process shutdown. I'm not particularly tied to a proc file; if we're adding a system call interface for killing a process by its procfs dfd, we can add one for returning an eventfd-like FD representing that process's status. It's unfortunate that the process handle FD also happens to be a directory FD; if it were a standalone object type, we could just use POLLIN more naturally. [1] https://lore.kernel.org/lkml/CAKOZueszfoSM0pxhmuFLOuPmJqSfYXxgutstyCgqxAyoUi4h3w@mail.gmail.com/
On Sun, Nov 18, 2018 at 10:28 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Sun, Nov 18, 2018 at 9:51 AM Daniel Colascione <dancol@google.com> wrote: >> >> > I'm not entirely sure that ship has sailed. In the kernel, we already >> > have a bit of a distinction between a pid (and tid, etc -- I'm >> > referring to struct pid) and a task. If we make a new >> > process-management API, we could put a distinction like this into the >> > API. >> >> It would be a disaster to have different APIs give callers a different >> idea of process identity over its lifetime. The precedent is >> well-established that execve and setreuid do not change a process's >> identity. Invaliding some identifiers but not others in response to >> supposedly-internal things a process might do under rare circumstances >> is creating a bug machine.. > > Here's my point: if we're really going to make a new API to manipulate > processes by their fd, I think we should have at least a decent idea > of how that API will get extended in the future. Right now, we have > an extremely awkward situation where opening an fd in /proc requires > certain capabilities or uids, and using those fds often also checks > current's capabilities, and the target process may have changed its > own security context, including gaining privilege via SUID, SGID, or > LSM transition rules in the mean time. This has been a huge source of > security bugs. It would be nice to have a model for future APIs that > avoids these problems. > > And I didn't say in my proposal that a process's identity should > fundamentally change when it calls execve(). I'm suggesting that > certain operations that could cause a process to gain privilege or > otherwise require greater permission to introspect (mainly execve) > could be handled by invalidating the new process management fds. > Sure, if init re-execs itself, it's still PID 1, but that doesn't > necessarily mean that: > > fd = process_open_management_fd(1); > [init reexecs] > process_do_something(fd); > > needs to work. PID 1 is a bad example here, because it doesn't get recycled. Other PIDs do. The snippet you gave *does* need to work, in general, because if exec invalidates the handle, and you need to reopen by PID to re-establish your right to do something with the process, that process may in fact have died between the invalidation and your reopen, and your reopened FD may refer to some other random process. The only way around this problem is to have two separate FDs --- one to represent process identity, which *must* be continuous across execve, and the other to represent some specific capability, some ability to do something to that process. It's reasonable to invalidate capability after execve, but it's not reasonable to invalidate identity. In concrete terms, I don't see a big advantage to this separation, and I think a single identity FD combined with per-operation capability checks is sufficient. And much simpler. >> > setresuid() has no effect >> > here -- if you have W access to the process and the process calls >> > setresuid(), you still have W access. >> >> Now you've created a situation in which an operation that security >> policy previously blocked now becomes possible, invaliding previous >> designs based on the old security invariant. That's the definition of >> introducing a security hole. > > I think you're overstating your case. To a pretty good approximation, > setresuid() allows the caller to remove elements from the set {ruid, > suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a > process before it calls setresuid(), you might as well be able to > ptrace() it after, since you could have just ptraced it and made it > call setresuid() while still ptracing it. What about a child that execs a setuid binary? > Similarly, it seems like > it's probably safe to be able to open an fd that lets you watch the > exit status of a process, have the process call setresuid(), and still > see the exit status. Is it? That's an open question. > > Regardless of how you feel about these issues, if you're going to add > an API by which you open an fd, wait for a process to exit, and read > the exit status, you need to define the conditions under which you may > open the fd and under which you may read the exit status once you have > the fd. There are probably multiple valid answers, but the question > still needs to be answered. Yes. That's the point I made in that previous message of mine that I referenced. > My POLLERR hack, aside from being ugly, > avoids this particular issue because it merely lets you wait for > something you already could have observed using readdir(). Yes. I mentioned this same issue-punting as the motivation behind exithand, initially, just reading EOF on exit.
On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: > > Here's my point: if we're really going to make a new API to manipulate > > processes by their fd, I think we should have at least a decent idea > > of how that API will get extended in the future. Right now, we have > > an extremely awkward situation where opening an fd in /proc requires > > certain capabilities or uids, and using those fds often also checks > > current's capabilities, and the target process may have changed its > > own security context, including gaining privilege via SUID, SGID, or > > LSM transition rules in the mean time. This has been a huge source of > > security bugs. It would be nice to have a model for future APIs that > > avoids these problems. > > > > And I didn't say in my proposal that a process's identity should > > fundamentally change when it calls execve(). I'm suggesting that > > certain operations that could cause a process to gain privilege or > > otherwise require greater permission to introspect (mainly execve) > > could be handled by invalidating the new process management fds. > > Sure, if init re-execs itself, it's still PID 1, but that doesn't > > necessarily mean that: > > > > fd = process_open_management_fd(1); > > [init reexecs] > > process_do_something(fd); > > > > needs to work. > > PID 1 is a bad example here, because it doesn't get recycled. Other > PIDs do. The snippet you gave *does* need to work, in general, because > if exec invalidates the handle, and you need to reopen by PID to > re-establish your right to do something with the process, that process > may in fact have died between the invalidation and your reopen, and > your reopened FD may refer to some other random process. I imagine the error would be -EPERM rather than -ESRCH in this case, which would be incredibly trivial for userspace to differentiate between. If you wish to re-open the path that is also trivial by re-opening through /proc/self/fd/$fd -- which will re-do any permission checks and will guarantee that you are re-opening the same 'struct file' and thus the same 'struct pid'. > The only way around this problem is to have two separate FDs --- one > to represent process identity, which *must* be continuous across > execve, and the other to represent some specific capability, some > ability to do something to that process. It's reasonable to invalidate > capability after execve, but it's not reasonable to invalidate > identity. In concrete terms, I don't see a big advantage to this > separation, and I think a single identity FD combined with > per-operation capability checks is sufficient. And much simpler. I think that the error separation above would trivially allow user-space to know whether the identity or capability of a process being monitored has changed. Currently, all operations on a '/proc/$pid' which you've previously opened and has died will give you -ESRCH. So the above separation I mentioned is entirely consistent with how users are using '/proc/$pid' to check for PID death today. > > I think you're overstating your case. To a pretty good approximation, > > setresuid() allows the caller to remove elements from the set {ruid, > > suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a > > process before it calls setresuid(), you might as well be able to > > ptrace() it after, since you could have just ptraced it and made it > > call setresuid() while still ptracing it. > > What about a child that execs a setuid binary? Yeah, for this reason I think that using -EPERM on operations that we think are not reasonable to allow possibly-less-privileged processes to do -- probably the most reasonable choice would be ptrace_may_access(). > > Similarly, it seems like > > it's probably safe to be able to open an fd that lets you watch the > > exit status of a process, have the process call setresuid(), and still > > see the exit status. > > Is it? That's an open question. Well, if we consider wait4(2) it seems that this is already the case. If you fork+exec a setuid binary you can definitely see its exit code. > > My POLLERR hack, aside from being ugly, > > avoids this particular issue because it merely lets you wait for > > something you already could have observed using readdir(). > > Yes. I mentioned this same issue-punting as the motivation behind > exithand, initially, just reading EOF on exit. One question I have about EOF-on-exit is that if we wish to extend it to allow providing the exit status (which is something we discussed in the original thread), how will multiple-readers be handled in such a scenario? Would we be storing the exit status or siginfo in the equivalent of a locked memfd?
On Sun, Nov 18, 2018 at 10:07:31AM -0800, Daniel Colascione wrote: > On Sun, Nov 18, 2018 at 9:41 AM, Christian Brauner <christian@brauner.io> wrote: > > On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote: > >> On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <dancol@google.com> wrote: > >> > > >> > I had been led to believe that the proposal would be a comprehensive > >> > process API, not an ioctl basically equivalent to my previous patch. > >> > If you had a more comprehensive proposal, please just share it on LKML > >> > instead of limiting the discussion to those able to attend these > >> > various conferences. If there's some determined opposition to a > >> > general new process API, this opposition needs a fair and full airing, > >> > as not everyone can attend these conferences. > >> > > >> > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <christian@brauner.io> wrote: > >> > > With this patch an open() call on /proc/<pid> will give userspace a handle > >> > > to struct pid of the process associated with /proc/<pid>. This allows to > >> > > maintain a stable handle on a process. > >> > > I have been discussing various approaches extensively during technical > >> > > conferences this year culminating in a long argument with Eric at Linux > >> > > Plumbers. The general consensus was that having a handle on a process > >> > > will be something that is very simple and easy to maintain > >> > > >> > ioctls are the opposite of "easy to maintain". Their > >> > file-descriptor-specific behavior makes it difficult to use the things > >> > safely. If you want to take this approach, please make a new system > >> > call. An ioctl is just a system call with a very strange spelling and > >> > unfortunate collision semantics. > >> > > >> > > with the > >> > > option of being extensible via a more advanced api if the need arises. > >> > > >> > The need *has* arisen; see my exithand patch. > >> > > >> > > I > >> > > believe that this patch is the most simple, dumb, and therefore > >> > > maintainable solution. > >> > > > >> > > The need for this has arisen in order to reliably kill a process without > >> > > running into issues of the pid being recycled as has been described in the > >> > > rejected patch [1]. > >> > > >> > That patch was not "rejected". It was tabled pending the more > >> > comprehensive process API proposal that was supposed to have emerged. > >> > This patch is just another variant of the sort of approach we > >> > discussed on that patch's thread here. As I mentioned on that thread, > >> > the right approach option is a new system call, not an ioctl. > >> > > >> > To fulfill the need described in that patchset a new > >> > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > >> > > process via a file descriptor: > >> > > > >> > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > >> > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > >> > > close(fd); > >> > > > >> > > Note, the stable handle will allow us to carefully extend this feature in > >> > > the future. > >> > > >> > We still need the ability to synchronously wait on a process's death, > >> > as in my patch set. I will be refreshing that patch set. > >> > >> I fully agree that a more comprehensive, less expensive API for > >> managing processes would be nice. But I also think that this patch > >> (using the directory fd and ioctl) is better from a security > >> perspective than using a new file in /proc. > >> > >> I have an old patch to make proc directory fds pollable: > >> > >> https://lore.kernel.org/patchwork/patch/345098/ > >> > >> That patch plus the one in this thread might make a nice addition to > >> the kernel even if we expect something much better to come along > >> later. > > > > I agree. Eric's point was to make the first implementation of this as > > simple as possible that's why this patch is intentionally almost > > trivial. And I like it for its simplicity. > > > > I had a more comprehensive API proposal of which open(/proc/<pid>) was a > > part. I didn't send out alongside this patch as Eric clearly prefered to > > only have the /proc/<pid> part. Here is the full proposal as I intended > > to originally send it out: > > Thanks. > > > The gist is to have file descriptors for processes which is obviously not a new > > idea. This has been done before in other OSes and it has been tried before in > > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to > > make it very clear that I'm not laying claim to this being my or even a novel > > idea in any way. However, I want to diverge from previous approaches with my > > suggestion. (Though I can't be sure that there's not something similar in other > > OSes already.) > > Windows works basically as you describe. You can create a process is > suspended state, configure it however you want, then let it run. > CreateProcess (and even moreso, NtCreateProcess) also provide a rich > (and *extensible*) interface for pre-creation process configuration. > > >> One of the main motivations for having procfds is to have a race-free way of > > configuring, starting, polling, and killing a process. Basically, a process > > lifecycle api if you want to think about it that way. The api should also be > > easily extendable in the future to avoid running into the limitations we > > currently see with the clone*() syscall(s) again. > > > > One of the crucial points of the api is to *separate the configuration > > of a process through a procfd from actually creating the process*. > > This is a crucial property expressed in the open*() system calls. First, get a > > stable handle on an object then allow for ways to configure it. As such the > > procfd api shares the same insight with Al's and David's new mount api. > > (Fwiw, Andy also pointed out similarities with posix_spawn().) > > What I envisioned was to have the following syscalls (multiple name suggestions): > > > > 1. int process_open / proc_open / procopen > > 2. int process_config / proc_config / procconfig or ioctl()-based > > 3. int process_info / proc_info / procinfo or ioctl()-based > > 4. int process_manage / proc_manage / procmanage or ioctl()-based > > The API you've proposed seems fine to me, although I'd either 1) > consolidate operations further into one system call, or 2) separate > the different management operations into more and different system > calls that can be audited independently. The grouping you've proposed > seems to have the worst aspects of API splitting and API multiplexing. > But I have no objection to it in spirit. > > That said, while I do want to fix process configuration and startup > generally, I want to fix specific holes in the existing API surface > first. The two patches I've already sent do that, and this work > shouldn't wait on an ideal larger process-API overhaul that may or may > not arrive. Based on previous history, I suspect that an API of the > scope you're proposing would take years to overcome all LKML > objections and land. I don't want to wait that long when we can make > smaller fixes that would not conflict with the general architecture. > > The original patch on this thread is half of the right fix. While I > think we should use a system call instead of an ioctl, and while I > have some specific implementation critiques (which I described in a > different message), it's the right general sort of thing. We should > merge it. Thanks. I agree. Note, I don't care if it's an ioctl() or not. I'm happy to instead add a syscall process_signal() alongside this patchset. What do people prefer? > > Next, I want to merge my exithand proposal, or something like it. It's > likewise a simple change that, in a minimal way, addresses a > longstanding API deficiency. I'm very strongly against the > POLLERR-on-directory variant of the idea.
On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: >> > Here's my point: if we're really going to make a new API to manipulate >> > processes by their fd, I think we should have at least a decent idea >> > of how that API will get extended in the future. Right now, we have >> > an extremely awkward situation where opening an fd in /proc requires >> > certain capabilities or uids, and using those fds often also checks >> > current's capabilities, and the target process may have changed its >> > own security context, including gaining privilege via SUID, SGID, or >> > LSM transition rules in the mean time. This has been a huge source of >> > security bugs. It would be nice to have a model for future APIs that >> > avoids these problems. >> > >> > And I didn't say in my proposal that a process's identity should >> > fundamentally change when it calls execve(). I'm suggesting that >> > certain operations that could cause a process to gain privilege or >> > otherwise require greater permission to introspect (mainly execve) >> > could be handled by invalidating the new process management fds. >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't >> > necessarily mean that: >> > >> > fd = process_open_management_fd(1); >> > [init reexecs] >> > process_do_something(fd); >> > >> > needs to work. >> >> PID 1 is a bad example here, because it doesn't get recycled. Other >> PIDs do. The snippet you gave *does* need to work, in general, because >> if exec invalidates the handle, and you need to reopen by PID to >> re-establish your right to do something with the process, that process >> may in fact have died between the invalidation and your reopen, and >> your reopened FD may refer to some other random process. > > I imagine the error would be -EPERM rather than -ESRCH in this case, > which would be incredibly trivial for userspace to differentiate > between. Why would userspace necessarily see EPERM? The PID might get recycled into a different random process that the caller has the ability to affect. > If you wish to re-open the path that is also trivial by > re-opening through /proc/self/fd/$fd -- which will re-do any permission > checks and will guarantee that you are re-opening the same 'struct file' > and thus the same 'struct pid'. When you reopen via /proc/self/fd, you get a new struct file referencing the existing inode, not the same struct file. A new reference to the old struct file would just be dup. Anyway: what other API requires, for correct operation, occasional reopening through /proc/self/fd? It's cumbersome, and it doesn't add anything. If we invalidate process handles on execve, and processes are legally allowed to re-exec themselves for arbitrary reasons at any time, that's tantamount to saying that handles might become invalid at any time and that all callers must be prepared to go through the reopen-and-retry path before any operation. Why are we making them do that? So that a process can have an open FD that represents a process-operation capability? Which capability does the open FD represent? I think when you and Andy must be talking about is an API that looks like this: int open_process_operation_handle(int procfs_dirfd, int capability_bitmask) capability_bitmask would have bits like PROCESS_CAPABILITY_KILL --- send a signal PROCESS_CAPABILITY_PTRACE --- attach to a process PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin PROCESS_CAPABILITY_READ_CMDLINE --- etc. Then you'd have system calls like int process_kill(int process_capability_fd, int signo, const union sigval data) int process_ptrace_attach(int process_capability_fd) int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) that worked on these capability bits. If a process execs or does something else to change its security capabilities, operations on these capability FDs would fail with ESTALE or something and callers would have to re-acquire their capabilities. This approach works fine. It has some nice theoretical properties, and could allow for things like nicer privilege separation for debuggers. I wouldn't mind something like this getting into the kernel. I just don't think this model is necessary right now. I want a small change from what we have today, one likely to actually make it into the tree. And bypassing the capability FDs and just allowing callers to operate directly on process *identity* FDs, using privileges in effect at the time of all, is at least no worse than what we have now. That is, I'm proposing an API that looks like this: int process_kill(int procfs_dfd, int signo, const union sigval value) If, later, process_kill were to *also* accept process-capability FDs, nothing would break. >> The only way around this problem is to have two separate FDs --- one >> to represent process identity, which *must* be continuous across >> execve, and the other to represent some specific capability, some >> ability to do something to that process. It's reasonable to invalidate >> capability after execve, but it's not reasonable to invalidate >> identity. In concrete terms, I don't see a big advantage to this >> separation, and I think a single identity FD combined with >> per-operation capability checks is sufficient. And much simpler. > > I think that the error separation above would trivially allow user-space > to know whether the identity or capability of a process being monitored > has changed. > > Currently, all operations on a '/proc/$pid' which you've previously > opened and has died will give you -ESRCH. Not the case. Zombies have died, but profs operations work fine on zombies. >> > Similarly, it seems like >> > it's probably safe to be able to open an fd that lets you watch the >> > exit status of a process, have the process call setresuid(), and still >> > see the exit status. >> >> Is it? That's an open question. > > Well, if we consider wait4(2) it seems that this is already the case. > If you fork+exec a setuid binary you can definitely see its exit code. Only if you're the parent. Otherwise, you can't see the process exit status unless you pass a ptrace access check and consult /proc/pid/stat after the process dies, but before the zombie disappears. Random unrelated and unprivileged processes can't see exit statuses from distant parts of the system. >> > My POLLERR hack, aside from being ugly, >> > avoids this particular issue because it merely lets you wait for >> > something you already could have observed using readdir(). >> >> Yes. I mentioned this same issue-punting as the motivation behind >> exithand, initially, just reading EOF on exit. > > One question I have about EOF-on-exit is that if we wish to extend it to > allow providing the exit status (which is something we discussed in the > original thread), how will multiple-readers be handled in such a > scenario? > Would we be storing the exit status or siginfo in the > equivalent of a locked memfd? Yes, that's what I have in mind. A siginfo_t is small enough that we could just store it as a blob allocated off the procfs inode or something like that without bothering with a shmfs file. You'd be able to read(2) the exit status as many times as you wanted.
On Sun, Nov 18, 2018 at 11:44:19AM -0800, Daniel Colascione wrote: > On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: > >> > Here's my point: if we're really going to make a new API to manipulate > >> > processes by their fd, I think we should have at least a decent idea > >> > of how that API will get extended in the future. Right now, we have > >> > an extremely awkward situation where opening an fd in /proc requires > >> > certain capabilities or uids, and using those fds often also checks > >> > current's capabilities, and the target process may have changed its > >> > own security context, including gaining privilege via SUID, SGID, or > >> > LSM transition rules in the mean time. This has been a huge source of > >> > security bugs. It would be nice to have a model for future APIs that > >> > avoids these problems. > >> > > >> > And I didn't say in my proposal that a process's identity should > >> > fundamentally change when it calls execve(). I'm suggesting that > >> > certain operations that could cause a process to gain privilege or > >> > otherwise require greater permission to introspect (mainly execve) > >> > could be handled by invalidating the new process management fds. > >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't > >> > necessarily mean that: > >> > > >> > fd = process_open_management_fd(1); > >> > [init reexecs] > >> > process_do_something(fd); > >> > > >> > needs to work. > >> > >> PID 1 is a bad example here, because it doesn't get recycled. Other > >> PIDs do. The snippet you gave *does* need to work, in general, because > >> if exec invalidates the handle, and you need to reopen by PID to > >> re-establish your right to do something with the process, that process > >> may in fact have died between the invalidation and your reopen, and > >> your reopened FD may refer to some other random process. > > > > I imagine the error would be -EPERM rather than -ESRCH in this case, > > which would be incredibly trivial for userspace to differentiate > > between. > > Why would userspace necessarily see EPERM? The PID might get recycled > into a different random process that the caller has the ability to > affect. > > > If you wish to re-open the path that is also trivial by > > re-opening through /proc/self/fd/$fd -- which will re-do any permission > > checks and will guarantee that you are re-opening the same 'struct file' > > and thus the same 'struct pid'. > > When you reopen via /proc/self/fd, you get a new struct file > referencing the existing inode, not the same struct file. A new > reference to the old struct file would just be dup. > > Anyway: what other API requires, for correct operation, occasional > reopening through /proc/self/fd? It's cumbersome, and it doesn't add > anything. If we invalidate process handles on execve, and processes > are legally allowed to re-exec themselves for arbitrary reasons at any > time, that's tantamount to saying that handles might become invalid at > any time and that all callers must be prepared to go through the > reopen-and-retry path before any operation. > > Why are we making them do that? So that a process can have an open FD > that represents a process-operation capability? Which capability does > the open FD represent? > > I think when you and Andy must be talking about is an API that looks like this: > > int open_process_operation_handle(int procfs_dirfd, int capability_bitmask) > > capability_bitmask would have bits like > > PROCESS_CAPABILITY_KILL --- send a signal > PROCESS_CAPABILITY_PTRACE --- attach to a process > PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin > PROCESS_CAPABILITY_READ_CMDLINE --- etc. > > Then you'd have system calls like > > int process_kill(int process_capability_fd, int signo, const union sigval data) > int process_ptrace_attach(int process_capability_fd) > int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) > > that worked on these capability bits. If a process execs or does > something else to change its security capabilities, operations on > these capability FDs would fail with ESTALE or something and callers > would have to re-acquire their capabilities. > > This approach works fine. It has some nice theoretical properties, and > could allow for things like nicer privilege separation for debuggers. > I wouldn't mind something like this getting into the kernel. > > I just don't think this model is necessary right now. I want a small > change from what we have today, one likely to actually make it into > the tree. And bypassing the capability FDs and just allowing callers > to operate directly on process *identity* FDs, using privileges in > effect at the time of all, is at least no worse than what we have now. > > That is, I'm proposing an API that looks like this: > > int process_kill(int procfs_dfd, int signo, const union sigval value) I've started a second tree with process_signal(int procpid_dfd, int sig) instead of an ioctl(). Why do you want sigval too? > > If, later, process_kill were to *also* accept process-capability FDs, > nothing would break. > > >> The only way around this problem is to have two separate FDs --- one > >> to represent process identity, which *must* be continuous across > >> execve, and the other to represent some specific capability, some > >> ability to do something to that process. It's reasonable to invalidate > >> capability after execve, but it's not reasonable to invalidate > >> identity. In concrete terms, I don't see a big advantage to this > >> separation, and I think a single identity FD combined with > >> per-operation capability checks is sufficient. And much simpler. > > > > I think that the error separation above would trivially allow user-space > > to know whether the identity or capability of a process being monitored > > has changed. > > > > Currently, all operations on a '/proc/$pid' which you've previously > > opened and has died will give you -ESRCH. > > Not the case. Zombies have died, but profs operations work fine on zombies. > > >> > Similarly, it seems like > >> > it's probably safe to be able to open an fd that lets you watch the > >> > exit status of a process, have the process call setresuid(), and still > >> > see the exit status. > >> > >> Is it? That's an open question. > > > > Well, if we consider wait4(2) it seems that this is already the case. > > If you fork+exec a setuid binary you can definitely see its exit code. > > Only if you're the parent. Otherwise, you can't see the process exit > status unless you pass a ptrace access check and consult > /proc/pid/stat after the process dies, but before the zombie > disappears. Random unrelated and unprivileged processes can't see exit > statuses from distant parts of the system. > > >> > My POLLERR hack, aside from being ugly, > >> > avoids this particular issue because it merely lets you wait for > >> > something you already could have observed using readdir(). > >> > >> Yes. I mentioned this same issue-punting as the motivation behind > >> exithand, initially, just reading EOF on exit. > > > > One question I have about EOF-on-exit is that if we wish to extend it to > > allow providing the exit status (which is something we discussed in the > > original thread), how will multiple-readers be handled in such a > > scenario? > > Would we be storing the exit status or siginfo in the > > equivalent of a locked memfd? > > Yes, that's what I have in mind. A siginfo_t is small enough that we > could just store it as a blob allocated off the procfs inode or > something like that without bothering with a shmfs file. You'd be able > to read(2) the exit status as many times as you wanted.
On Sun, Nov 18, 2018 at 12:15 PM, Christian Brauner <christian@brauner.io> wrote: >> That is, I'm proposing an API that looks like this: >> >> int process_kill(int procfs_dfd, int signo, const union sigval value) > > I've started a second tree with process_signal(int procpid_dfd, int sig) Thanks. > instead of an ioctl(). Why do you want sigval too? API completeness. The sigqueue interface is a superset of kill, and I don't want process_kill to do less than any PID-based kill. Maybe taking a siginfo_t, like rt_sigqueueinfo does, would be even better in that respect, come to think of it.
> On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: > > > That is, I'm proposing an API that looks like this: > > int process_kill(int procfs_dfd, int signo, const union sigval value) > > If, later, process_kill were to *also* accept process-capability FDs, > nothing would break. Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > Yes, that's what I have in mind. A siginfo_t is small enough that we > could just store it as a blob allocated off the procfs inode or > something like that without bothering with a shmfs file. You'd be able > to read(2) the exit status as many times as you wanted. I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly. Read and poll have the same problem as write: we can’t check caps in read or poll either.
On Sun, Nov 18, 2018 at 12:28 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> That is, I'm proposing an API that looks like this: >> >> int process_kill(int procfs_dfd, int signo, const union sigval value) >> >> If, later, process_kill were to *also* accept process-capability FDs, >> nothing would break. > > Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. Sure. A flag might make for better ergonomics. >> Yes, that's what I have in mind. A siginfo_t is small enough that we >> could just store it as a blob allocated off the procfs inode or >> something like that without bothering with a shmfs file. You'd be able >> to read(2) the exit status as many times as you wanted. > > I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly. Why? The file pointer would work normally. > Read and poll have the same problem as write: we can’t check caps in read or poll either. Why not? Reading /proc/pid/stat does an access check today and conditionally replaces the exit status with zero.
On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: > > > > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: > > > > > > > That is, I'm proposing an API that looks like this: > > > > int process_kill(int procfs_dfd, int signo, const union sigval value) > > > > If, later, process_kill were to *also* accept process-capability FDs, > > nothing would break. > > Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. I can add a flag argument int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) The way I see it process_signal() should be equivalent to kill(pid, signal) for now. That is siginfo_t is cleared and set to: info.si_signo = sig; info.si_errno = 0; info.si_code = SI_USER; info.si_pid = task_tgid_vnr(current); info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > > > Yes, that's what I have in mind. A siginfo_t is small enough that we > > could just store it as a blob allocated off the procfs inode or > > something like that without bothering with a shmfs file. You'd be able > > to read(2) the exit status as many times as you wanted. > > I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly. > > Read and poll have the same problem as write: we can’t check caps in read or poll either.
On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner <christian@brauner.io> wrote: > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: >> >> >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: >> > >> >> > >> > That is, I'm proposing an API that looks like this: >> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value) >> > >> > If, later, process_kill were to *also* accept process-capability FDs, >> > nothing would break. >> >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > > I can add a flag argument > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) > The way I see it process_signal() should be equivalent to kill(pid, signal) for now. > That is siginfo_t is cleared and set to: > > info.si_signo = sig; > info.si_errno = 0; > info.si_code = SI_USER; > info.si_pid = task_tgid_vnr(current); > info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); That makes sense. I just don't want to get into a situation where callers feel that they *have* to use the PID-based APIs to send a signal because process_kill doesn't offer some bit of functionality. Are you imagining something like requiring info t be NULL unless flags contains some "I have a siginfo_t" value? BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the passed-in si_pid and si_uid match what the kernel would set them to in the kill(2) case. The whole point of SI_USER is that the recipient knows that it can trust the origin information embedded in the siginfo_t in the signal handler. If the kernel verifies that a signal sender isn't actually lying, why not let people send SI_USER with rt_sigqueueinfo?
On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote: > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner > <christian@brauner.io> wrote: > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: > >> > >> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: > >> > > >> > >> > > >> > That is, I'm proposing an API that looks like this: > >> > > >> > int process_kill(int procfs_dfd, int signo, const union sigval value) > >> > > >> > If, later, process_kill were to *also* accept process-capability FDs, > >> > nothing would break. > >> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > > > > I can add a flag argument > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now. > > That is siginfo_t is cleared and set to: > > > > info.si_signo = sig; > > info.si_errno = 0; > > info.si_code = SI_USER; > > info.si_pid = task_tgid_vnr(current); > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > > That makes sense. I just don't want to get into a situation where > callers feel that they *have* to use the PID-based APIs to send a > signal because process_kill doesn't offer some bit of functionality. Yeah. > > Are you imagining something like requiring info t be NULL unless flags > contains some "I have a siginfo_t" value? Well, I was actually thinking about something like: /** * sys_process_signal - send a signal to a process trough a process file descriptor * @fd: the file descriptor of the process * @sig: signal to be sent * @info: the signal info * @flags: future flags to be passed */ SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, int, flags) { struct pid *pid; struct fd *f; kernel_siginfo_t kinfo; /* Do not allow users to pass garbage. */ if (flags) return -EINVAL; int ret = __copy_siginfo_from_user(sig, &kinfo, info); if (unlikely(ret)) return ret; /* For now, enforce that caller's creds are used. */ kinfo.si_pid = task_tgid_vnr(current); kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); if (signal_impersonates_kernel(kinfo)) return -EPERM; f = fdget(fd); if (!f.file) return -EBADF; pid = f.file->private_data; if (!pid) return -EBADF; return kill_pid_info(sig, kinfo, pid); } > > BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the > passed-in si_pid and si_uid match what the kernel would set them to in > the kill(2) case. The whole point of SI_USER is that the recipient > knows that it can trust the origin information embedded in the > siginfo_t in the signal handler. If the kernel verifies that a signal > sender isn't actually lying, why not let people send SI_USER with > rt_sigqueueinfo?
On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote: > On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote: > > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner > > <christian@brauner.io> wrote: > > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: > > >> > > >> > > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: > > >> > > > >> > > >> > > > >> > That is, I'm proposing an API that looks like this: > > >> > > > >> > int process_kill(int procfs_dfd, int signo, const union sigval value) > > >> > > > >> > If, later, process_kill were to *also* accept process-capability FDs, > > >> > nothing would break. > > >> > > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > > > > > > I can add a flag argument > > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) > > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now. > > > That is siginfo_t is cleared and set to: > > > > > > info.si_signo = sig; > > > info.si_errno = 0; > > > info.si_code = SI_USER; > > > info.si_pid = task_tgid_vnr(current); > > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > > > > That makes sense. I just don't want to get into a situation where > > callers feel that they *have* to use the PID-based APIs to send a > > signal because process_kill doesn't offer some bit of functionality. > > Yeah. > > > > > Are you imagining something like requiring info t be NULL unless flags > > contains some "I have a siginfo_t" value? > > Well, I was actually thinking about something like: > > /** > * sys_process_signal - send a signal to a process trough a process file descriptor > * @fd: the file descriptor of the process > * @sig: signal to be sent > * @info: the signal info > * @flags: future flags to be passed > */ > SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, > int, flags) > { > struct pid *pid; > struct fd *f; > kernel_siginfo_t kinfo; > > /* Do not allow users to pass garbage. */ > if (flags) > return -EINVAL; > > int ret = __copy_siginfo_from_user(sig, &kinfo, info); > if (unlikely(ret)) > return ret; > > /* For now, enforce that caller's creds are used. */ > kinfo.si_pid = task_tgid_vnr(current); > kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > > if (signal_impersonates_kernel(kinfo)) > return -EPERM; > > f = fdget(fd); > if (!f.file) > return -EBADF; > > pid = f.file->private_data; > if (!pid) > return -EBADF; > > return kill_pid_info(sig, kinfo, pid); > } Just jotted this down here briefly. This will need an fput and so on obvs. > > > > > BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the > > passed-in si_pid and si_uid match what the kernel would set them to in > > the kill(2) case. The whole point of SI_USER is that the recipient > > knows that it can trust the origin information embedded in the > > siginfo_t in the signal handler. If the kernel verifies that a signal > > sender isn't actually lying, why not let people send SI_USER with > > rt_sigqueueinfo?
On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: > > The gist is to have file descriptors for processes which is obviously not a new > > idea. This has been done before in other OSes and it has been tried before in > > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to > > make it very clear that I'm not laying claim to this being my or even a novel > > idea in any way. However, I want to diverge from previous approaches with my > > suggestion. (Though I can't be sure that there's not something similar in other > > OSes already.) > > Windows works basically as you describe. You can create a process is > suspended state, configure it however you want, then let it run. > CreateProcess (and even moreso, NtCreateProcess) also provide a rich > (and *extensible*) interface for pre-creation process configuration. > > >> One of the main motivations for having procfds is to have a race-free way of > > configuring, starting, polling, and killing a process. Basically, a process > > lifecycle api if you want to think about it that way. The api should also be > > easily extendable in the future to avoid running into the limitations we > > currently see with the clone*() syscall(s) again. > > > > One of the crucial points of the api is to *separate the configuration > > of a process through a procfd from actually creating the process*. > > This is a crucial property expressed in the open*() system calls. First, get a > > stable handle on an object then allow for ways to configure it. As such the > > procfd api shares the same insight with Al's and David's new mount api. > > (Fwiw, Andy also pointed out similarities with posix_spawn().) > > What I envisioned was to have the following syscalls (multiple name suggestions): > > > > 1. int process_open / proc_open / procopen > > 2. int process_config / proc_config / procconfig or ioctl()-based > > 3. int process_info / proc_info / procinfo or ioctl()-based > > 4. int process_manage / proc_manage / procmanage or ioctl()-based > > The API you've proposed seems fine to me, although I'd either 1) > consolidate operations further into one system call, or 2) separate > the different management operations into more and different system > calls that can be audited independently. The grouping you've proposed > seems to have the worst aspects of API splitting and API multiplexing. > But I have no objection to it in spirit. I think combining it all into one API is going to be a soft re-invention of ioctls, but specifically for procfds. This would be an improvement over just ioctls (since the current ioctl namespacing is based on well-behaved drivers and hoping we never have more than 256 ioctl drivers), but I'm not sure it would help make the API nicer than having separate syscalls (we'd have to do something similar to bpf(2) which I'm not a huge fan of). > That said, while I do want to fix process configuration and startup > generally, I want to fix specific holes in the existing API surface > first. The two patches I've already sent do that, and this work > shouldn't wait on an ideal larger process-API overhaul that may or may > not arrive. Based on previous history, I suspect that an API of the > scope you're proposing would take years to overcome all LKML > objections and land. I don't want to wait that long when we can make > smaller fixes that would not conflict with the general architecture. I believe this is precisely what Christian is trying to do with this patch (and you say as much later in your mail). I think that adding all of {sighand,sighand_exitcode,kill,...} would not help the path of landing a much larger API change. We should instead think about the API we want at the end of the day, and then land smaller changes which are long-term compatible (and won't just become deprecated APIs -- there's far too many of them, let's not add more needlessly). If the plan is to have an ioctl API we should merge minor ioctls first. If the idea is to have an explosion of syscalls, then we should merge minor syscalls first. We shouldn't merge procfiles if the end goal is to not use them. > Next, I want to merge my exithand proposal, or something like it. It's > likewise a simple change that, in a minimal way, addresses a > longstanding API deficiency. I'm very strongly against the > POLLERR-on-directory variant of the idea. I agree with you on this need. I will admit I do somewhat like the EOF solution (mainly because it seamlessly deals with the multi-reader case) but I'm still not sure I like /proc/$pid/exit_sighand. As mentioned in the other discussion, ideally we would be only ever operating with the An ugly strawman of an alternative would be an interface that gave you an fd that you could similarly wait-until-EOF on, but that's probably not a major API improvement unless we also make said API provide exit status information in a way that works with the multiple-readers-with-different-creds usecase. One other thing I think we should eventually consider is to provide an API which pings a listener whenever a process does an execve() (and possibly fork()). This is something you can get from FreeBSD's kqueue -- and is something that we have in a really neutered form in the "proc connector". But of course we can discuss this separately, especially if we have an extensible API idea in mind when we start.
On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: > On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: > >> > Here's my point: if we're really going to make a new API to manipulate > >> > processes by their fd, I think we should have at least a decent idea > >> > of how that API will get extended in the future. Right now, we have > >> > an extremely awkward situation where opening an fd in /proc requires > >> > certain capabilities or uids, and using those fds often also checks > >> > current's capabilities, and the target process may have changed its > >> > own security context, including gaining privilege via SUID, SGID, or > >> > LSM transition rules in the mean time. This has been a huge source of > >> > security bugs. It would be nice to have a model for future APIs that > >> > avoids these problems. > >> > > >> > And I didn't say in my proposal that a process's identity should > >> > fundamentally change when it calls execve(). I'm suggesting that > >> > certain operations that could cause a process to gain privilege or > >> > otherwise require greater permission to introspect (mainly execve) > >> > could be handled by invalidating the new process management fds. > >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't > >> > necessarily mean that: > >> > > >> > fd = process_open_management_fd(1); > >> > [init reexecs] > >> > process_do_something(fd); > >> > > >> > needs to work. > >> > >> PID 1 is a bad example here, because it doesn't get recycled. Other > >> PIDs do. The snippet you gave *does* need to work, in general, because > >> if exec invalidates the handle, and you need to reopen by PID to > >> re-establish your right to do something with the process, that process > >> may in fact have died between the invalidation and your reopen, and > >> your reopened FD may refer to some other random process. > > > > I imagine the error would be -EPERM rather than -ESRCH in this case, > > which would be incredibly trivial for userspace to differentiate > > between. > > Why would userspace necessarily see EPERM? The PID might get recycled > into a different random process that the caller has the ability to > affect. I'm not sure what you're talking about. execve() doesn't change the PID of a process, and in the case we are talking about: pidX_handle = open_pid_handle(pidX); [ pidX execs a setuid binary ] do_something(pidX_handle); pidX still has the same PID (so PID recycling is irrelevant in this case). The key point is whether do_something() should give you an error in such a state transition, and in that case I would say you'd get -EPERM which would indicate (obviously) insufficient privileges. If the PID has died you'd get -ESRCH. Even if it was eventually recycled. Because you've pinned a 'struct pid'. > > If you wish to re-open the path that is also trivial by > > re-opening through /proc/self/fd/$fd -- which will re-do any permission > > checks and will guarantee that you are re-opening the same 'struct file' > > and thus the same 'struct pid'. > > When you reopen via /proc/self/fd, you get a new struct file > referencing the existing inode, not the same struct file. A new > reference to the old struct file would just be dup. I don't think this is really relevant to what I'm trying to say... > Anyway: what other API requires, for correct operation, occasional > reopening through /proc/self/fd? It's cumbersome, and it doesn't add > anything. If we invalidate process handles on execve, and processes > are legally allowed to re-exec themselves for arbitrary reasons at any > time, that's tantamount to saying that handles might become invalid at > any time and that all callers must be prepared to go through the > reopen-and-retry path before any operation. O_PATH. In container runtimes this is necessary for several reasons to protect against malicious container root filesystems as well as avoiding exposing a dirfd to the container. In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other operations. In runc we use it for FIFO re-opening so that we can signal pid1 in a container to execve() into user code. So this isn't a new thing. > Why are we making them do that? So that a process can have an open FD > that represents a process-operation capability? Which capability does > the open FD represent? The re-opening part was just an argument to show that there isn't a condition where you wouldn't be able to get access to the 'struct pid'. I doubt that anyone would actually need to use this -- since you'd need to pass "/proc/pid/fd/..." to a more privileged process in order to use the re-opening. But this also means that we don't need to have a concept of a pidfd that isn't actually associated with a PID but is instead associated with current->mm (which is what you appear to be proposing with the whole "identity fd" concept). > I think when you and Andy must be talking about is an API that looks like this: > > int open_process_operation_handle(int procfs_dirfd, int capability_bitmask) > > capability_bitmask would have bits like > > PROCESS_CAPABILITY_KILL --- send a signal > PROCESS_CAPABILITY_PTRACE --- attach to a process > PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin > PROCESS_CAPABILITY_READ_CMDLINE --- etc. > > Then you'd have system calls like > > int process_kill(int process_capability_fd, int signo, const union sigval data) > int process_ptrace_attach(int process_capability_fd) > int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) > > that worked on these capability bits. If a process execs or does > something else to change its security capabilities, operations on > these capability FDs would fail with ESTALE or something and callers > would have to re-acquire their capabilities. > > This approach works fine. It has some nice theoretical properties, and > could allow for things like nicer privilege separation for debuggers. > I wouldn't mind something like this getting into the kernel. Andy might be arguing for this (and as you said, I can see the benefit of doing it this way). I'm not convinced that doing permission checks on-open is necessary here -- I get Andy's point about write(2) semantics but I think a new set of proc_* syscalls wouldn't need to follow those semantics. I might be wrong though. > I just don't think this model is necessary right now. I want a small > change from what we have today, one likely to actually make it into > the tree. And bypassing the capability FDs and just allowing callers > to operate directly on process *identity* FDs, using privileges in > effect at the time of all, is at least no worse than what we have now. > > That is, I'm proposing an API that looks like this: > > int process_kill(int procfs_dfd, int signo, const union sigval value) > > If, later, process_kill were to *also* accept process-capability FDs, > nothing would break. Again, I think we should agree on whether it's necessary to have both types of fds before we commit to maintaining both APIs forever... > >> The only way around this problem is to have two separate FDs --- one > >> to represent process identity, which *must* be continuous across > >> execve, and the other to represent some specific capability, some > >> ability to do something to that process. It's reasonable to invalidate > >> capability after execve, but it's not reasonable to invalidate > >> identity. In concrete terms, I don't see a big advantage to this > >> separation, and I think a single identity FD combined with > >> per-operation capability checks is sufficient. And much simpler. > > > > I think that the error separation above would trivially allow user-space > > to know whether the identity or capability of a process being monitored > > has changed. > > > > Currently, all operations on a '/proc/$pid' which you've previously > > opened and has died will give you -ESRCH. > > Not the case. Zombies have died, but profs operations work fine on zombies. It is the case if the process is dead in the sense that the PID might be re-used. That is what I meant be "dead" here, not semi-dead in the sense that zombies are. > >> > Similarly, it seems like > >> > it's probably safe to be able to open an fd that lets you watch the > >> > exit status of a process, have the process call setresuid(), and still > >> > see the exit status. > >> > >> Is it? That's an open question. > > > > Well, if we consider wait4(2) it seems that this is already the case. > > If you fork+exec a setuid binary you can definitely see its exit code. > > Only if you're the parent. Otherwise, you can't see the process exit > status unless you pass a ptrace access check and consult > /proc/pid/stat after the process dies, but before the zombie > disappears. Random unrelated and unprivileged processes can't see exit > statuses from distant parts of the system. Sure, I'd propose that ptrace_may_access() is what we should use for operation permission checks.
On Sun, Nov 18, 2018 at 1:30 PM, Christian Brauner <christian@brauner.io> wrote: > On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote: >> On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote: >> > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner >> > <christian@brauner.io> wrote: >> > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: >> > >> >> > >> >> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: >> > >> > >> > >> >> > >> > >> > >> > That is, I'm proposing an API that looks like this: >> > >> > >> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value) >> > >> > >> > >> > If, later, process_kill were to *also* accept process-capability FDs, >> > >> > nothing would break. >> > >> >> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. >> > > >> > > I can add a flag argument >> > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) >> > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now. >> > > That is siginfo_t is cleared and set to: >> > > >> > > info.si_signo = sig; >> > > info.si_errno = 0; >> > > info.si_code = SI_USER; >> > > info.si_pid = task_tgid_vnr(current); >> > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); >> > >> > That makes sense. I just don't want to get into a situation where >> > callers feel that they *have* to use the PID-based APIs to send a >> > signal because process_kill doesn't offer some bit of functionality. >> >> Yeah. >> >> > >> > Are you imagining something like requiring info t be NULL unless flags >> > contains some "I have a siginfo_t" value? >> >> Well, I was actually thinking about something like: >> >> /** >> * sys_process_signal - send a signal to a process trough a process file descriptor >> * @fd: the file descriptor of the process >> * @sig: signal to be sent >> * @info: the signal info >> * @flags: future flags to be passed >> */ >> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, >> int, flags) >> { >> struct pid *pid; >> struct fd *f; >> kernel_siginfo_t kinfo; >> >> /* Do not allow users to pass garbage. */ >> if (flags) >> return -EINVAL; >> >> int ret = __copy_siginfo_from_user(sig, &kinfo, info); >> if (unlikely(ret)) >> return ret; >> >> /* For now, enforce that caller's creds are used. */ >> kinfo.si_pid = task_tgid_vnr(current); >> kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); How about doing it this way? If info is NULL, act like kill(2); otherwise, act like rt_sigqueueinfo(2). (Not actual working or compiled code.) SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, int, flags) { struct fd f = { 0 }; kernel_siginfo_t kinfo; int ret; /* Make API extension possible. */ ret = -EINVAL; if (flags) goto out; ret = -EBADF; f = fdget(fd); if (!f.file) goto out; ret = mumble_mumble_check_real_proc_file(f.file); if (ret) goto out; /* Act like kill(2) or rt_sigqueueinfo(2) depending on whether * the user gave us a siginfo structure. */ if (info) { ret = __copy_siginfo_from_user(sig, &kinfo, info); if (ret) goto out; /* Combine this logic with rt_sigqueueinfo(2) */ ret = -EPERM; if ((info->si_code >= 0 || info->si_code == SI_TKILL) && (task_pid_vnr(current) != pid)) goto out; } else { /* Combine this logic with kill(2) */ clear_siginfo(&kinfo); kinfo.si_signo = sig; kinfo.si_errno = 0; kinfo.si_code = SI_USER; kinfo.si_pid = task_tgid_vnr(current); kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); } ret = kill_pid_info(sig, &kinfo, proc_pid(file_inode(f.file))); out: if (f.file) fput(f); return ret; }
On Sun, Nov 18, 2018 at 04:31:22PM -0800, Daniel Colascione wrote: > On Sun, Nov 18, 2018 at 1:30 PM, Christian Brauner <christian@brauner.io> wrote: > > On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote: > >> On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote: > >> > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner > >> > <christian@brauner.io> wrote: > >> > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote: > >> > >> > >> > >> > >> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <dancol@google.com> wrote: > >> > >> > > >> > >> > >> > >> > > >> > >> > That is, I'm proposing an API that looks like this: > >> > >> > > >> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value) > >> > >> > > >> > >> > If, later, process_kill were to *also* accept process-capability FDs, > >> > >> > nothing would break. > >> > >> > >> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > >> > > > >> > > I can add a flag argument > >> > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags) > >> > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now. > >> > > That is siginfo_t is cleared and set to: > >> > > > >> > > info.si_signo = sig; > >> > > info.si_errno = 0; > >> > > info.si_code = SI_USER; > >> > > info.si_pid = task_tgid_vnr(current); > >> > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > >> > > >> > That makes sense. I just don't want to get into a situation where > >> > callers feel that they *have* to use the PID-based APIs to send a > >> > signal because process_kill doesn't offer some bit of functionality. > >> > >> Yeah. > >> > >> > > >> > Are you imagining something like requiring info t be NULL unless flags > >> > contains some "I have a siginfo_t" value? > >> > >> Well, I was actually thinking about something like: > >> > >> /** > >> * sys_process_signal - send a signal to a process trough a process file descriptor > >> * @fd: the file descriptor of the process > >> * @sig: signal to be sent > >> * @info: the signal info > >> * @flags: future flags to be passed > >> */ > >> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, > >> int, flags) > >> { > >> struct pid *pid; > >> struct fd *f; > >> kernel_siginfo_t kinfo; > >> > >> /* Do not allow users to pass garbage. */ > >> if (flags) > >> return -EINVAL; > >> > >> int ret = __copy_siginfo_from_user(sig, &kinfo, info); > >> if (unlikely(ret)) > >> return ret; > >> > >> /* For now, enforce that caller's creds are used. */ > >> kinfo.si_pid = task_tgid_vnr(current); > >> kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); > > How about doing it this way? If info is NULL, act like kill(2); > otherwise, act like rt_sigqueueinfo(2). > > (Not actual working or compiled code.) > > SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info, > int, flags) > { > struct fd f = { 0 }; > kernel_siginfo_t kinfo; > int ret; > > /* Make API extension possible. */ > ret = -EINVAL; > if (flags) > goto out; > > ret = -EBADF; > f = fdget(fd); > if (!f.file) > goto out; > > ret = mumble_mumble_check_real_proc_file(f.file); > if (ret) > goto out; > > /* Act like kill(2) or rt_sigqueueinfo(2) depending on whether > * the user gave us a siginfo structure. > */ > if (info) { > ret = __copy_siginfo_from_user(sig, &kinfo, info); > if (ret) > goto out; > /* Combine this logic with rt_sigqueueinfo(2) */ > ret = -EPERM; > if ((info->si_code >= 0 || info->si_code == SI_TKILL) && > (task_pid_vnr(current) != pid)) > goto out; > > } else { > /* Combine this logic with kill(2) */ > clear_siginfo(&kinfo); > kinfo.si_signo = sig; > kinfo.si_errno = 0; > kinfo.si_code = SI_USER; > kinfo.si_pid = task_tgid_vnr(current); > kinfo.si_uid = from_kuid_munged(current_user_ns(), > current_uid()); > } > > ret = kill_pid_info(sig, &kinfo, proc_pid(file_inode(f.file))); > > out: > if (f.file) > fput(f); > return ret; > } Right, allowing to ass NULL might make sense. I had: /** * sys_process_signal - send a signal to a process trough a process file descriptor * @fd: the file descriptor of the process * @sig: signal to be sent * @info: the signal info * @flags: future flags to be passed */ SYSCALL_DEFINE4(procfd_kill, int, fd, int, sig, siginfo_t __user *, info, int, flags) { int ret; struct pid *pid; kernel_siginfo_t kinfo; struct fd f; /* Do not allow users to pass garbage. */ if (flags) return -EINVAL; ret = __copy_siginfo_from_user(sig, &kinfo, info); if (unlikely(ret)) return ret; /* For now, enforce that caller's creds are used. */ kinfo.si_pid = task_tgid_vnr(current); kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid()); f = fdget_raw(fd); if (!f.file) return -EBADF; ret = -EINVAL; /* Is this a process file descriptor? */ if (!proc_is_procfd(f.file) || !d_is_dir(f.file->f_path.dentry)) goto err; pid = f.file->private_data; if (!pid) goto err; ret = -EPERM; /* * Not even root can pretend to send signals from the kernel. * Nor can they impersonate a kill()/tgkill(), which adds source info. */ if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) && (task_pid(current) != pid)) goto err; ret = kill_pid_info(sig, &kinfo, pid); err: fdput(f); return ret; }
On Sun, Nov 18, 2018 at 4:09 PM, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: >> On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <cyphar@cyphar.com> wrote: >> > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: >> >> > Here's my point: if we're really going to make a new API to manipulate >> >> > processes by their fd, I think we should have at least a decent idea >> >> > of how that API will get extended in the future. Right now, we have >> >> > an extremely awkward situation where opening an fd in /proc requires >> >> > certain capabilities or uids, and using those fds often also checks >> >> > current's capabilities, and the target process may have changed its >> >> > own security context, including gaining privilege via SUID, SGID, or >> >> > LSM transition rules in the mean time. This has been a huge source of >> >> > security bugs. It would be nice to have a model for future APIs that >> >> > avoids these problems. >> >> > >> >> > And I didn't say in my proposal that a process's identity should >> >> > fundamentally change when it calls execve(). I'm suggesting that >> >> > certain operations that could cause a process to gain privilege or >> >> > otherwise require greater permission to introspect (mainly execve) >> >> > could be handled by invalidating the new process management fds. >> >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't >> >> > necessarily mean that: >> >> > >> >> > fd = process_open_management_fd(1); >> >> > [init reexecs] >> >> > process_do_something(fd); >> >> > >> >> > needs to work. >> >> >> >> PID 1 is a bad example here, because it doesn't get recycled. Other >> >> PIDs do. The snippet you gave *does* need to work, in general, because >> >> if exec invalidates the handle, and you need to reopen by PID to >> >> re-establish your right to do something with the process, that process >> >> may in fact have died between the invalidation and your reopen, and >> >> your reopened FD may refer to some other random process. >> > >> > I imagine the error would be -EPERM rather than -ESRCH in this case, >> > which would be incredibly trivial for userspace to differentiate >> > between. >> >> Why would userspace necessarily see EPERM? The PID might get recycled >> into a different random process that the caller has the ability to >> affect. > > I'm not sure what you're talking about. execve() doesn't change the PID > of a process, and in the case we are talking about: > > pidX_handle = open_pid_handle(pidX); > [ pidX execs a setuid binary ] > do_something(pidX_handle); > > pidX still has the same PID (so PID recycling is irrelevant in this > case). The key point is whether do_something() should give you an error > in such a state transition, and in that case I would say you'd get > -EPERM which would indicate (obviously) insufficient privileges. EPERM is the wrong error. All that's happened here is that the process has execed itself; you may still have permission to operate on the post-execve process. ESTALE is the right error here. But yes, there is a PID trace. What do you do after getting ESTALE? You reopen the handle and retry your operation. How do you open a new handle? Unless you're using some awful /proc/self/fd/... hack, you reopen by PID. And at that point, you've introduced a PID race again. That's why, in my sketch below, I imagined creating the capability handle from the process-identity handle and not, as in the snippet above, directly from the PID. >> Anyway: what other API requires, for correct operation, occasional >> reopening through /proc/self/fd? It's cumbersome, and it doesn't add >> anything. If we invalidate process handles on execve, and processes >> are legally allowed to re-exec themselves for arbitrary reasons at any >> time, that's tantamount to saying that handles might become invalid at >> any time and that all callers must be prepared to go through the >> reopen-and-retry path before any operation. > > O_PATH. In container runtimes this is necessary for several reasons to > protect against malicious container root filesystems as well as avoiding > exposing a dirfd to the container. > > In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other > operations. In runc we use it for FIFO re-opening so that we can signal > pid1 in a container to execve() into user code. > > So this isn't a new thing. Yuck. I'd still argue that 1) the reopen trick isn't really intended as the mainline path for that kernel functionality, and 2) there ought to be a way to do what you're describing in a cleaner way. I'd classify this approach as a hack. It's one thing to require a hack in specialized container initialization code, but it's another to bake it into a hopefully-common API for something as fundamental as process management, especially when there's a perfectly good alternative that doesn't require this hack. >> Why are we making them do that? So that a process can have an open FD >> that represents a process-operation capability? Which capability does >> the open FD represent? > > The re-opening part was just an argument to show that there isn't a > condition where you wouldn't be able to get access to the 'struct pid'. > I doubt that anyone would actually need to use this -- since you'd need > to pass "/proc/pid/fd/..." to a more privileged process in order to use > the re-opening. > > But this also means that we don't need to have a concept of a pidfd that > isn't actually associated with a PID but is instead associated with > current->mm (which is what you appear to be proposing with the whole > "identity fd" concept). Not current->mm; that can be shared with clone. struct signal is the right long-term identity. It's usually easier to keep the struct pid around though, which is exactly what a procfs FD is today: just a lightweight handle to a struct pid. >> I think when you and Andy must be talking about is an API that looks like this: >> >> int open_process_operation_handle(int procfs_dirfd, int capability_bitmask) >> >> capability_bitmask would have bits like >> >> PROCESS_CAPABILITY_KILL --- send a signal >> PROCESS_CAPABILITY_PTRACE --- attach to a process >> PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin >> PROCESS_CAPABILITY_READ_CMDLINE --- etc. >> >> Then you'd have system calls like >> >> int process_kill(int process_capability_fd, int signo, const union sigval data) >> int process_ptrace_attach(int process_capability_fd) >> int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) >> >> that worked on these capability bits. If a process execs or does >> something else to change its security capabilities, operations on >> these capability FDs would fail with ESTALE or something and callers >> would have to re-acquire their capabilities. >> >> This approach works fine. It has some nice theoretical properties, and >> could allow for things like nicer privilege separation for debuggers. >> I wouldn't mind something like this getting into the kernel. > > Andy might be arguing for this (and as you said, I can see the benefit > of doing it this way). > > I'm not convinced that doing permission checks on-open is necessary here > -- I get Andy's point about write(2) semantics but I think a new set of > proc_* syscalls wouldn't need to follow those semantics. I might be > wrong though. For now, it's fine to just expose system calls that operate directly on the procfs dfd. >> I just don't think this model is necessary right now. I want a small >> change from what we have today, one likely to actually make it into >> the tree. And bypassing the capability FDs and just allowing callers >> to operate directly on process *identity* FDs, using privileges in >> effect at the time of all, is at least no worse than what we have now. >> >> That is, I'm proposing an API that looks like this: >> >> int process_kill(int procfs_dfd, int signo, const union sigval value) >> >> If, later, process_kill were to *also* accept process-capability FDs, >> nothing would break. > > Again, I think we should agree on whether it's necessary to have both > types of fds before we commit to maintaining both APIs forever... I don't think noting that an API *could* be extended in a certain way in the future creates any obligation to decide, immediately, whether that extension will ever be needed. Right now, I don't see a reason to supply the capability FD API I described. I'm just saying that it could be added in a low-friction way if necessary one day. >> >> > Similarly, it seems like >> >> > it's probably safe to be able to open an fd that lets you watch the >> >> > exit status of a process, have the process call setresuid(), and still >> >> > see the exit status. >> >> >> >> Is it? That's an open question. >> > >> > Well, if we consider wait4(2) it seems that this is already the case. >> > If you fork+exec a setuid binary you can definitely see its exit code. >> >> Only if you're the parent. Otherwise, you can't see the process exit >> status unless you pass a ptrace access check and consult >> /proc/pid/stat after the process dies, but before the zombie >> disappears. Random unrelated and unprivileged processes can't see exit >> statuses from distant parts of the system. > > Sure, I'd propose that ptrace_may_access() is what we should use for > operation permission checks. The tricky part is that ptrace_may_access takes a struct task. We want logic that's *like* ptrace_may_access, but that works posthumously. It's especially tricky because there's an LSM hook that lets __ptrace_may_access do arbitrary things. And we can't just run that hook upon process death, since *after* a process dies, a process holding an exithand FD (or whatever we call it) may pass that FD to another process, and *that* process can read(2) from it. Another option is doing the exithand access check at open time. I think that's probably fine, and it would make things a lot simpler. But if we use this option, we should understand what we're doing, and get some security-conscious people to think through the implications.
On Sun, Nov 18, 2018 at 4:08 PM, Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-11-18, Daniel Colascione <dancol@google.com> wrote: >> > The gist is to have file descriptors for processes which is obviously not a new >> > idea. This has been done before in other OSes and it has been tried before in >> > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to >> > make it very clear that I'm not laying claim to this being my or even a novel >> > idea in any way. However, I want to diverge from previous approaches with my >> > suggestion. (Though I can't be sure that there's not something similar in other >> > OSes already.) >> >> Windows works basically as you describe. You can create a process is >> suspended state, configure it however you want, then let it run. >> CreateProcess (and even moreso, NtCreateProcess) also provide a rich >> (and *extensible*) interface for pre-creation process configuration. >> >> >> One of the main motivations for having procfds is to have a race-free way of >> > configuring, starting, polling, and killing a process. Basically, a process >> > lifecycle api if you want to think about it that way. The api should also be >> > easily extendable in the future to avoid running into the limitations we >> > currently see with the clone*() syscall(s) again. >> > >> > One of the crucial points of the api is to *separate the configuration >> > of a process through a procfd from actually creating the process*. >> > This is a crucial property expressed in the open*() system calls. First, get a >> > stable handle on an object then allow for ways to configure it. As such the >> > procfd api shares the same insight with Al's and David's new mount api. >> > (Fwiw, Andy also pointed out similarities with posix_spawn().) >> > What I envisioned was to have the following syscalls (multiple name suggestions): >> > >> > 1. int process_open / proc_open / procopen >> > 2. int process_config / proc_config / procconfig or ioctl()-based >> > 3. int process_info / proc_info / procinfo or ioctl()-based >> > 4. int process_manage / proc_manage / procmanage or ioctl()-based >> >> The API you've proposed seems fine to me, although I'd either 1) >> consolidate operations further into one system call, or 2) separate >> the different management operations into more and different system >> calls that can be audited independently. The grouping you've proposed >> seems to have the worst aspects of API splitting and API multiplexing. >> But I have no objection to it in spirit. > > I think combining it all into one API is going to be a soft re-invention > of ioctls, but specifically for procfds. This would be an improvement > over just ioctls (since the current ioctl namespacing is based on > well-behaved drivers and hoping we never have more than 256 ioctl > drivers), but I'm not sure it would help make the API nicer than having > separate syscalls (we'd have to do something similar to bpf(2) which I'm > not a huge fan of). Right. Multiplexers are nothing new, and I'm not a huge fan of them. From an API design perspective, having a bunch of different system calls is probably best. (I do wonder what happens to system call cache behavior once the top-level system call table becomes huge though.) >> That said, while I do want to fix process configuration and startup >> generally, I want to fix specific holes in the existing API surface >> first. The two patches I've already sent do that, and this work >> shouldn't wait on an ideal larger process-API overhaul that may or may >> not arrive. Based on previous history, I suspect that an API of the >> scope you're proposing would take years to overcome all LKML >> objections and land. I don't want to wait that long when we can make >> smaller fixes that would not conflict with the general architecture. > > I believe this is precisely what Christian is trying to do with this > patch (and you say as much later in your mail). > > I think that adding all of {sighand,sighand_exitcode,kill,...} would not > help the path of landing a much larger API change. We should instead > think about the API we want at the end of the day, and then land smaller > changes which are long-term compatible (and won't just become deprecated > APIs -- there's far too many of them, let's not add more needlessly). I don't think we need to reach consensus on some long-term design to address specific problems that we know today. The changes we're talking about here *are* long-term compatible with a bigger process API overhaul. They may or may not be *part* of that solution, but I don't see them making that solution harder either. And the proposals so far all seem to go in the right direction. >> Next, I want to merge my exithand proposal, or something like it. It's >> likewise a simple change that, in a minimal way, addresses a >> longstanding API deficiency. I'm very strongly against the >> POLLERR-on-directory variant of the idea. > > I agree with you on this need. I will admit I do somewhat like the EOF > solution (mainly because it seamlessly deals with the multi-reader case) > but I'm still not sure I like /proc/$pid/exit_sighand. As mentioned in > the other discussion, ideally we would be only ever operating with the This sentence got cut off. > An ugly strawman of an alternative would be an interface that gave you > an fd that you could similarly wait-until-EOF on, but that's probably > not a major API improvement unless we also make said API provide exit > status information in a way that works with the > multiple-readers-with-different-creds usecase. What about something like this then? #define PROCESS_EXIT_HANDLE_CLOEXEC (1<<0) #define PROCESS_EXIT_HANDLE_NONBLOCK (1<<1) #define PROCESS_EXIT_HANDLE_WANT_STATUS (1<<2) /* Open an "status handle" for the process identified by PROCFS_DFD, * which must be an open descriptor to a /proc/pid directory. FLAGS is * a combination of zero or more of the * PROCESS_EXIT_HANDLE_* constants. * * Return -1 on error. On success, return a descriptor for a process * status handle. Before the process identified by PROCFS_DFD exits, * reads from the status handle block. After exit, reads from the * status handle yield either EOF (if PROCESS_EXIT_HANDLE_WANT_STATUS * is not specified) or a siginfo_t describing how the process exited * (if PROCESS_EXIT_HANDLE_WANT_STATUS is specified), as from * waitid(2). * * The returned file descriptor also supports poll(2) and other * notification APIs. * * Any process may call process_get_statusfd from any PROCFS_DFD * without PROCESS_EXIT_HANDLE_WANT_STATUS. * If PROCESS_EXIT_HANDLE_WANT_STATUS is specified, PROCFS_DFD must * refer either to the calling process (or one of its threads), a * child of the current process, or a process for which the current * process would be able to successfully call ptrace(2). * * The PROCESS_EXIT_HANDLE_WANT_STATUS permission check happens only * at open time, not at read time, and the process handle can be transferred like any other FD. */ int process_get_statusfd(int procfs_dfd, int flags); > One other thing I think we should eventually consider is to provide an > API which pings a listener whenever a process does an execve() (and > possibly fork()). I thought about providing this facility too, but it's not immediately apparent to me who would use it. ISTM that most callers that would want this would be happy grabbing the process with ptrace or passively monitoring it with ftrace or the process connector.
On Sun, Nov 18, 2018 at 4:53 PM, Daniel Colascione <dancol@google.com> wrote: >> Sure, I'd propose that ptrace_may_access() is what we should use for >> operation permission checks. > > The tricky part is that ptrace_may_access takes a struct task. We want > logic that's *like* ptrace_may_access, but that works posthumously. > It's especially tricky because there's an LSM hook that lets > __ptrace_may_access do arbitrary things. And we can't just run that > hook upon process death, since *after* a process dies, a process > holding an exithand FD (or whatever we call it) may pass that FD to > another process, and *that* process can read(2) from it. > > Another option is doing the exithand access check at open time. I > think that's probably fine, and it would make things a lot simpler. > But if we use this option, we should understand what we're doing, and > get some security-conscious people to think through the implications. A ptrace check is also probably too strict. Yama's ptrace_scope feature will block ptrace between unrelated processes within a single user context, but applying this restriction to exit code monitoring seems too severe to me.
On Sun, Nov 18, 2018 at 12:32 PM Daniel Colascione <dancol@google.com> wrote: > > On Sun, Nov 18, 2018 at 12:28 PM, Andy Lutomirski <luto@amacapital.net> wrote: > >> That is, I'm proposing an API that looks like this: > >> > >> int process_kill(int procfs_dfd, int signo, const union sigval value) > >> > >> If, later, process_kill were to *also* accept process-capability FDs, > >> nothing would break. > > > > Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away. > > Sure. A flag might make for better ergonomics. > > >> Yes, that's what I have in mind. A siginfo_t is small enough that we > >> could just store it as a blob allocated off the procfs inode or > >> something like that without bothering with a shmfs file. You'd be able > >> to read(2) the exit status as many times as you wanted. > > > > I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly. > > Why? The file pointer would work normally. Can you explain the exact semantics? If I have an fd where read(2) returns the same 4-byte value every time read(2) is called, then cat will just return an infinite sequence of the same value. This is not a complete disaster, but it's not really a good thing. > > > Read and poll have the same problem as write: we can’t check caps in read or poll either. > > Why not? Reading /proc/pid/stat does an access check today and > conditionally replaces the exit status with zero. And that's probably a bug. It's at least a giant kludge that we shouldn't copy. Here is the general rule: the basic operations that are expected to treat file descriptors as capabilities *must* treat file descriptors as capabilities, at least for new APIs. This includes read(2), write(2), and poll(2). We should have an exceedingly good reason to check current's creds, mm, or anything else about current in those syscalls. There is a good reason for this: consider what happens if you type: sudo >/proc/PID/whatever or sudo </proc/PID/whatever
On Sun, Nov 18, 2018 at 09:42:35AM -0800, Andy Lutomirski wrote: > Now here's the kicker: if the "running program" calls execve(), it > goes away. The fd gets some sort of notification that this happened Type error, parser failed. Define "fd", please. If it's a "file descriptor", thank you do playing, you've lost. That's not going to work. If it's "opened file" (aka "file description" in horrible POSIXese), who's going to get notifications and what kind of exclusion are you going to use?
On Sun, Nov 18, 2018 at 6:47 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sun, Nov 18, 2018 at 09:42:35AM -0800, Andy Lutomirski wrote: > > > Now here's the kicker: if the "running program" calls execve(), it > > goes away. The fd gets some sort of notification that this happened > > Type error, parser failed. > > Define "fd", please. If it's a "file descriptor", thank you do playing, > you've lost. That's not going to work. If it's "opened file" (aka > "file description" in horrible POSIXese), who's going to get notifications > and what kind of exclusion are you going to use? What I meant was: a program that has one of these fds would be able to find out that an execve() happened and the program needs to refresh its access to the target task. This could be as simple as POLLHUP and, if needed, some syscall indicating exactly why we got POLLHUP (e.g. execve vs exit). There would be some sort of indication that a program that holds an fd pointing at an "opened file" could get -- probably poll() would return some status indicating that execve() happened and our capability is gone, and, if needed
Hi Christian,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3 next-20181116]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-killing-processes-via-file-descriptors/20181119-170316
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv
All warnings (new ones prefixed by >>):
./usr/include/linux/v4l2-controls.h:1105: found __[us]{8,16,32,64} type without #include <linux/types.h>
>> ./usr/include/linux/procfd.h:8: found __[us]{8,16,32,64} type without #include <linux/types.h>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
From: > Christian Brauner > Sent: 18 November 2018 11:18 > > With this patch an open() call on /proc/<pid> will give userspace a handle > to struct pid of the process associated with /proc/<pid>. This allows to > maintain a stable handle on a process. My 3c... You need to add a version of fork() that returns an open fd to /proc/pid to the parent. Is it possible to overload fcntl() rather than ioctl() ? More interestingly what about a 'unique pid' (eg the pid extended to (say) 128 bits with a use count) that can be safely put into a /var/run/pid file for a daemon and used later in a 'kill' that will only ever reference the correct process. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, Nov 18, 2018 at 12:17:51PM +0100, Christian Brauner wrote: > With this patch an open() call on /proc/<pid> will give userspace a handle > to struct pid of the process associated with /proc/<pid>. This allows to > maintain a stable handle on a process. > I have been discussing various approaches extensively during technical > conferences this year culminating in a long argument with Eric at Linux > Plumbers. The general consensus was that having a handle on a process > will be something that is very simple and easy to maintain with the > option of being extensible via a more advanced api if the need arises. I > believe that this patch is the most simple, dumb, and therefore > maintainable solution. > > The need for this has arisen in order to reliably kill a process without > running into issues of the pid being recycled as has been described in the > rejected patch [1]. To fulfill the need described in that patchset a new It would certainly be good to fix this. IIUC, things like pkill(1) are a gamble today and can probably kill a process that doesn't fulfil the match criteria due to PID recycling. (If not, I'd certainly like to understand how that is prevented.) > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a > process via a file descriptor: > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); > ioctl(fd, PROC_FD_SIGNAL, SIGKILL); > close(fd); > > Note, the stable handle will allow us to carefully extend this feature in > the future. A concern here would be that an fd-based shadow API may need to be created, duplicating the whole PID-based API that already exists. However, so long as the PID is stabilised against recycling, an ordinary kill() call seems fine as a way to kill the target process, and no new API or permission model seems to be needed. It occurs to me that a mechanism for holding a reference on a third process already exists: ptrace. Suppose we were to have something like ptrace(PTRACE_MONITOR, pid, 0, 0); that subscribes the caller for the same set of notifications via wait() as the process' real parent gets, and prevents PID recycling until the zombie is consumed be everyone who is subscribed. Multiple PTRACE_MONITOR attachments could be allowed for a given target, in addition to the real parent and regular ptrace-parent (if any). There are a couple of wrinkles: * ptrace() operates on tasks, not processes, so the precise semantics of PTRACE_MONITOR attachment would need a bit of thought. * Odd mechanisms for discovering PIDs, like the unix(7) SCM_CREDENTIALS message might need a special variant to get a PTRACE_MONITOR attachment along with the message. This variant behaviour would need to be opt-in for the recipient. OTOH, extending ptrace may bring problems of its own :/ Cheers ---Dave
On Sun, 18 Nov 2018 at 18:30, Andy Lutomirski <luto@kernel.org> wrote: > Here's my point: if we're really going to make a new API to manipulate > processes by their fd, I think we should have at least a decent idea > of how that API will get extended in the future. Right now, we have > an extremely awkward situation where opening an fd in /proc requires > certain capabilities or uids, and using those fds often also checks > current's capabilities, and the target process may have changed its > own security context, including gaining privilege via SUID, SGID, or > LSM transition rules in the mean time. This has been a huge source of > security bugs. It would be nice to have a model for future APIs that > avoids these problems. > > And I didn't say in my proposal that a process's identity should > fundamentally change when it calls execve(). I'm suggesting that > certain operations that could cause a process to gain privilege or > otherwise require greater permission to introspect (mainly execve) > could be handled by invalidating the new process management fds. > Sure, if init re-execs itself, it's still PID 1, but that doesn't > necessarily mean that: > > fd = process_open_management_fd(1); > [init reexecs] > process_do_something(fd); > > needs to work. > > > > > > setresuid() has no effect > > > here -- if you have W access to the process and the process calls > > > setresuid(), you still have W access. > > > > Now you've created a situation in which an operation that security > > policy previously blocked now becomes possible, invaliding previous > > designs based on the old security invariant. That's the definition of > > introducing a security hole. > > I think you're overstating your case. To a pretty good approximation, > setresuid() allows the caller to remove elements from the set {ruid, > suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a > process before it calls setresuid(), you might as well be able to > ptrace() it after, since you could have just ptraced it and made it > call setresuid() while still ptracing it. Similarly, it seems like > it's probably safe to be able to open an fd that lets you watch the > exit status of a process, have the process call setresuid(), and still > see the exit status. > > Regardless of how you feel about these issues, if you're going to add > an API by which you open an fd, wait for a process to exit, and read > the exit status, you need to define the conditions under which you may > open the fd and under which you may read the exit status once you have > the fd. There are probably multiple valid answers, but the question > still needs to be answered. My POLLERR hack, aside from being ugly, > avoids this particular issue because it merely lets you wait for > something you already could have observed using readdir(). Beg your pardon for hijacking the thread.. I wonder how fast it would be holding a pid with another open()ed fd. And then you need to read comm (or how you filter whom to kill). It seems to me that procfs will be even slower with this safe-way. But I might misunderstand the idea, excuses. So, I just wanted to gently remind about procfs with netlink socket[1]. It seems to me that whenever you receive() pid information, you can request some uniq 64(?) bit number and kill the process using it. Whenever uniqueness of 64-bit number to handle pids will be a question the netlink message might be painlessly extended to 128 or whatever. Also, it may provide the facilities to atomically kill process say by name by adding another field to netlink message. Probably, if it's time to add a new API for procfs, netlink may be more desirable. [1]: https://lwn.net/Articles/650243/ Thanks, Dmitry
Dmitry Safonov <0x7f454c46@gmail.com> writes: > > So, I just wanted to gently remind about procfs with netlink socket[1]. > It seems to me that whenever you receive() pid information, you > can request some uniq 64(?) bit number and kill the process using it. > Whenever uniqueness of 64-bit number to handle pids will be a question > the netlink message might be painlessly extended to 128 or whatever. No. I have seen this requested twice in this thread now, and despite understanding where everyone is coming from I do not believe it will be wise to implement larger pids. The problem is that we then have to make these larger pids live in the pid namespace, make struct pid larger to hold them and update CRIU to save and restore them. All for a very small handful of processes that use this extended API. Eric
On Mon, Nov 19, 2018 at 8:13 AM, Dmitry Safonov <0x7f454c46@gmail.com> wrote: > I wonder how fast it would be holding a pid with another open()ed fd. > And then you need to read comm (or how you filter whom to kill). > It seems to me that procfs will be even slower with this safe-way. > But I might misunderstand the idea, excuses. > > So, I just wanted to gently remind about procfs with netlink socket[1]. We discussed netlink was extensively on the thread about /proc/pid/kill. For numerous reasons, it's not suitable for fundamental process management. We really need an FD-based interface to processes, just like we have FD-based interfaces to other resource types. We need something consistent and reliable, not an abuse of a monitoring interface. > Probably, if it's time to add a new API for procfs, netlink may be more > desirable. Definitely not.
On 2018-11-19, Daniel Colascione <dancol@google.com> wrote: > > I wonder how fast it would be holding a pid with another open()ed fd. > > And then you need to read comm (or how you filter whom to kill). > > It seems to me that procfs will be even slower with this safe-way. > > But I might misunderstand the idea, excuses. > > > > So, I just wanted to gently remind about procfs with netlink socket[1]. > > We discussed netlink was extensively on the thread about > /proc/pid/kill. For numerous reasons, it's not suitable for > fundamental process management. We really need an FD-based interface > to processes, just like we have FD-based interfaces to other resource > types. We need something consistent and reliable, not an abuse of a > monitoring interface. Another significant problem with using netlink for something like this is that (as its name suggest) it's tied to network namespaces and not pid namespaces so you wouldn't reasonably be able to use the API inside a container. Using an fd side-steps the problem somewhat (though this just gave me an idea -- I will add it to the other thread).
diff --git a/fs/proc/base.c b/fs/proc/base.c index ce3465479447..dfde564a21eb 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -88,6 +88,7 @@ #include <linux/fs_struct.h> #include <linux/slab.h> #include <linux/sched/autogroup.h> +#include <linux/sched/signal.h> #include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/debug.h> @@ -95,6 +96,7 @@ #include <linux/flex_array.h> #include <linux/posix-timers.h> #include <trace/events/oom.h> +#include <uapi/linux/procfd.h> #include "internal.h" #include "fd.h" @@ -3032,10 +3034,41 @@ static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff)); } +static int proc_tgid_open(struct inode *inode, struct file *file) +{ + /* grab reference to struct pid */ + file->private_data = get_pid(proc_pid(inode)); + return 0; +} + +static int proc_tgid_release(struct inode *inode, struct file *file) +{ + struct pid *pid = file->private_data; + /* drop reference to struct pid */ + put_pid(pid); + return 0; +} + +static long proc_tgid_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + struct pid *pid = file->private_data; + + switch (cmd) { + case PROC_FD_SIGNAL: + return kill_pid(pid, arg, 1); + default: + return -EINVAL; + } +} + static const struct file_operations proc_tgid_base_operations = { + .open = proc_tgid_open, .read = generic_read_dir, .iterate_shared = proc_tgid_base_readdir, .llseek = generic_file_llseek, + .release = proc_tgid_release, + .unlocked_ioctl = proc_tgid_ioctl, }; static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) diff --git a/include/uapi/linux/procfd.h b/include/uapi/linux/procfd.h new file mode 100644 index 000000000000..8e4c07a9f3a3 --- /dev/null +++ b/include/uapi/linux/procfd.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef __LINUX_PROCFD_H +#define __LINUX_PROCFD_H + +#include <linux/ioctl.h> + +/* Returns a file descriptor that refers to a struct pid */ +#define PROC_FD_SIGNAL _IOW('p', 1, __s32) + +#endif /* __LINUX_PROCFD_H */ +
With this patch an open() call on /proc/<pid> will give userspace a handle to struct pid of the process associated with /proc/<pid>. This allows to maintain a stable handle on a process. I have been discussing various approaches extensively during technical conferences this year culminating in a long argument with Eric at Linux Plumbers. The general consensus was that having a handle on a process will be something that is very simple and easy to maintain with the option of being extensible via a more advanced api if the need arises. I believe that this patch is the most simple, dumb, and therefore maintainable solution. The need for this has arisen in order to reliably kill a process without running into issues of the pid being recycled as has been described in the rejected patch [1]. To fulfill the need described in that patchset a new ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a process via a file descriptor: int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC); ioctl(fd, PROC_FD_SIGNAL, SIGKILL); close(fd); Note, the stable handle will allow us to carefully extend this feature in the future. [1]: https://lkml.org/lkml/2018/10/30/118 Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <christian@brauner.io> --- fs/proc/base.c | 33 +++++++++++++++++++++++++++++++++ include/uapi/linux/procfd.h | 11 +++++++++++ 2 files changed, 44 insertions(+) create mode 100644 include/uapi/linux/procfd.h