Message ID | 20181228224853.26032-1-christian@brauner.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v5,RESEND] signal: add pidfd_send_signal() syscall | expand |
On Fri, 28 Dec 2018 23:48:53 +0100 Christian Brauner <christian@brauner.io> wrote: > The kill() syscall operates on process identifiers (pid). After a process > has exited its pid can be reused by another process. If a caller sends a > signal to a reused pid it will end up signaling the wrong process. This > issue has often surfaced and there has been a push to address this problem [1]. > > This patch uses file descriptors (fd) from proc/<pid> as stable handles on > struct pid. Even if a pid is recycled the handle will not change. The fd > can be used to send signals to the process it refers to. > Thus, the new syscall pidfd_send_signal() is introduced to solve this > problem. Instead of pids it operates on process fds (pidfd). I can't see a description of what happens when the target process has exited. Is the task_struct pinned by the fd? Does the entire procfs directory remain visible? Just one entry within it? Does the pid remain reserved? Do attempts to signal that fd return errors? etcetera. These behaviors should be described in the changelog and manipulate please. The code in signal.c appears to be compiled in even when CONFIG_PROC_FS=y. Can we add the appropriate ifdefs and an entry in sys_ni.c? A selftest in toole/testing/selftests would be nice. And it will be helpful to architecture maintainers as they wire this up. The feature doesn't have its own Kconfig setting. Perhaps it should? It should presumably depend on PROC_FS. I must say that I dislike the linkage to procfs. procfs is a high-level thing which is manipulated using syscalls. To turn around and make a syscall dependent upon the presence of procfs seems just ... wrong. Is there a cleaner way of obtaining the fd? Another syscall perhaps. This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? I guess "no" - it presently assumes that anything which is written to that fd is a signal. Perhaps there should be a tag at the start of the message (which is what it is) which identifies the message's type? Now I think about it, why a new syscall? This thing is looking rather like an ioctl?
On Fri, Dec 28, 2018 at 03:20:12PM -0800, Andrew Morton wrote: > On Fri, 28 Dec 2018 23:48:53 +0100 Christian Brauner <christian@brauner.io> wrote: > > > The kill() syscall operates on process identifiers (pid). After a process > > has exited its pid can be reused by another process. If a caller sends a > > signal to a reused pid it will end up signaling the wrong process. This > > issue has often surfaced and there has been a push to address this problem [1]. > > > > This patch uses file descriptors (fd) from proc/<pid> as stable handles on > > struct pid. Even if a pid is recycled the handle will not change. The fd > > can be used to send signals to the process it refers to. > > Thus, the new syscall pidfd_send_signal() is introduced to solve this > > problem. Instead of pids it operates on process fds (pidfd). So most of the questions you ask have been extensively discussed in prior threads. All of them have links above in the long commit message. > > I can't see a description of what happens when the target process has > exited. Is the task_struct pinned by the fd? Does the entire procfs A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct. This is also noted in the comments to struct pid: https://elixir.bootlin.com/linux/v4.20-rc1/source/include/linux/pid.h#L16 > directory remain visible? Just one entry within it? Does the pid The same thing that happens when you currently keep an fd to /proc/<pid> open. > remain reserved? Do attempts to signal that fd return errors? The pid does not remain reserved. Which leads back to your question about what happens when you try to signal a process that has exited: you get ESRCH. > etcetera. These behaviors should be described in the changelog and > manipulate please. I can add those questions to the changelog too. > > The code in signal.c appears to be compiled in even when > CONFIG_PROC_FS=y. Can we add the appropriate ifdefs and an entry in > sys_ni.c? Without having looked super close at this from the top of my head I'd say, yes we can. > > A selftest in toole/testing/selftests would be nice. And it will be > helpful to architecture maintainers as they wire this up. I can extend the sample program in the commit message to a selftest. > > The feature doesn't have its own Kconfig setting. Perhaps it should? > It should presumably depend on PROC_FS. Not sure why we should do this. > > I must say that I dislike the linkage to procfs. procfs is a > high-level thing which is manipulated using syscalls. To turn around > and make a syscall dependent upon the presence of procfs seems just ... > wrong. Is there a cleaner way of obtaining the fd? Another syscall > perhaps. We may do something like this in the future. There's another syscall lined up at some point translate_pid() which we may extend to also give back an fd to a process. For now the open() on /proc/<pid> is the cleanest way of doing this. > > This fd-for-a-process sounds like a handy thing and people may well > think up other uses for it in the future, probably unrelated to > signals. Are the code and the interface designed to permit such future > applications? I guess "no" - it presently assumes that anything which This too has been discussed in prior threads linked in the commit message. Yes, it does permit of such extension and we have already publicly discussed future extensions (e.g. wait maybe a new clone syscall). > is written to that fd is a signal. Perhaps there should be a tag at > the start of the message (which is what it is) which identifies the > message's type? > > Now I think about it, why a new syscall? This thing is looking rather > like an ioctl? Again, we have had a lengthy discussion about this and by now we all agree that a dedicated syscall makes more sense than an ioctl() for security reasons, because processes are a core-kernel concept, and we intend to extend this api with syscalls in the future (e.g. wait etc. also discussed in prior threads linked in here). There's also a summary article on lwn about parts of this (https://lwn.net/Articles/773459/). Thanks! Christian
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3cf7b533b3d1..6804c1e84b36 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,3 +398,4 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +387 i386 pidfd_send_signal sys_pidfd_send_signal __ia32_sys_pidfd_send_signal diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..aa4b858fa0f1 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,7 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common pidfd_send_signal __x64_sys_pidfd_send_signal # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/proc/base.c b/fs/proc/base.c index ce3465479447..bf680b7b603a 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -716,8 +716,6 @@ static int proc_pid_permission(struct inode *inode, int mask) return generic_permission(inode, mask); } - - static const struct inode_operations proc_def_inode_operations = { .setattr = proc_setattr, }; @@ -3038,6 +3036,15 @@ static const struct file_operations proc_tgid_base_operations = { .llseek = generic_file_llseek, }; +struct pid *tgid_pidfd_to_pid(const struct file *file) +{ + if (!d_is_dir(file->f_path.dentry) || + (file->f_op != &proc_tgid_base_operations)) + return ERR_PTR(-EBADF); + + return proc_pid(file_inode(file)); +} + static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) { return proc_pident_lookup(dir, dentry, @@ -3422,6 +3429,15 @@ static const struct file_operations proc_tid_base_operations = { .llseek = generic_file_llseek, }; +struct pid *tid_pidfd_to_pid(const struct file *file) +{ + if (!d_is_dir(file->f_path.dentry) || + (file->f_op != &proc_tid_base_operations)) + return ERR_PTR(-EBADF); + + return proc_pid(file_inode(file)); +} + static const struct inode_operations proc_tid_base_inode_operations = { .lookup = proc_tid_base_lookup, .getattr = pid_getattr, diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index d0e1f1522a78..eb150e5c0ab8 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo int (*show)(struct seq_file *, void *), proc_write_t write, void *data); +extern struct pid *tgid_pidfd_to_pid(const struct file *file); +extern struct pid *tid_pidfd_to_pid(const struct file *file); #else /* CONFIG_PROC_FS */ @@ -114,6 +116,16 @@ static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p #define proc_create_net(name, mode, parent, state_size, ops) ({NULL;}) #define proc_create_net_single(name, mode, parent, show, data) ({NULL;}) +static inline struct pid *tgid_pidfd_to_pid(const struct file *file) +{ + return ERR_PTR(-EBADF); +} + +static inline struct pid *tid_pidfd_to_pid(const struct file *file) +{ + return ERR_PTR(-EBADF); +} + #endif /* CONFIG_PROC_FS */ struct net; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2ac3d13a915b..fd85b9045a9f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, int flags, uint32_t sig); +asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, + siginfo_t __user *info, + unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d90127298f12..b77538af7aca 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,9 +740,11 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) __SYSCALL(__NR_rseq, sys_rseq) #define __NR_kexec_file_load 294 __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) +#define __NR_pidfd_send_signal 295 +__SYSCALL(__NR_pidfd_send_signal, sys_pidfd_send_signal) #undef __NR_syscalls -#define __NR_syscalls 295 +#define __NR_syscalls 296 /* * 32 bit systems traditionally used different diff --git a/kernel/signal.c b/kernel/signal.c index 9a32bc2088c9..3c83d3a5c7c5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -19,7 +19,9 @@ #include <linux/sched/task.h> #include <linux/sched/task_stack.h> #include <linux/sched/cputime.h> +#include <linux/file.h> #include <linux/fs.h> +#include <linux/proc_fs.h> #include <linux/tty.h> #include <linux/binfmts.h> #include <linux/coredump.h> @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese, } #endif +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info) +{ + clear_siginfo(info); + info->si_signo = sig; + info->si_errno = 0; + info->si_code = SI_USER; + info->si_pid = task_tgid_vnr(current); + info->si_uid = from_kuid_munged(current_user_ns(), current_uid()); +} + /** * sys_kill - send a signal to a process * @pid: the PID of the process @@ -3295,16 +3307,133 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig) { struct kernel_siginfo info; - clear_siginfo(&info); - info.si_signo = sig; - info.si_errno = 0; - info.si_code = SI_USER; - info.si_pid = task_tgid_vnr(current); - info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); + prepare_kill_siginfo(sig, &info); return kill_something_info(sig, &info, pid); } +/* + * Verify that the signaler and signalee either are in the same pid namespace + * or that the signaler's pid namespace is an ancestor of the signalee's pid + * namespace. + */ +static bool access_pidfd_pidns(struct pid *pid) +{ + struct pid_namespace *active = task_active_pid_ns(current); + struct pid_namespace *p = ns_of_pid(pid); + + for (;;) { + if (!p) + return false; + if (p == active) + break; + p = p->parent; + } + + return true; +} + +static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) +{ +#ifdef CONFIG_COMPAT + /* + * Avoid hooking up compat syscalls and instead handle necessary + * conversions here. Note, this is a stop-gap measure and should not be + * considered a generic solution. + */ + if (in_compat_syscall()) + return copy_siginfo_from_user32( + kinfo, (struct compat_siginfo __user *)info); +#endif + return copy_siginfo_from_user(kinfo, info); +} + +/** + * sys_pidfd_send_signal - send a signal to a process through a task file + * descriptor + * @pidfd: the file descriptor of the process + * @sig: signal to be sent + * @info: the signal info + * @flags: future flags to be passed + * + * The syscall currently only signals via PIDTYPE_PID which covers + * kill(<positive-pid>, <signal>. It does not signal threads or process + * groups. + * In order to extend the syscall to threads and process groups the @flags + * argument should be used. In essence, the @flags argument will determine + * what is signaled and not the file descriptor itself. Put in other words, + * grouping is a property of the flags argument not a property of the file + * descriptor. + * + * Return: 0 on success, negative errno on failure + */ +SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, + siginfo_t __user *, info, unsigned int, flags) +{ + int ret; + struct fd f; + struct pid *pid; + kernel_siginfo_t kinfo; + + /* Enforce flags be set to 0 until we add an extension. */ + if (flags) + return -EINVAL; + + f = fdget_raw(pidfd); + if (!f.file) + return -EBADF; + + pid = tid_pidfd_to_pid(f.file); + if (!IS_ERR(pid)) { + /* + * Give userspace a way to detect /proc/<pid>/task/<tid> + * support when we add it. + */ + ret = -EOPNOTSUPP; + goto err; + } + + /* Is this a pidfd? */ + pid = tgid_pidfd_to_pid(f.file); + if (IS_ERR(pid)) { + ret = PTR_ERR(pid); + goto err; + } + + ret = -EINVAL; + if (!access_pidfd_pidns(pid)) + goto err; + + if (info) { + ret = copy_siginfo_from_user_any(&kinfo, info); + if (unlikely(ret)) + goto err; + + ret = -EINVAL; + if (unlikely(sig != kinfo.si_signo)) + goto err; + + if ((task_pid(current) != pid) && + (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) { + /* Only allow sending arbitrary signals to yourself. */ + ret = -EPERM; + if (kinfo.si_code != SI_USER) + goto err; + + /* Turn this into a regular kill signal. */ + prepare_kill_siginfo(sig, &kinfo); + } + } else { + prepare_kill_siginfo(sig, &kinfo); + } + + ret = kill_pid_info(sig, &kinfo, pid); + +err: + fdput(f); + return ret; +} + static int do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info) {