diff mbox series

[v4] signal: add taskfd_send_signal() syscall

Message ID 20181206121858.12215-1-christian@brauner.io (mailing list archive)
State New, archived
Headers show
Series [v4] signal: add taskfd_send_signal() syscall | expand

Commit Message

Christian Brauner Dec. 6, 2018, 12:18 p.m. UTC
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push [1] to address this
problem.

This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall taskfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (taskfd).

/* prototype and argument /*
long taskfd_send_signal(int taskfd, int sig, siginfo_t *info, unsigned int flags);

In addition to the taskfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
taskfd_send_signal() behaves like kill(). If it is not NULL
taskfd_send_signal() behaves like rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.

/* taskfd_send_signal() replaces multiple pid-based syscalls */
The taskfd_send_signal() syscall currently takes on the job of the
following syscalls that operate on pids:
- kill(2)
- rt_sigqueueinfo(2)
The syscall is defined in such a way that it can also operate on thread fds
instead of process fds. In a future patchset I will extend it to operate on
taskfds from /proc/<pid>/task/<tid> at which point it will additionally
take on the job of:
- tgkill(2)
- rt_tgsigqueueinfo(2)
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]). If a taskfd of /proc/<pid>/task/<tid> is passed
EOPNOTSUPP will be returned to give userspace a way to detect when I add
support for such taskfds (cf. [10]).

/* naming */
The original prefix of the syscall was "procfd_". However, it has been
pointed out that since this syscall will eventually operate on both
processes and threads the name should reflect this (cf. [12]). The best
possible candidate even from a userspace perspective seems to be "task".
Although "task" is used internally we are alreday deviating from POSIX by
using file descriptors to processes in the first place so it seems fine to
use the "taskfd_" prefix.

The name taskfd_send_signal() was also chosen to reflect the fact that it
takes on the job of multiple syscalls. It is intentional that the name is
not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer
because it might imply that taskfd_send_signal() is only a replacement for
kill(2) and not the latter because it is a hazzle to remember the correct
spelling (especially for non-native speakers) and because it is not
descriptive enough of what the syscall actually does. The name
"taskfd_send_signal" makes it very clear that its job is to send signals.

/* O_PATH file descriptors */
taskfds opened as O_PATH fds cannot be used to send signals to a process
(cf. [2]). Signaling processes through taskfds is the equivalent of writing
to a file. Thus, this is not an operation that operates "purely at the
file descriptor level" as required by the open(2) manpage.

/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]).

/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).

/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls (cf.
[7]). The compat syscall handling is now done in kernel/signal.c itself by
adding __copy_siginfo_from_user_generic() which lets us avoid compat
syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.

/* testing */
This patch was tested on x64 and x86.

/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:

 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <unistd.h>

 static inline int do_taskfd_send_signal(int taskfd, int sig, siginfo_t *info,
                                         unsigned int flags)
 {
 #ifdef __NR_taskfd_send_signal
         return syscall(__NR_taskfd_send_signal, taskfd, sig, info, flags);
 #else
         return -ENOSYS;
 #endif
 }

 int main(int argc, char *argv[])
 {
         int fd, ret, saved_errno, sig;

         if (argc < 3)
                 exit(EXIT_FAILURE);

         fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
         if (fd < 0) {
                 printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                 exit(EXIT_FAILURE);
         }

         sig = atoi(argv[2]);

         printf("Sending signal %d to process %s\n", sig, argv[1]);
         ret = do_taskfd_send_signal(fd, sig, NULL, 0);

         saved_errno = errno;
         close(fd);
         errno = saved_errno;

         if (ret < 0) {
                 printf("%s - Failed to send signal %d to process %s\n",
                        strerror(errno), sig, argv[1]);
                 exit(EXIT_FAILURE);
         }

         exit(EXIT_SUCCESS);
 }

[1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
Changelog:
v4:
- updated asciinema to use "taskfd_" prefix
- s/procfd_send_signal/taskfd_send_signal/g
- s/proc_is_tgid_procfd/tgid_taskfd_to_pid/b
- s/proc_is_tid_procfd/tid_taskfd_to_pid/b
- s/__copy_siginfo_from_user_generic/__copy_siginfo_from_user_any/g
- make it clear that __copy_siginfo_from_user_any() is a workaround caused
  by a bug in the original implementation of rt_sigqueueinfo()
- when spoofing signals turn them into regular kill signals if si_code is
  set to SI_USER
- make proc_is_t{g}id_procfd() return struct pid to allow proc_pid() to
  stay private to fs/proc/
v3:
- add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
- s/procfd_signal/procfd_send_signal/g
- change type of flags argument from int to unsigned int
- add comment about what happens to zombies
- add proc_is_tid_procfd()
- return EOPNOTSUPP when /proc/<pid>/task/<tid> fd is passed so userspace
  has a way of knowing that tidfds are not supported currently.
v2:
- define __NR_procfd_signal in unistd.h
- wire up compat syscall
- s/proc_is_procfd/proc_is_tgid_procfd/g
- provide stubs when CONFIG_PROC_FS=n
- move proc_pid() to linux/proc_fs.h header
- use proc_pid() to grab struct pid from /proc/<pid> fd
v1:
- patch introduced
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/proc/base.c                         |  20 +++-
 include/linux/proc_fs.h                |  12 +++
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 kernel/signal.c                        | 132 +++++++++++++++++++++++--
 7 files changed, 164 insertions(+), 9 deletions(-)

Comments

Florian Weimer Dec. 6, 2018, 12:30 p.m. UTC | #1
* Christian Brauner:

> /* zombies */
> Zombies can be signaled just as any other process. No special error will be
> reported since a zombie state is an unreliable state (cf. [3]).

I still disagree with this analysis.  If I know that the target process
is still alive, and it is not, this is a persistent error condition
which can be reliably reported.  Given that someone might send SIGKILL
to the process behind my back, detecting this error condition could be
useful.

Rest looks good to me (with the usual caveats).

Thanks,
Florian
Jürg Billeter Dec. 6, 2018, 12:45 p.m. UTC | #2
On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > /* zombies */
> > Zombies can be signaled just as any other process. No special error will be
> > reported since a zombie state is an unreliable state (cf. [3]).
> 
> I still disagree with this analysis.  If I know that the target process
> is still alive, and it is not, this is a persistent error condition
> which can be reliably reported.  Given that someone might send SIGKILL
> to the process behind my back, detecting this error condition could be
> useful.

As I understand it, kill() behaves the same way. I think it's good that
this new syscall keeps the behavior as close as possible to kill().
E.g., this would allow emulating kill() (or a higher level API
equivalent) on top of taskfds without subtle differences in behavior.

As the new syscall supports flags, we could consider introducing a flag
that changes the behavior in the zombie case. However, I think that
should be a separate discussion (after merge of the syscall) and the
default behavior makes sense as is.

Jürg
Christian Brauner Dec. 6, 2018, 12:53 p.m. UTC | #3
On Thu, Dec 06, 2018 at 01:30:19PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > /* zombies */
> > Zombies can be signaled just as any other process. No special error will be
> > reported since a zombie state is an unreliable state (cf. [3]).
> 
> I still disagree with this analysis.  If I know that the target process
> is still alive, and it is not, this is a persistent error condition
> which can be reliably reported.  Given that someone might send SIGKILL
> to the process behind my back, detecting this error condition could be
> useful.

Apart from my objection that this is not actually a reliable state
because of timing issues between e.g. calling wait and a process exiting
I have two more concerns and one helpful suggestion.
First, this is hooking pretty deep into kernel internals. So far
EXIT_ZOMBIE is only exposed in kernel/exit.c and I don't see enough
value to drag all of this into kernel/signal.c
Second, all other signal syscalls don't do report errors when signaling
to zombies as well. It would be odd if this one suddenly did.
Third, if this really becomes such a big issue for userspace in the
future that we want to do that work then we can add a flag like
TASKFD_DETECT_ZOMBIE (or some such name) that will allow userspace to
get an error back when signaling a zombie.
As far as I'm concerned, this is out of scope for an initial
implementation. We are going to use fds for tasks that's enough
excitement for one patchset!

> 
> Rest looks good to me (with the usual caveats).

I take it that's your way of saying Acked-by? :)

Christian
Florian Weimer Dec. 6, 2018, 1:12 p.m. UTC | #4
* Jürg Billeter:

> On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
>> * Christian Brauner:
>> 
>> > /* zombies */
>> > Zombies can be signaled just as any other process. No special error will be
>> > reported since a zombie state is an unreliable state (cf. [3]).
>> 
>> I still disagree with this analysis.  If I know that the target process
>> is still alive, and it is not, this is a persistent error condition
>> which can be reliably reported.  Given that someone might send SIGKILL
>> to the process behind my back, detecting this error condition could be
>> useful.
>
> As I understand it, kill() behaves the same way. I think it's good that
> this new syscall keeps the behavior as close as possible to kill().

No, kill does not behave in this way because the PID can be reused.
The error condition is not stable there.

Thanks,
Florian
Florian Weimer Dec. 6, 2018, 1:17 p.m. UTC | #5
* Christian Brauner:

> On Thu, Dec 06, 2018 at 01:30:19PM +0100, Florian Weimer wrote:
>> * Christian Brauner:
>> 
>> > /* zombies */
>> > Zombies can be signaled just as any other process. No special error will be
>> > reported since a zombie state is an unreliable state (cf. [3]).
>> 
>> I still disagree with this analysis.  If I know that the target process
>> is still alive, and it is not, this is a persistent error condition
>> which can be reliably reported.  Given that someone might send SIGKILL
>> to the process behind my back, detecting this error condition could be
>> useful.
>
> Apart from my objection that this is not actually a reliable state
> because of timing issues between e.g. calling wait and a process
> exiting

The point is that if you are in an error state, the error state does not
go away, *especially* if you do not expect the process to terminate and
have not arranged for something calling waitpid on the PID.

> I have two more concerns and one helpful suggestion.
> First, this is hooking pretty deep into kernel internals. So far
> EXIT_ZOMBIE is only exposed in kernel/exit.c and I don't see enough
> value to drag all of this into kernel/signal.c
> Second, all other signal syscalls don't do report errors when signaling
> to zombies as well.

They cannot do this reliably because the error state is not persistent:
the PID can be reused.  So for the legacy interface, a difference in
error signaling would just have encouraged a bad programming model.

> It would be odd if this one suddenly did.

I don't think so.  My point is that the FD-based mechanism finally
allows to cope with this in a reasonable way.

> Third, if this really becomes such a big issue for userspace in the
> future that we want to do that work then we can add a flag like
> TASKFD_DETECT_ZOMBIE (or some such name) that will allow userspace to
> get an error back when signaling a zombie.

I can live with that.

Thanks,
Florian
Jürg Billeter Dec. 6, 2018, 1:18 p.m. UTC | #6
On Thu, 2018-12-06 at 14:12 +0100, Florian Weimer wrote:
> * Jürg Billeter:
> 
> > On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
> > > * Christian Brauner:
> > > 
> > > > /* zombies */
> > > > Zombies can be signaled just as any other process. No special error will be
> > > > reported since a zombie state is an unreliable state (cf. [3]).
> > > 
> > > I still disagree with this analysis.  If I know that the target process
> > > is still alive, and it is not, this is a persistent error condition
> > > which can be reliably reported.  Given that someone might send SIGKILL
> > > to the process behind my back, detecting this error condition could be
> > > useful.
> > 
> > As I understand it, kill() behaves the same way. I think it's good that
> > this new syscall keeps the behavior as close as possible to kill().
> 
> No, kill does not behave in this way because the PID can be reused.
> The error condition is not stable there.

The PID can't be reused as long as it's a zombie. It can only be reused
when it has been wait()ed for. Or am I misunderstanding something?

Jürg
Florian Weimer Dec. 6, 2018, 1:20 p.m. UTC | #7
* Jürg Billeter:

> On Thu, 2018-12-06 at 14:12 +0100, Florian Weimer wrote:
>> * Jürg Billeter:
>> 
>> > On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
>> > > * Christian Brauner:
>> > > 
>> > > > /* zombies */
>> > > > Zombies can be signaled just as any other process. No special error will be
>> > > > reported since a zombie state is an unreliable state (cf. [3]).
>> > > 
>> > > I still disagree with this analysis.  If I know that the target process
>> > > is still alive, and it is not, this is a persistent error condition
>> > > which can be reliably reported.  Given that someone might send SIGKILL
>> > > to the process behind my back, detecting this error condition could be
>> > > useful.
>> > 
>> > As I understand it, kill() behaves the same way. I think it's good that
>> > this new syscall keeps the behavior as close as possible to kill().
>> 
>> No, kill does not behave in this way because the PID can be reused.
>> The error condition is not stable there.
>
> The PID can't be reused as long as it's a zombie. It can only be reused
> when it has been wait()ed for. Or am I misunderstanding something?

Hmm, that's a fair point.  So the original interface is just broken.

Florian
Eric W. Biederman Dec. 6, 2018, 1:40 p.m. UTC | #8
Florian Weimer <fweimer@redhat.com> writes:

> * Jürg Billeter:
>
>> On Thu, 2018-12-06 at 13:30 +0100, Florian Weimer wrote:
>>> * Christian Brauner:
>>> 
>>> > /* zombies */
>>> > Zombies can be signaled just as any other process. No special error will be
>>> > reported since a zombie state is an unreliable state (cf. [3]).
>>> 
>>> I still disagree with this analysis.  If I know that the target process
>>> is still alive, and it is not, this is a persistent error condition
>>> which can be reliably reported.  Given that someone might send SIGKILL
>>> to the process behind my back, detecting this error condition could be
>>> useful.
>>
>> As I understand it, kill() behaves the same way. I think it's good that
>> this new syscall keeps the behavior as close as possible to kill().
>
> No, kill does not behave in this way because the PID can be reused.
> The error condition is not stable there.

I am not quite certain what is being discussed here.

Posix says:
    [ESRCH]
        No process or process group can be found corresponding to that specified by pid.

The linux man page says:
       ESRCH  The process or process group does not exist.  Note that an
              existing process might be a zombie, a process that has terminated
              execution, but has not yet been wait(2)ed for.

What happens with this new system call is exactly the linux behavior.
Success is returned until the specified process or thread group has
been waited for.

The only difference from the behavior of kill is that because the
association between the file descriptor and the pid is not affected by
pid reuse once ESRCH is returned ESRCH will always be returned.

Floriam are you seeing a problem with this behavior or the way Christian
was describing it?

Eric
Florian Weimer Dec. 6, 2018, 1:44 p.m. UTC | #9
* Eric W. Biederman:

> Floriam are you seeing a problem with this behavior or the way Christian
> was describing it?

My hope is that you could use taskfd_send_signal one day to send a
signal to a process which you *known* (based on how you've written your
application) should be running and not in a zombie state, and get back
an error if it has exited.

If you get this error, only then you wait on the process, using the file
descriptor you have, and run some recovery code.

Wouldn't that be a reasonable approach once we've got task descriptors?

Thanks,
Florian
Aleksa Sarai Dec. 6, 2018, 2:27 p.m. UTC | #10
On 2018-12-06, Florian Weimer <fweimer@redhat.com> wrote:
> > Floriam are you seeing a problem with this behavior or the way Christian
> > was describing it?
> 
> My hope is that you could use taskfd_send_signal one day to send a
> signal to a process which you *known* (based on how you've written your
> application) should be running and not in a zombie state, and get back
> an error if it has exited.

You can detect if a process is a zombie via the procfd by observing "stat"
(the state will be "Z"). Personally I'm with Christian that we should
maintain compatibility with the rest of the signal APIs -- sending a
signal to a zombie is a defined (though no-op) concept.

I don't understand why sending a signal should fail in this case -- a
zombie is not the same as a non-existent process. If we need to have a
way of checking whether something is a zombie (other than through
"stat") we can add another method (or flag if it has to be atomic) in
the future. And given the complexity of doing it, I'm even less of a
fan of doing it in the initial patchset.

> If you get this error, only then you wait on the process, using the file
> descriptor you have, and run some recovery code.
> 
> Wouldn't that be a reasonable approach once we've got task descriptors?

I think taskfd_wait() is something we'll need eventually, but I don't
think that making taskfd_send_signal() do something that is contrary to
existing kill(2) interfaces (making it so that transitioning to it won't
be seamless), 

What would the error be? ESRCH would be _very_ wrong, given that it
would confuse the two states (zombie/dead-for-real) and would lead to
weird cases where fstatat(taskfd) succeeds but taskfd_send_signal(2)
fails.
Eric W. Biederman Dec. 6, 2018, 2:46 p.m. UTC | #11
Florian Weimer <fweimer@redhat.com> writes:

> * Eric W. Biederman:
>
>> Floriam are you seeing a problem with this behavior or the way Christian
>> was describing it?
>
> My hope is that you could use taskfd_send_signal one day to send a
> signal to a process which you *known* (based on how you've written your
> application) should be running and not in a zombie state, and get back
> an error if it has exited.
>
> If you get this error, only then you wait on the process, using the file
> descriptor you have, and run some recovery code.
>
> Wouldn't that be a reasonable approach once we've got task descriptors?

Getting an error back if the target was a zombie does seem reasonable,
as in principle it is an easy thing to notice, and post zombie once the
process has been reaped we definitely get an error back.

I also agree that it sounds like an extension, as changing the default
would violate the princile of least surprise.

Eric
Eric W. Biederman Dec. 6, 2018, 3:01 p.m. UTC | #12
Christian Brauner <christian@brauner.io> writes:

> The kill() syscall operates on process identifiers (pid). After a process
> has exited its pid can be reused by another process. If a caller sends a
> signal to a reused pid it will end up signaling the wrong process. This
> issue has often surfaced and there has been a push [1] to address this
> problem.
>
> This patch uses file descriptors (fd) from proc/<pid> as stable handles on
> struct pid. Even if a pid is recycled the handle will not change. The fd
> can be used to send signals to the process it refers to.
> Thus, the new syscall taskfd_send_signal() is introduced to solve this
> problem. Instead of pids it operates on process fds (taskfd).

I am not yet thrilled with the taskfd naming.
Is there any plan to support sesssions and process groups?

I am concerned about using kill_pid_info.  It does this:


	rcu_read_lock();
	p = pid_task(pid, PIDTYPE_PID);
	if (p)
		error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
	rcu_read_unlock();

That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
compatibility.  For new interfaces I would strongly prefer pid_task(PIDTYPE_TGID).


Eric

>
> /* prototype and argument /*
> long taskfd_send_signal(int taskfd, int sig, siginfo_t *info, unsigned int flags);
>
> In addition to the taskfd and signal argument it takes an additional
> siginfo_t and flags argument. If the siginfo_t argument is NULL then
> taskfd_send_signal() behaves like kill(). If it is not NULL
> taskfd_send_signal() behaves like rt_sigqueueinfo().
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0. Failing to do so will cause EINVAL.
>
> /* taskfd_send_signal() replaces multiple pid-based syscalls */
> The taskfd_send_signal() syscall currently takes on the job of the
> following syscalls that operate on pids:
> - kill(2)
> - rt_sigqueueinfo(2)
> The syscall is defined in such a way that it can also operate on thread fds
> instead of process fds. In a future patchset I will extend it to operate on
> taskfds from /proc/<pid>/task/<tid> at which point it will additionally
> take on the job of:
> - tgkill(2)
> - rt_tgsigqueueinfo(2)
> Right now this is intentionally left out to keep this patchset as simple as
> possible (cf. [4]). If a taskfd of /proc/<pid>/task/<tid> is passed
> EOPNOTSUPP will be returned to give userspace a way to detect when I add
> support for such taskfds (cf. [10]).
>
> /* naming */
> The original prefix of the syscall was "procfd_". However, it has been
> pointed out that since this syscall will eventually operate on both
> processes and threads the name should reflect this (cf. [12]). The best
> possible candidate even from a userspace perspective seems to be "task".
> Although "task" is used internally we are alreday deviating from POSIX by
> using file descriptors to processes in the first place so it seems fine to
> use the "taskfd_" prefix.
>
> The name taskfd_send_signal() was also chosen to reflect the fact that it
> takes on the job of multiple syscalls. It is intentional that the name is
> not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer
> because it might imply that taskfd_send_signal() is only a replacement for
> kill(2) and not the latter because it is a hazzle to remember the correct
> spelling (especially for non-native speakers) and because it is not
> descriptive enough of what the syscall actually does. The name
> "taskfd_send_signal" makes it very clear that its job is to send signals.
>
> /* O_PATH file descriptors */
> taskfds opened as O_PATH fds cannot be used to send signals to a process
> (cf. [2]). Signaling processes through taskfds is the equivalent of writing
> to a file. Thus, this is not an operation that operates "purely at the
> file descriptor level" as required by the open(2) manpage.
>
> /* zombies */
> Zombies can be signaled just as any other process. No special error will be
> reported since a zombie state is an unreliable state (cf. [3]).
>
> /* cross-namespace signals */
> The patch currently enforces that the signaler and signalee either are in
> the same pid namespace or that the signaler's pid namespace is an ancestor
> of the signalee's pid namespace. This is done for the sake of simplicity
> and because it is unclear to what values certain members of struct
> siginfo_t would need to be set to (cf. [5], [6]).
>
> /* compat syscalls */
> It became clear that we would like to avoid adding compat syscalls (cf.
> [7]). The compat syscall handling is now done in kernel/signal.c itself by
> adding __copy_siginfo_from_user_generic() which lets us avoid compat
> syscalls (cf. [8]). It should be noted that the addition of
> __copy_siginfo_from_user_any() is caused by a bug in the original
> implementation of rt_sigqueueinfo(2) (cf. 12).
> With upcoming rework for syscall handling things might improve
> significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
> any additional callers.
>
> /* testing */
> This patch was tested on x64 and x86.
>
> /* userspace usage */
> An asciinema recording for the basic functionality can be found under [9].
> With this patch a process can be killed via:
>
>  #define _GNU_SOURCE
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <signal.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/stat.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>  #include <unistd.h>
>
>  static inline int do_taskfd_send_signal(int taskfd, int sig, siginfo_t *info,
>                                          unsigned int flags)
>  {
>  #ifdef __NR_taskfd_send_signal
>          return syscall(__NR_taskfd_send_signal, taskfd, sig, info, flags);
>  #else
>          return -ENOSYS;
>  #endif
>  }
>
>  int main(int argc, char *argv[])
>  {
>          int fd, ret, saved_errno, sig;
>
>          if (argc < 3)
>                  exit(EXIT_FAILURE);
>
>          fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
>          if (fd < 0) {
>                  printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
>                  exit(EXIT_FAILURE);
>          }
>
>          sig = atoi(argv[2]);
>
>          printf("Sending signal %d to process %s\n", sig, argv[1]);
>          ret = do_taskfd_send_signal(fd, sig, NULL, 0);
>
>          saved_errno = errno;
>          close(fd);
>          errno = saved_errno;
>
>          if (ret < 0) {
>                  printf("%s - Failed to send signal %d to process %s\n",
>                         strerror(errno), sig, argv[1]);
>                  exit(EXIT_FAILURE);
>          }
>
>          exit(EXIT_SUCCESS);
>  }
>
> [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
> [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
> [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
> [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
> [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
> [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
> [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
> [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
> [10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
> [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
> [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> ---
> Changelog:
> v4:
> - updated asciinema to use "taskfd_" prefix
> - s/procfd_send_signal/taskfd_send_signal/g
> - s/proc_is_tgid_procfd/tgid_taskfd_to_pid/b
> - s/proc_is_tid_procfd/tid_taskfd_to_pid/b
> - s/__copy_siginfo_from_user_generic/__copy_siginfo_from_user_any/g
> - make it clear that __copy_siginfo_from_user_any() is a workaround caused
>   by a bug in the original implementation of rt_sigqueueinfo()
> - when spoofing signals turn them into regular kill signals if si_code is
>   set to SI_USER
> - make proc_is_t{g}id_procfd() return struct pid to allow proc_pid() to
>   stay private to fs/proc/
> v3:
> - add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
> - s/procfd_signal/procfd_send_signal/g
> - change type of flags argument from int to unsigned int
> - add comment about what happens to zombies
> - add proc_is_tid_procfd()
> - return EOPNOTSUPP when /proc/<pid>/task/<tid> fd is passed so userspace
>   has a way of knowing that tidfds are not supported currently.
> v2:
> - define __NR_procfd_signal in unistd.h
> - wire up compat syscall
> - s/proc_is_procfd/proc_is_tgid_procfd/g
> - provide stubs when CONFIG_PROC_FS=n
> - move proc_pid() to linux/proc_fs.h header
> - use proc_pid() to grab struct pid from /proc/<pid> fd
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/proc/base.c                         |  20 +++-
>  include/linux/proc_fs.h                |  12 +++
>  include/linux/syscalls.h               |   3 +
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  kernel/signal.c                        | 132 +++++++++++++++++++++++--
>  7 files changed, 164 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..7efb63fd0617 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
>  385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
>  386	i386	rseq			sys_rseq			__ia32_sys_rseq
> +387	i386	taskfd_send_signal	sys_taskfd_send_signal		__ia32_sys_taskfd_send_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..be894f4a84e9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332	common	statx			__x64_sys_statx
>  333	common	io_pgetevents		__x64_sys_io_pgetevents
>  334	common	rseq			__x64_sys_rseq
> +335	common	taskfd_send_signal	__x64_sys_taskfd_send_signal
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ce3465479447..b8b88bfee455 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -716,8 +716,6 @@ static int proc_pid_permission(struct inode *inode, int mask)
>  	return generic_permission(inode, mask);
>  }
>  
> -
> -
>  static const struct inode_operations proc_def_inode_operations = {
>  	.setattr	= proc_setattr,
>  };
> @@ -3038,6 +3036,15 @@ static const struct file_operations proc_tgid_base_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +struct pid *tgid_taskfd_to_pid(const struct file *file)
> +{
> +	if (!d_is_dir(file->f_path.dentry) ||
> +	    (file->f_op != &proc_tgid_base_operations))
> +		return ERR_PTR(-EBADF);
> +
> +	return proc_pid(file_inode(file));
> +}
> +
>  static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
>  {
>  	return proc_pident_lookup(dir, dentry,
> @@ -3422,6 +3429,15 @@ static const struct file_operations proc_tid_base_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +struct pid *tid_taskfd_to_pid(const struct file *file)
> +{
> +	if (!d_is_dir(file->f_path.dentry) ||
> +	    (file->f_op != &proc_tid_base_operations))
> +		return ERR_PTR(-EBADF);
> +
> +	return proc_pid(file_inode(file));
> +}
> +
>  static const struct inode_operations proc_tid_base_inode_operations = {
>  	.lookup		= proc_tid_base_lookup,
>  	.getattr	= pid_getattr,
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index d0e1f1522a78..96817415c420 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
>  						    int (*show)(struct seq_file *, void *),
>  						    proc_write_t write,
>  						    void *data);
> +extern struct pid *tgid_taskfd_to_pid(const struct file *file);
> +extern struct pid *tid_taskfd_to_pid(const struct file *file);
>  
>  #else /* CONFIG_PROC_FS */
>  
> @@ -114,6 +116,16 @@ static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p
>  #define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
>  #define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
>  
> +static inline struct pid *tgid_taskfd_to_pid(const struct file *file)
> +{
> +	return ERR_PTR(-EBADF);
> +}
> +
> +static inline struct pid *tid_taskfd_to_pid(const struct file *file)
> +{
> +	return ERR_PTR(-EBADF);
> +}
> +
>  #endif /* CONFIG_PROC_FS */
>  
>  struct net;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2ac3d13a915b..5ffe194ef29b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
>  			  unsigned mask, struct statx __user *buffer);
>  asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
>  			 int flags, uint32_t sig);
> +asmlinkage long sys_taskfd_send_signal(int taskfd, int sig,
> +				       siginfo_t __user *info,
> +				       unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 538546edbfbd..9343dca63fd9 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -738,9 +738,11 @@ __SYSCALL(__NR_statx,     sys_statx)
>  __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
>  #define __NR_rseq 293
>  __SYSCALL(__NR_rseq, sys_rseq)
> +#define __NR_taskfd_send_signal 294
> +__SYSCALL(__NR_taskfd_send_signal, sys_taskfd_send_signal)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 294
> +#define __NR_syscalls 295
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9a32bc2088c9..a00a4bcb7605 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -19,7 +19,9 @@
>  #include <linux/sched/task.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/sched/cputime.h>
> +#include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/proc_fs.h>
>  #include <linux/tty.h>
>  #include <linux/binfmts.h>
>  #include <linux/coredump.h>
> @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
>  }
>  #endif
>  
> +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
> +{
> +	clear_siginfo(info);
> +	info->si_signo = sig;
> +	info->si_errno = 0;
> +	info->si_code = SI_USER;
> +	info->si_pid = task_tgid_vnr(current);
> +	info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
> +}
> +
>  /**
>   *  sys_kill - send a signal to a process
>   *  @pid: the PID of the process
> @@ -3295,16 +3307,124 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
>  {
>  	struct kernel_siginfo info;
>  
> -	clear_siginfo(&info);
> -	info.si_signo = sig;
> -	info.si_errno = 0;
> -	info.si_code = SI_USER;
> -	info.si_pid = task_tgid_vnr(current);
> -	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> +	prepare_kill_siginfo(sig, &info);
>  
>  	return kill_something_info(sig, &info, pid);
>  }
>  
> +/*
> + * Verify that the signaler and signalee either are in the same pid namespace
> + * or that the signaler's pid namespace is an ancestor of the signalee's pid
> + * namespace.
> + */
> +static bool may_signal_taskfd(struct pid *pid)
> +{
> +	struct pid_namespace *active = task_active_pid_ns(current);
> +	struct pid_namespace *p = ns_of_pid(pid);
> +
> +	for (;;) {
> +		if (!p)
> +			return false;
> +		if (p == active)
> +			break;
> +		p = p->parent;
> +	}
> +
> +	return true;
> +}
> +
> +static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> +{
> +#ifdef CONFIG_COMPAT
> +	/*
> +	 * Avoid hooking up compat syscalls and instead handle necessary
> +	 * conversions here. Note, this is a stop-gap measure and should not be
> +	 * considered a generic solution.
> +	 */
> +	if (in_compat_syscall())
> +		return copy_siginfo_from_user32(
> +			kinfo, (struct compat_siginfo __user *)info);
> +#endif
> +	return copy_siginfo_from_user(kinfo, info);
> +}
> +
> +/**
> + *  sys_taskfd_send_signal - send a signal to a process through a task file
> + *                           descriptor
> + *  @taskfd: the file descriptor of the process
> + *  @sig:    signal to be sent
> + *  @info:   the signal info
> + *  @flags:  future flags to be passed
> + *
> + *  Return: 0 on success, negative errno on failure
> + */
> +SYSCALL_DEFINE4(taskfd_send_signal, int, taskfd, int, sig,
> +		siginfo_t __user *, info, unsigned int, flags)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	kernel_siginfo_t kinfo;
> +
> +	/* Enforce flags be set to 0 until we add an extension. */
> +	if (flags)
> +		return -EINVAL;
> +
> +	f = fdget_raw(taskfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = tid_taskfd_to_pid(f.file);
> +	if (!IS_ERR(pid)) {
> +		/*
> +		 * Give userspace a way to detect /proc/<pid>/task/<tid>
> +		 * support when we add it.
> +		 */
> +		ret = -EOPNOTSUPP;
> +		goto err;
> +	}
> +
> +	/* Is this a procfd? */
> +	pid = tgid_taskfd_to_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto err;
> +	}
> +
> +	ret = -EINVAL;
> +	if (!may_signal_taskfd(pid))
> +		goto err;
> +
> +	if (info) {
> +		ret = copy_siginfo_from_user_any(&kinfo, info);
> +		if (unlikely(ret))
> +			goto err;
> +
> +		ret = -EINVAL;
> +		if (unlikely(sig != kinfo.si_signo))
> +			goto err;
> +
> +		if ((task_pid(current) != pid) &&
> +		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> +			/* Only allow sending arbitrary signals to yourself. */
> +			ret = -EPERM;
> +			if (kinfo.si_code != SI_USER)
> +				goto err;
> +
> +			/* Turn this into a regular kill signal. */
> +			prepare_kill_siginfo(sig, &kinfo);
> +		}
> +	} else {
> +		prepare_kill_siginfo(sig, &kinfo);
> +	}
> +
> +	ret = kill_pid_info(sig, &kinfo, pid);
> +
> +err:
> +	fdput(f);
> +	return ret;
> +}
> +
>  static int
>  do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
>  {
Daniel Colascione Dec. 6, 2018, 4:17 p.m. UTC | #13
On Thu, Dec 6, 2018 at 7:02 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Christian Brauner <christian@brauner.io> writes:
>
> > The kill() syscall operates on process identifiers (pid). After a process
> > has exited its pid can be reused by another process. If a caller sends a
> > signal to a reused pid it will end up signaling the wrong process. This
> > issue has often surfaced and there has been a push [1] to address this
> > problem.
> >
> > This patch uses file descriptors (fd) from proc/<pid> as stable handles on
> > struct pid. Even if a pid is recycled the handle will not change. The fd
> > can be used to send signals to the process it refers to.
> > Thus, the new syscall taskfd_send_signal() is introduced to solve this
> > problem. Instead of pids it operates on process fds (taskfd).
>
> I am not yet thrilled with the taskfd naming.

Both the old and new names were fine. Do you want to suggest a name at
this point? You can't just say "I don't like this. Guess again"
forever.

> Is there any plan to support sesssions and process groups?

Such a thing could be added with flags in the future. Why complicate this patch?

> I am concerned about using kill_pid_info.  It does this:
>
>
>         rcu_read_lock();
>         p = pid_task(pid, PIDTYPE_PID);
>         if (p)
>                 error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
>         rcu_read_unlock();
>
> That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
> compatibility.  For new interfaces I would strongly prefer pid_task(PIDTYPE_TGID).

What is the bug that PIDTYPE_PID preserves?
Christian Brauner Dec. 6, 2018, 5:14 p.m. UTC | #14
On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
>Christian Brauner <christian@brauner.io> writes:
>
>> The kill() syscall operates on process identifiers (pid). After a
>process
>> has exited its pid can be reused by another process. If a caller
>sends a
>> signal to a reused pid it will end up signaling the wrong process.
>This
>> issue has often surfaced and there has been a push [1] to address
>this
>> problem.
>>
>> This patch uses file descriptors (fd) from proc/<pid> as stable
>handles on
>> struct pid. Even if a pid is recycled the handle will not change. The
>fd
>> can be used to send signals to the process it refers to.
>> Thus, the new syscall taskfd_send_signal() is introduced to solve
>this
>> problem. Instead of pids it operates on process fds (taskfd).
>
>I am not yet thrilled with the taskfd naming.

Userspace cares about what does this thing operate on?
It operates on processes and threads.
The most common term people use is "task".
I literally "polled" ten non-kernel people for that purpose and asked:
"What term would you use to refer to a process and a thread?"
Turns out it is task. So if find this pretty apt.
Additionally, the proc manpage uses task in the exact same way (also see the commit message).
If you can get behind that name even if feeling it's not optimal it would be great.

>Is there any plan to support sesssions and process groups?

I don't see the necessity.
As I said in previous mails:
we can emulate all interesting signal syscalls with this one.
We succeeded in doing that.
No need to get more fancy.
There's currently no obvious need for more features.
Features should be implemented when someone actually needs them.

>
>I am concerned about using kill_pid_info.  It does this:
>
>
>	rcu_read_lock();
>	p = pid_task(pid, PIDTYPE_PID);
>	if (p)
>		error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
>	rcu_read_unlock();
>
>That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
>compatibility.  For new interfaces I would strongly prefer
>pid_task(PIDTYPE_TGID).
>
>
>Eric
>
>>
>> /* prototype and argument /*
>> long taskfd_send_signal(int taskfd, int sig, siginfo_t *info,
>unsigned int flags);
>>
>> In addition to the taskfd and signal argument it takes an additional
>> siginfo_t and flags argument. If the siginfo_t argument is NULL then
>> taskfd_send_signal() behaves like kill(). If it is not NULL
>> taskfd_send_signal() behaves like rt_sigqueueinfo().
>> The flags argument is added to allow for future extensions of this
>syscall.
>> It currently needs to be passed as 0. Failing to do so will cause
>EINVAL.
>>
>> /* taskfd_send_signal() replaces multiple pid-based syscalls */
>> The taskfd_send_signal() syscall currently takes on the job of the
>> following syscalls that operate on pids:
>> - kill(2)
>> - rt_sigqueueinfo(2)
>> The syscall is defined in such a way that it can also operate on
>thread fds
>> instead of process fds. In a future patchset I will extend it to
>operate on
>> taskfds from /proc/<pid>/task/<tid> at which point it will
>additionally
>> take on the job of:
>> - tgkill(2)
>> - rt_tgsigqueueinfo(2)
>> Right now this is intentionally left out to keep this patchset as
>simple as
>> possible (cf. [4]). If a taskfd of /proc/<pid>/task/<tid> is passed
>> EOPNOTSUPP will be returned to give userspace a way to detect when I
>add
>> support for such taskfds (cf. [10]).
>>
>> /* naming */
>> The original prefix of the syscall was "procfd_". However, it has
>been
>> pointed out that since this syscall will eventually operate on both
>> processes and threads the name should reflect this (cf. [12]). The
>best
>> possible candidate even from a userspace perspective seems to be
>"task".
>> Although "task" is used internally we are alreday deviating from
>POSIX by
>> using file descriptors to processes in the first place so it seems
>fine to
>> use the "taskfd_" prefix.
>>
>> The name taskfd_send_signal() was also chosen to reflect the fact
>that it
>> takes on the job of multiple syscalls. It is intentional that the
>name is
>> not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
>fomer
>> because it might imply that taskfd_send_signal() is only a
>replacement for
>> kill(2) and not the latter because it is a hazzle to remember the
>correct
>> spelling (especially for non-native speakers) and because it is not
>> descriptive enough of what the syscall actually does. The name
>> "taskfd_send_signal" makes it very clear that its job is to send
>signals.
>>
>> /* O_PATH file descriptors */
>> taskfds opened as O_PATH fds cannot be used to send signals to a
>process
>> (cf. [2]). Signaling processes through taskfds is the equivalent of
>writing
>> to a file. Thus, this is not an operation that operates "purely at
>the
>> file descriptor level" as required by the open(2) manpage.
>>
>> /* zombies */
>> Zombies can be signaled just as any other process. No special error
>will be
>> reported since a zombie state is an unreliable state (cf. [3]).
>>
>> /* cross-namespace signals */
>> The patch currently enforces that the signaler and signalee either
>are in
>> the same pid namespace or that the signaler's pid namespace is an
>ancestor
>> of the signalee's pid namespace. This is done for the sake of
>simplicity
>> and because it is unclear to what values certain members of struct
>> siginfo_t would need to be set to (cf. [5], [6]).
>>
>> /* compat syscalls */
>> It became clear that we would like to avoid adding compat syscalls
>(cf.
>> [7]). The compat syscall handling is now done in kernel/signal.c
>itself by
>> adding __copy_siginfo_from_user_generic() which lets us avoid compat
>> syscalls (cf. [8]). It should be noted that the addition of
>> __copy_siginfo_from_user_any() is caused by a bug in the original
>> implementation of rt_sigqueueinfo(2) (cf. 12).
>> With upcoming rework for syscall handling things might improve
>> significantly (cf. [11]) and __copy_siginfo_from_user_any() will not
>gain
>> any additional callers.
>>
>> /* testing */
>> This patch was tested on x64 and x86.
>>
>> /* userspace usage */
>> An asciinema recording for the basic functionality can be found under
>[9].
>> With this patch a process can be killed via:
>>
>>  #define _GNU_SOURCE
>>  #include <errno.h>
>>  #include <fcntl.h>
>>  #include <signal.h>
>>  #include <stdio.h>
>>  #include <stdlib.h>
>>  #include <string.h>
>>  #include <sys/stat.h>
>>  #include <sys/syscall.h>
>>  #include <sys/types.h>
>>  #include <unistd.h>
>>
>>  static inline int do_taskfd_send_signal(int taskfd, int sig,
>siginfo_t *info,
>>                                          unsigned int flags)
>>  {
>>  #ifdef __NR_taskfd_send_signal
>>          return syscall(__NR_taskfd_send_signal, taskfd, sig, info,
>flags);
>>  #else
>>          return -ENOSYS;
>>  #endif
>>  }
>>
>>  int main(int argc, char *argv[])
>>  {
>>          int fd, ret, saved_errno, sig;
>>
>>          if (argc < 3)
>>                  exit(EXIT_FAILURE);
>>
>>          fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
>>          if (fd < 0) {
>>                  printf("%s - Failed to open \"%s\"\n",
>strerror(errno), argv[1]);
>>                  exit(EXIT_FAILURE);
>>          }
>>
>>          sig = atoi(argv[2]);
>>
>>          printf("Sending signal %d to process %s\n", sig, argv[1]);
>>          ret = do_taskfd_send_signal(fd, sig, NULL, 0);
>>
>>          saved_errno = errno;
>>          close(fd);
>>          errno = saved_errno;
>>
>>          if (ret < 0) {
>>                  printf("%s - Failed to send signal %d to process
>%s\n",
>>                         strerror(errno), sig, argv[1]);
>>                  exit(EXIT_FAILURE);
>>          }
>>
>>          exit(EXIT_SUCCESS);
>>  }
>>
>> [1]: 
>https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
>> [2]: 
>https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
>> [3]: 
>https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
>> [4]: 
>https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
>> [5]: 
>https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
>> [6]: 
>https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
>> [7]: 
>https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
>> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
>> [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
>> [10]:
>https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
>> [11]:
>https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
>> [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
>>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>> Cc: Serge Hallyn <serge@hallyn.com>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Andy Lutomirsky <luto@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Florian Weimer <fweimer@redhat.com>
>> Signed-off-by: Christian Brauner <christian@brauner.io>
>> Reviewed-by: Kees Cook <keescook@chromium.org>
>> ---
>> Changelog:
>> v4:
>> - updated asciinema to use "taskfd_" prefix
>> - s/procfd_send_signal/taskfd_send_signal/g
>> - s/proc_is_tgid_procfd/tgid_taskfd_to_pid/b
>> - s/proc_is_tid_procfd/tid_taskfd_to_pid/b
>> - s/__copy_siginfo_from_user_generic/__copy_siginfo_from_user_any/g
>> - make it clear that __copy_siginfo_from_user_any() is a workaround
>caused
>>   by a bug in the original implementation of rt_sigqueueinfo()
>> - when spoofing signals turn them into regular kill signals if
>si_code is
>>   set to SI_USER
>> - make proc_is_t{g}id_procfd() return struct pid to allow proc_pid()
>to
>>   stay private to fs/proc/
>> v3:
>> - add __copy_siginfo_from_user_generic() to avoid adding compat
>syscalls
>> - s/procfd_signal/procfd_send_signal/g
>> - change type of flags argument from int to unsigned int
>> - add comment about what happens to zombies
>> - add proc_is_tid_procfd()
>> - return EOPNOTSUPP when /proc/<pid>/task/<tid> fd is passed so
>userspace
>>   has a way of knowing that tidfds are not supported currently.
>> v2:
>> - define __NR_procfd_signal in unistd.h
>> - wire up compat syscall
>> - s/proc_is_procfd/proc_is_tgid_procfd/g
>> - provide stubs when CONFIG_PROC_FS=n
>> - move proc_pid() to linux/proc_fs.h header
>> - use proc_pid() to grab struct pid from /proc/<pid> fd
>> v1:
>> - patch introduced
>> ---
>>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>>  fs/proc/base.c                         |  20 +++-
>>  include/linux/proc_fs.h                |  12 +++
>>  include/linux/syscalls.h               |   3 +
>>  include/uapi/asm-generic/unistd.h      |   4 +-
>>  kernel/signal.c                        | 132
>+++++++++++++++++++++++--
>>  7 files changed, 164 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
>b/arch/x86/entry/syscalls/syscall_32.tbl
>> index 3cf7b533b3d1..7efb63fd0617 100644
>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>> @@ -398,3 +398,4 @@
>>  384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
>> 
>385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
>>  386	i386	rseq			sys_rseq			__ia32_sys_rseq
>>
>+387	i386	taskfd_send_signal	sys_taskfd_send_signal		__ia32_sys_taskfd_send_signal
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
>b/arch/x86/entry/syscalls/syscall_64.tbl
>> index f0b1709a5ffb..be894f4a84e9 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -343,6 +343,7 @@
>>  332	common	statx			__x64_sys_statx
>>  333	common	io_pgetevents		__x64_sys_io_pgetevents
>>  334	common	rseq			__x64_sys_rseq
>> +335	common	taskfd_send_signal	__x64_sys_taskfd_send_signal
>>  
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache
>impact
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ce3465479447..b8b88bfee455 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -716,8 +716,6 @@ static int proc_pid_permission(struct inode
>*inode, int mask)
>>  	return generic_permission(inode, mask);
>>  }
>>  
>> -
>> -
>>  static const struct inode_operations proc_def_inode_operations = {
>>  	.setattr	= proc_setattr,
>>  };
>> @@ -3038,6 +3036,15 @@ static const struct file_operations
>proc_tgid_base_operations = {
>>  	.llseek		= generic_file_llseek,
>>  };
>>  
>> +struct pid *tgid_taskfd_to_pid(const struct file *file)
>> +{
>> +	if (!d_is_dir(file->f_path.dentry) ||
>> +	    (file->f_op != &proc_tgid_base_operations))
>> +		return ERR_PTR(-EBADF);
>> +
>> +	return proc_pid(file_inode(file));
>> +}
>> +
>>  static struct dentry *proc_tgid_base_lookup(struct inode *dir,
>struct dentry *dentry, unsigned int flags)
>>  {
>>  	return proc_pident_lookup(dir, dentry,
>> @@ -3422,6 +3429,15 @@ static const struct file_operations
>proc_tid_base_operations = {
>>  	.llseek		= generic_file_llseek,
>>  };
>>  
>> +struct pid *tid_taskfd_to_pid(const struct file *file)
>> +{
>> +	if (!d_is_dir(file->f_path.dentry) ||
>> +	    (file->f_op != &proc_tid_base_operations))
>> +		return ERR_PTR(-EBADF);
>> +
>> +	return proc_pid(file_inode(file));
>> +}
>> +
>>  static const struct inode_operations proc_tid_base_inode_operations
>= {
>>  	.lookup		= proc_tid_base_lookup,
>>  	.getattr	= pid_getattr,
>> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
>> index d0e1f1522a78..96817415c420 100644
>> --- a/include/linux/proc_fs.h
>> +++ b/include/linux/proc_fs.h
>> @@ -73,6 +73,8 @@ struct proc_dir_entry
>*proc_create_net_single_write(const char *name, umode_t mo
>>  						    int (*show)(struct seq_file *, void *),
>>  						    proc_write_t write,
>>  						    void *data);
>> +extern struct pid *tgid_taskfd_to_pid(const struct file *file);
>> +extern struct pid *tid_taskfd_to_pid(const struct file *file);
>>  
>>  #else /* CONFIG_PROC_FS */
>>  
>> @@ -114,6 +116,16 @@ static inline int remove_proc_subtree(const char
>*name, struct proc_dir_entry *p
>>  #define proc_create_net(name, mode, parent, state_size, ops)
>({NULL;})
>>  #define proc_create_net_single(name, mode, parent, show, data)
>({NULL;})
>>  
>> +static inline struct pid *tgid_taskfd_to_pid(const struct file
>*file)
>> +{
>> +	return ERR_PTR(-EBADF);
>> +}
>> +
>> +static inline struct pid *tid_taskfd_to_pid(const struct file *file)
>> +{
>> +	return ERR_PTR(-EBADF);
>> +}
>> +
>>  #endif /* CONFIG_PROC_FS */
>>  
>>  struct net;
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index 2ac3d13a915b..5ffe194ef29b 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char
>__user *path, unsigned flags,
>>  			  unsigned mask, struct statx __user *buffer);
>>  asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t
>rseq_len,
>>  			 int flags, uint32_t sig);
>> +asmlinkage long sys_taskfd_send_signal(int taskfd, int sig,
>> +				       siginfo_t __user *info,
>> +				       unsigned int flags);
>>  
>>  /*
>>   * Architecture-specific system calls
>> diff --git a/include/uapi/asm-generic/unistd.h
>b/include/uapi/asm-generic/unistd.h
>> index 538546edbfbd..9343dca63fd9 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -738,9 +738,11 @@ __SYSCALL(__NR_statx,     sys_statx)
>>  __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents,
>compat_sys_io_pgetevents)
>>  #define __NR_rseq 293
>>  __SYSCALL(__NR_rseq, sys_rseq)
>> +#define __NR_taskfd_send_signal 294
>> +__SYSCALL(__NR_taskfd_send_signal, sys_taskfd_send_signal)
>>  
>>  #undef __NR_syscalls
>> -#define __NR_syscalls 294
>> +#define __NR_syscalls 295
>>  
>>  /*
>>   * 32 bit systems traditionally used different
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 9a32bc2088c9..a00a4bcb7605 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -19,7 +19,9 @@
>>  #include <linux/sched/task.h>
>>  #include <linux/sched/task_stack.h>
>>  #include <linux/sched/cputime.h>
>> +#include <linux/file.h>
>>  #include <linux/fs.h>
>> +#include <linux/proc_fs.h>
>>  #include <linux/tty.h>
>>  #include <linux/binfmts.h>
>>  #include <linux/coredump.h>
>> @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait,
>compat_sigset_t __user *, uthese,
>>  }
>>  #endif
>>  
>> +static inline void prepare_kill_siginfo(int sig, struct
>kernel_siginfo *info)
>> +{
>> +	clear_siginfo(info);
>> +	info->si_signo = sig;
>> +	info->si_errno = 0;
>> +	info->si_code = SI_USER;
>> +	info->si_pid = task_tgid_vnr(current);
>> +	info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
>> +}
>> +
>>  /**
>>   *  sys_kill - send a signal to a process
>>   *  @pid: the PID of the process
>> @@ -3295,16 +3307,124 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
>>  {
>>  	struct kernel_siginfo info;
>>  
>> -	clear_siginfo(&info);
>> -	info.si_signo = sig;
>> -	info.si_errno = 0;
>> -	info.si_code = SI_USER;
>> -	info.si_pid = task_tgid_vnr(current);
>> -	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>> +	prepare_kill_siginfo(sig, &info);
>>  
>>  	return kill_something_info(sig, &info, pid);
>>  }
>>  
>> +/*
>> + * Verify that the signaler and signalee either are in the same pid
>namespace
>> + * or that the signaler's pid namespace is an ancestor of the
>signalee's pid
>> + * namespace.
>> + */
>> +static bool may_signal_taskfd(struct pid *pid)
>> +{
>> +	struct pid_namespace *active = task_active_pid_ns(current);
>> +	struct pid_namespace *p = ns_of_pid(pid);
>> +
>> +	for (;;) {
>> +		if (!p)
>> +			return false;
>> +		if (p == active)
>> +			break;
>> +		p = p->parent;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo,
>siginfo_t *info)
>> +{
>> +#ifdef CONFIG_COMPAT
>> +	/*
>> +	 * Avoid hooking up compat syscalls and instead handle necessary
>> +	 * conversions here. Note, this is a stop-gap measure and should
>not be
>> +	 * considered a generic solution.
>> +	 */
>> +	if (in_compat_syscall())
>> +		return copy_siginfo_from_user32(
>> +			kinfo, (struct compat_siginfo __user *)info);
>> +#endif
>> +	return copy_siginfo_from_user(kinfo, info);
>> +}
>> +
>> +/**
>> + *  sys_taskfd_send_signal - send a signal to a process through a
>task file
>> + *                           descriptor
>> + *  @taskfd: the file descriptor of the process
>> + *  @sig:    signal to be sent
>> + *  @info:   the signal info
>> + *  @flags:  future flags to be passed
>> + *
>> + *  Return: 0 on success, negative errno on failure
>> + */
>> +SYSCALL_DEFINE4(taskfd_send_signal, int, taskfd, int, sig,
>> +		siginfo_t __user *, info, unsigned int, flags)
>> +{
>> +	int ret;
>> +	struct fd f;
>> +	struct pid *pid;
>> +	kernel_siginfo_t kinfo;
>> +
>> +	/* Enforce flags be set to 0 until we add an extension. */
>> +	if (flags)
>> +		return -EINVAL;
>> +
>> +	f = fdget_raw(taskfd);
>> +	if (!f.file)
>> +		return -EBADF;
>> +
>> +	pid = tid_taskfd_to_pid(f.file);
>> +	if (!IS_ERR(pid)) {
>> +		/*
>> +		 * Give userspace a way to detect /proc/<pid>/task/<tid>
>> +		 * support when we add it.
>> +		 */
>> +		ret = -EOPNOTSUPP;
>> +		goto err;
>> +	}
>> +
>> +	/* Is this a procfd? */
>> +	pid = tgid_taskfd_to_pid(f.file);
>> +	if (IS_ERR(pid)) {
>> +		ret = PTR_ERR(pid);
>> +		goto err;
>> +	}
>> +
>> +	ret = -EINVAL;
>> +	if (!may_signal_taskfd(pid))
>> +		goto err;
>> +
>> +	if (info) {
>> +		ret = copy_siginfo_from_user_any(&kinfo, info);
>> +		if (unlikely(ret))
>> +			goto err;
>> +
>> +		ret = -EINVAL;
>> +		if (unlikely(sig != kinfo.si_signo))
>> +			goto err;
>> +
>> +		if ((task_pid(current) != pid) &&
>> +		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
>> +			/* Only allow sending arbitrary signals to yourself. */
>> +			ret = -EPERM;
>> +			if (kinfo.si_code != SI_USER)
>> +				goto err;
>> +
>> +			/* Turn this into a regular kill signal. */
>> +			prepare_kill_siginfo(sig, &kinfo);
>> +		}
>> +	} else {
>> +		prepare_kill_siginfo(sig, &kinfo);
>> +	}
>> +
>> +	ret = kill_pid_info(sig, &kinfo, pid);
>> +
>> +err:
>> +	fdput(f);
>> +	return ret;
>> +}
>> +
>>  static int
>>  do_send_specific(pid_t tgid, pid_t pid, int sig, struct
>kernel_siginfo *info)
>>  {
Eric W. Biederman Dec. 6, 2018, 5:24 p.m. UTC | #15
Daniel Colascione <dancol@google.com> writes:

> On Thu, Dec 6, 2018 at 7:02 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Christian Brauner <christian@brauner.io> writes:
>>
>> > The kill() syscall operates on process identifiers (pid). After a process
>> > has exited its pid can be reused by another process. If a caller sends a
>> > signal to a reused pid it will end up signaling the wrong process. This
>> > issue has often surfaced and there has been a push [1] to address this
>> > problem.
>> >
>> > This patch uses file descriptors (fd) from proc/<pid> as stable handles on
>> > struct pid. Even if a pid is recycled the handle will not change. The fd
>> > can be used to send signals to the process it refers to.
>> > Thus, the new syscall taskfd_send_signal() is introduced to solve this
>> > problem. Instead of pids it operates on process fds (taskfd).
>>
>> I am not yet thrilled with the taskfd naming.
>
> Both the old and new names were fine. Do you want to suggest a name at
> this point? You can't just say "I don't like this. Guess again"
> forever.

Both names suck, as neither name actually describes what the function is
designed to do.

Most issues happen at the interface between abstractions.  A name that
confuses your users will just make that confusion more likely.  So it is
important that we do the very best with the name that we can do.

We are already having questions about what happens when you perform the
non-sense operation of sending a signal to a zombie.  It comes up
because there are races when a process may die and you are not expecting
it.  That is an issue with the existing signal sending API, that has
caused confusion.   That isn't half as confusing as the naming issue.

A task in linux is a single thread.  A process is all of the threads.
If we are going to support both cases it doesn't make sense to hard code
a single case in the name.

I would be inclined to simplify things and call the syscall something
like "fdkill(int fd, struct siginfo *info, int flags)".  Or perhaps
just "fd_send_signal(int fd, struct siginfo *info, int flags)".

Then we are not overspecifying what the system call does in the name.
Plus it makes it clear that the fd specifies where the signal goes.
Something I see that by your reply below that you were confused about.

>> Is there any plan to support sesssions and process groups?
>
> Such a thing could be added with flags in the future. Why complicate
> this patch?

Actually that isn't the way this is designed.  You would have to have
another kind of file descriptor.  I am asking because it goes to the
question of naming and what we are trying to do here.

We don't need to implement that but we have already looked into this
kind of extensibility.  If we want the extensibility we should make
room for it, or just close the door.  Having the door half open and a
confusing interface is a problem for users.

>> I am concerned about using kill_pid_info.  It does this:
>>
>>
>>         rcu_read_lock();
>>         p = pid_task(pid, PIDTYPE_PID);
>>         if (p)
>>                 error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
>>         rcu_read_unlock();
>>
>> That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
>> compatibility.  For new interfaces I would strongly prefer pid_task(PIDTYPE_TGID).
>
> What is the bug that PIDTYPE_PID preserves?

I am not 100% certain all of the bits for this to matter have been
merged yet but we are close enough that it would not be hard to make it
matter.

There are two strange behaviours of ordinary kill on the linux kernel
that I am aware of.

1) kill(thread_id,...) where the thread_id is not the id of the first
   thread and the thread_id thus the pid of the process sends the signal
   to the entire process.  Something that arguably should not happen.

2) kill(pid,...) where the original thread has exited performs the
   permission checks against the dead signal group leader.  Which means
   that the permission checks for sending a signal are very likely wrong
   for a multi-threaded processes that calls a function like setuid.

To fix the second case we are going to have to perform the permission
checks on a non-zombie thread.  That is something that is straight
forward to make work with PIDTYPE_TGID.  It is not so easy to make work
with PIDTYPE_PID.

I looked and it doesn't look like I have merged the logic of having
PIDTYPE_TGID point to a living thread when the signal group leader
exits and becomes a zombie.  It isn't hard but it does require some very
delicate surgery on the code, so that we don't break some of the
historic confusion of threads and process groups.

Eric
Christian Brauner Dec. 6, 2018, 5:41 p.m. UTC | #16
On Thu, Dec 06, 2018 at 11:24:28AM -0600, Eric W. Biederman wrote:
> Daniel Colascione <dancol@google.com> writes:
> 
> > On Thu, Dec 6, 2018 at 7:02 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> Christian Brauner <christian@brauner.io> writes:
> >>
> >> > The kill() syscall operates on process identifiers (pid). After a process
> >> > has exited its pid can be reused by another process. If a caller sends a
> >> > signal to a reused pid it will end up signaling the wrong process. This
> >> > issue has often surfaced and there has been a push [1] to address this
> >> > problem.
> >> >
> >> > This patch uses file descriptors (fd) from proc/<pid> as stable handles on
> >> > struct pid. Even if a pid is recycled the handle will not change. The fd
> >> > can be used to send signals to the process it refers to.
> >> > Thus, the new syscall taskfd_send_signal() is introduced to solve this
> >> > problem. Instead of pids it operates on process fds (taskfd).
> >>
> >> I am not yet thrilled with the taskfd naming.
> >
> > Both the old and new names were fine. Do you want to suggest a name at
> > this point? You can't just say "I don't like this. Guess again"
> > forever.
> 
> Both names suck, as neither name actually describes what the function is
> designed to do.
> 
> Most issues happen at the interface between abstractions.  A name that
> confuses your users will just make that confusion more likely.  So it is
> important that we do the very best with the name that we can do.
> 
> We are already having questions about what happens when you perform the
> non-sense operation of sending a signal to a zombie.  It comes up
> because there are races when a process may die and you are not expecting
> it.  That is an issue with the existing signal sending API, that has
> caused confusion.   That isn't half as confusing as the naming issue.
> 
> A task in linux is a single thread.  A process is all of the threads.
> If we are going to support both cases it doesn't make sense to hard code
> a single case in the name.
> 
> I would be inclined to simplify things and call the syscall something
> like "fdkill(int fd, struct siginfo *info, int flags)".  Or perhaps

No, definitely nothing with "kill" will be used because that's
absolutely not expressing what this syscall is doing.

> just "fd_send_signal(int fd, struct siginfo *info, int flags)".
> 
> Then we are not overspecifying what the system call does in the name.

I feel changing the name around by a single persons preferences is not
really a nice thing to do community-wise. So I'd like to hear other
people chime in first before I make that change.

> Plus it makes it clear that the fd specifies where the signal goes.
> Something I see that by your reply below that you were confused about.
> 
> >> Is there any plan to support sesssions and process groups?
> >
> > Such a thing could be added with flags in the future. Why complicate
> > this patch?
> 
> Actually that isn't the way this is designed.  You would have to have
> another kind of file descriptor.  I am asking because it goes to the
> question of naming and what we are trying to do here.
> 
> We don't need to implement that but we have already looked into this
> kind of extensibility.  If we want the extensibility we should make
> room for it, or just close the door.  Having the door half open and a
> confusing interface is a problem for users.
> 
> >> I am concerned about using kill_pid_info.  It does this:
> >>
> >>
> >>         rcu_read_lock();
> >>         p = pid_task(pid, PIDTYPE_PID);
> >>         if (p)
> >>                 error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
> >>         rcu_read_unlock();
> >>
> >> That pid_task(PIDTYPE_PID) is fine for existing callers that need bug
> >> compatibility.  For new interfaces I would strongly prefer pid_task(PIDTYPE_TGID).
> >
> > What is the bug that PIDTYPE_PID preserves?
> 
> I am not 100% certain all of the bits for this to matter have been
> merged yet but we are close enough that it would not be hard to make it
> matter.
> 
> There are two strange behaviours of ordinary kill on the linux kernel
> that I am aware of.
> 
> 1) kill(thread_id,...) where the thread_id is not the id of the first
>    thread and the thread_id thus the pid of the process sends the signal
>    to the entire process.  Something that arguably should not happen.
> 
> 2) kill(pid,...) where the original thread has exited performs the
>    permission checks against the dead signal group leader.  Which means
>    that the permission checks for sending a signal are very likely wrong
>    for a multi-threaded processes that calls a function like setuid.
> 
> To fix the second case we are going to have to perform the permission
> checks on a non-zombie thread.  That is something that is straight
> forward to make work with PIDTYPE_TGID.  It is not so easy to make work
> with PIDTYPE_PID.
> 
> I looked and it doesn't look like I have merged the logic of having
> PIDTYPE_TGID point to a living thread when the signal group leader
> exits and becomes a zombie.  It isn't hard but it does require some very
> delicate surgery on the code, so that we don't break some of the
> historic confusion of threads and process groups.

Then this seems irrelevant to the current patch. It seems we can simply
switch to PIDTYPE_PGID once your new logic lands but not right now.
Kees Cook Dec. 6, 2018, 6:30 p.m. UTC | #17
On Thu, Dec 6, 2018 at 9:41 AM Christian Brauner <christian@brauner.io> wrote:
> I feel changing the name around by a single persons preferences is not
> really a nice thing to do community-wise. So I'd like to hear other
> people chime in first before I make that change.

I don't think the name is hugely critical (but it's always the hardest
to settle on). My preference order would be:

taskfd_send_signal()
pidfd_send_signal()
procfd_send_signal()
fd_send_signal()

But, agreed, I think fdkill() should not be used.
Eric W. Biederman Dec. 6, 2018, 7:17 p.m. UTC | #18
Christian Brauner <christian@brauner.io> writes:

> On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
>>Christian Brauner <christian@brauner.io> writes:
>>
>>> The kill() syscall operates on process identifiers (pid). After a
>>process
>>> has exited its pid can be reused by another process. If a caller
>>sends a
>>> signal to a reused pid it will end up signaling the wrong process.
>>This
>>> issue has often surfaced and there has been a push [1] to address
>>this
>>> problem.
>>>
>>> This patch uses file descriptors (fd) from proc/<pid> as stable
>>handles on
>>> struct pid. Even if a pid is recycled the handle will not change. The
>>fd
>>> can be used to send signals to the process it refers to.
>>> Thus, the new syscall taskfd_send_signal() is introduced to solve
>>this
>>> problem. Instead of pids it operates on process fds (taskfd).
>>
>>I am not yet thrilled with the taskfd naming.
>
> Userspace cares about what does this thing operate on?
> It operates on processes and threads.
> The most common term people use is "task".
> I literally "polled" ten non-kernel people for that purpose and asked:
> "What term would you use to refer to a process and a thread?"
> Turns out it is task. So if find this pretty apt.
> Additionally, the proc manpage uses task in the exact same way (also see the commit message).
> If you can get behind that name even if feeling it's not optimal it would be great.

Once I understand why threads and not process groups.  I don't see that
logic yet.

>>Is there any plan to support sesssions and process groups?
>
> I don't see the necessity.
> As I said in previous mails:
> we can emulate all interesting signal syscalls with this one.

I don't know what you mean by all of the interesting signal system
calls.  I do know you can not replicate kill(2).

Sending signals to a process group the "kill(-pgrp)" case with kill
sends the signals to an atomic snapshot of processes.  If the signal
is SIGKILL then it is guaranteed that then entire process group is
killed with no survivors.

> We succeeded in doing that.

I am not certain you have.

> No need to get more fancy.
> There's currently no obvious need for more features.
> Features should be implemented when someone actually needs them.

That is fair.  I don't understand what you are doing with sending
signals to a thread.  That seems like one of the least useful
corner cases of sending signals.

Eric
Christian Brauner Dec. 6, 2018, 7:30 p.m. UTC | #19
On Thu, Dec 06, 2018 at 01:17:24PM -0600, Eric W. Biederman wrote:
> Christian Brauner <christian@brauner.io> writes:
> 
> > On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
> >>Christian Brauner <christian@brauner.io> writes:
> >>
> >>> The kill() syscall operates on process identifiers (pid). After a
> >>process
> >>> has exited its pid can be reused by another process. If a caller
> >>sends a
> >>> signal to a reused pid it will end up signaling the wrong process.
> >>This
> >>> issue has often surfaced and there has been a push [1] to address
> >>this
> >>> problem.
> >>>
> >>> This patch uses file descriptors (fd) from proc/<pid> as stable
> >>handles on
> >>> struct pid. Even if a pid is recycled the handle will not change. The
> >>fd
> >>> can be used to send signals to the process it refers to.
> >>> Thus, the new syscall taskfd_send_signal() is introduced to solve
> >>this
> >>> problem. Instead of pids it operates on process fds (taskfd).
> >>
> >>I am not yet thrilled with the taskfd naming.
> >
> > Userspace cares about what does this thing operate on?
> > It operates on processes and threads.
> > The most common term people use is "task".
> > I literally "polled" ten non-kernel people for that purpose and asked:
> > "What term would you use to refer to a process and a thread?"
> > Turns out it is task. So if find this pretty apt.
> > Additionally, the proc manpage uses task in the exact same way (also see the commit message).
> > If you can get behind that name even if feeling it's not optimal it would be great.
> 
> Once I understand why threads and not process groups.  I don't see that
> logic yet.

The point is: userspace takes "task" to be a generic term for processes
and tasks. Which is what is important. The term also covers process
groups for all that its worth. Most of userspace isn't even aware of
that distinction necessarily.

fd_send_signal() makes the syscall name meaningless: what is userspace
signaling too? The point being that there's a lot more that you require
userspace to infer from fd_send_signal() than from task_send_signal()
where most people get the right idea right away: "signals to a process
or thread".

> 
> >>Is there any plan to support sesssions and process groups?
> >
> > I don't see the necessity.
> > As I said in previous mails:
> > we can emulate all interesting signal syscalls with this one.
> 
> I don't know what you mean by all of the interesting signal system
> calls.  I do know you can not replicate kill(2).

[1]: You cannot replicate certain aspects of kill *yet*. We have
established this before. If we want process group support later we do
have the flags argument to extend the sycall.

> 
> Sending signals to a process group the "kill(-pgrp)" case with kill
> sends the signals to an atomic snapshot of processes.  If the signal
> is SIGKILL then it is guaranteed that then entire process group is
> killed with no survivors.

See [1].

> 
> > We succeeded in doing that.
> 
> I am not certain you have.

See [1].

> 
> > No need to get more fancy.
> > There's currently no obvious need for more features.
> > Features should be implemented when someone actually needs them.
> 
> That is fair.  I don't understand what you are doing with sending
> signals to a thread.  That seems like one of the least useful
> corner cases of sending signals.

It's what glibc and Florian care about for pthreads and their our
biggest user atm so they get some I'd argue they get some say in this. :)

> 
> Eric
Eric W. Biederman Dec. 6, 2018, 8:29 p.m. UTC | #20
Christian Brauner <christian@brauner.io> writes:

> On Thu, Dec 06, 2018 at 01:17:24PM -0600, Eric W. Biederman wrote:
>> Christian Brauner <christian@brauner.io> writes:
>> 
>> > On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
>> >>Christian Brauner <christian@brauner.io> writes:
>> >>
>> >>> The kill() syscall operates on process identifiers (pid). After a
>> >>process
>> >>> has exited its pid can be reused by another process. If a caller
>> >>sends a
>> >>> signal to a reused pid it will end up signaling the wrong process.
>> >>This
>> >>> issue has often surfaced and there has been a push [1] to address
>> >>this
>> >>> problem.
>> >>>
>> >>> This patch uses file descriptors (fd) from proc/<pid> as stable
>> >>handles on
>> >>> struct pid. Even if a pid is recycled the handle will not change. The
>> >>fd
>> >>> can be used to send signals to the process it refers to.
>> >>> Thus, the new syscall taskfd_send_signal() is introduced to solve
>> >>this
>> >>> problem. Instead of pids it operates on process fds (taskfd).
>> >>
>> >>I am not yet thrilled with the taskfd naming.
>> >
>> > Userspace cares about what does this thing operate on?
>> > It operates on processes and threads.
>> > The most common term people use is "task".
>> > I literally "polled" ten non-kernel people for that purpose and asked:
>> > "What term would you use to refer to a process and a thread?"
>> > Turns out it is task. So if find this pretty apt.
>> > Additionally, the proc manpage uses task in the exact same way (also see the commit message).
>> > If you can get behind that name even if feeling it's not optimal it would be great.
>> 
>> Once I understand why threads and not process groups.  I don't see that
>> logic yet.
>
> The point is: userspace takes "task" to be a generic term for processes
> and tasks. Which is what is important. The term also covers process
> groups for all that its worth. Most of userspace isn't even aware of
> that distinction necessarily.
>
> fd_send_signal() makes the syscall name meaningless: what is userspace
> signaling too? The point being that there's a lot more that you require
> userspace to infer from fd_send_signal() than from task_send_signal()
> where most people get the right idea right away: "signals to a process
> or thread".
>
>> 
>> >>Is there any plan to support sesssions and process groups?
>> >
>> > I don't see the necessity.
>> > As I said in previous mails:
>> > we can emulate all interesting signal syscalls with this one.
>> 
>> I don't know what you mean by all of the interesting signal system
>> calls.  I do know you can not replicate kill(2).
>
> [1]: You cannot replicate certain aspects of kill *yet*. We have
> established this before. If we want process group support later we do
> have the flags argument to extend the sycall.

Then you have horrible contradiction in the API.

Either the grouping is a property of your file descriptor or the
grouping comes from the flags argument.

If the grouping is specified in the flags argument then pidfd is the
proper name for your file descriptors, and the appropriate prefix
for your system call.

If the grouping is a property of your file descriptor it does not make
sense to talk about using the flags argument later.

Your intention is to add the thread case to support pthreads once the
process case is sorted out.  So this is something that needs to be made
clear.  Did I miss how you plan to handle threads?

And this fundamentally and definitely gets into all of my concerns about
proper handling of pid_task and PIDTYPE_TGID etc.

>> Sending signals to a process group the "kill(-pgrp)" case with kill
>> sends the signals to an atomic snapshot of processes.  If the signal
>> is SIGKILL then it is guaranteed that then entire process group is
>> killed with no survivors.
>
> See [1].
>
>> 
>> > We succeeded in doing that.
>> 
>> I am not certain you have.
>
> See [1].
>
>> 
>> > No need to get more fancy.
>> > There's currently no obvious need for more features.
>> > Features should be implemented when someone actually needs them.
>> 
>> That is fair.  I don't understand what you are doing with sending
>> signals to a thread.  That seems like one of the least useful
>> corner cases of sending signals.
>
> It's what glibc and Florian care about for pthreads and their our
> biggest user atm so they get some I'd argue they get some say in this. :)

Fair enough.  If glibc could use this then we certainly have users and a
user case.

Eric
Daniel Colascione Dec. 6, 2018, 8:37 p.m. UTC | #21
On Thu, Dec 6, 2018 at 12:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Christian Brauner <christian@brauner.io> writes:
>
> > On Thu, Dec 06, 2018 at 01:17:24PM -0600, Eric W. Biederman wrote:
> >> Christian Brauner <christian@brauner.io> writes:
> >>
> >> > On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
> >> >>Christian Brauner <christian@brauner.io> writes:
> >> >>
> >> >>> The kill() syscall operates on process identifiers (pid). After a
> >> >>process
> >> >>> has exited its pid can be reused by another process. If a caller
> >> >>sends a
> >> >>> signal to a reused pid it will end up signaling the wrong process.
> >> >>This
> >> >>> issue has often surfaced and there has been a push [1] to address
> >> >>this
> >> >>> problem.
> >> >>>
> >> >>> This patch uses file descriptors (fd) from proc/<pid> as stable
> >> >>handles on
> >> >>> struct pid. Even if a pid is recycled the handle will not change. The
> >> >>fd
> >> >>> can be used to send signals to the process it refers to.
> >> >>> Thus, the new syscall taskfd_send_signal() is introduced to solve
> >> >>this
> >> >>> problem. Instead of pids it operates on process fds (taskfd).
> >> >>
> >> >>I am not yet thrilled with the taskfd naming.
> >> >
> >> > Userspace cares about what does this thing operate on?
> >> > It operates on processes and threads.
> >> > The most common term people use is "task".
> >> > I literally "polled" ten non-kernel people for that purpose and asked:
> >> > "What term would you use to refer to a process and a thread?"
> >> > Turns out it is task. So if find this pretty apt.
> >> > Additionally, the proc manpage uses task in the exact same way (also see the commit message).
> >> > If you can get behind that name even if feeling it's not optimal it would be great.
> >>
> >> Once I understand why threads and not process groups.  I don't see that
> >> logic yet.
> >
> > The point is: userspace takes "task" to be a generic term for processes
> > and tasks. Which is what is important. The term also covers process
> > groups for all that its worth. Most of userspace isn't even aware of
> > that distinction necessarily.
> >
> > fd_send_signal() makes the syscall name meaningless: what is userspace
> > signaling too? The point being that there's a lot more that you require
> > userspace to infer from fd_send_signal() than from task_send_signal()
> > where most people get the right idea right away: "signals to a process
> > or thread".
> >
> >>
> >> >>Is there any plan to support sesssions and process groups?
> >> >
> >> > I don't see the necessity.
> >> > As I said in previous mails:
> >> > we can emulate all interesting signal syscalls with this one.
> >>
> >> I don't know what you mean by all of the interesting signal system
> >> calls.  I do know you can not replicate kill(2).
> >
> > [1]: You cannot replicate certain aspects of kill *yet*. We have
> > established this before. If we want process group support later we do
> > have the flags argument to extend the sycall.
>
> Then you have horrible contradiction in the API.
>
> Either the grouping is a property of your file descriptor or the
> grouping comes from the flags argument.
>
> If the grouping is specified in the flags argument then pidfd is the
> proper name for your file descriptors, and the appropriate prefix
> for your system call.

Yes and no. "taskfd" is fine, since even if we do add a
kill-process-group capability, the general facility under discussion
is still *about* tasks in general, so "taskfd" still tells you in a
general sense what the thing does. "pidfd" would be wrong, and for the
same reason that the kernel's "struct pid" is badly-named: the object
being named is a *task*, and signaling a particular task instead of
whatever task happens to be labeled with a particular numeric PID at
the time of all is the whole point of this change.

> If the grouping is a property of your file descriptor it does not make
> sense to talk about using the flags argument later.
>
> Your intention is to add the thread case to support pthreads once the
> process case is sorted out.  So this is something that needs to be made
> clear.  Did I miss how you plan to handle threads?
>
> And this fundamentally and definitely gets into all of my concerns about
> proper handling of pid_task and PIDTYPE_TGID etc.

To the extent that it's possible, this system call should mimic the
behavior of a signal-send to a positive numeric PID (i.e., a specific
task), so if we change one, we should change both.
Christian Brauner Dec. 6, 2018, 9:31 p.m. UTC | #22
On Thu, Dec 06, 2018 at 02:29:13PM -0600, Eric W. Biederman wrote:
> Christian Brauner <christian@brauner.io> writes:
> 
> > On Thu, Dec 06, 2018 at 01:17:24PM -0600, Eric W. Biederman wrote:
> >> Christian Brauner <christian@brauner.io> writes:
> >> 
> >> > On December 7, 2018 4:01:19 AM GMT+13:00, ebiederm@xmission.com wrote:
> >> >>Christian Brauner <christian@brauner.io> writes:
> >> >>
> >> >>> The kill() syscall operates on process identifiers (pid). After a
> >> >>process
> >> >>> has exited its pid can be reused by another process. If a caller
> >> >>sends a
> >> >>> signal to a reused pid it will end up signaling the wrong process.
> >> >>This
> >> >>> issue has often surfaced and there has been a push [1] to address
> >> >>this
> >> >>> problem.
> >> >>>
> >> >>> This patch uses file descriptors (fd) from proc/<pid> as stable
> >> >>handles on
> >> >>> struct pid. Even if a pid is recycled the handle will not change. The
> >> >>fd
> >> >>> can be used to send signals to the process it refers to.
> >> >>> Thus, the new syscall taskfd_send_signal() is introduced to solve
> >> >>this
> >> >>> problem. Instead of pids it operates on process fds (taskfd).
> >> >>
> >> >>I am not yet thrilled with the taskfd naming.
> >> >
> >> > Userspace cares about what does this thing operate on?
> >> > It operates on processes and threads.
> >> > The most common term people use is "task".
> >> > I literally "polled" ten non-kernel people for that purpose and asked:
> >> > "What term would you use to refer to a process and a thread?"
> >> > Turns out it is task. So if find this pretty apt.
> >> > Additionally, the proc manpage uses task in the exact same way (also see the commit message).
> >> > If you can get behind that name even if feeling it's not optimal it would be great.
> >> 
> >> Once I understand why threads and not process groups.  I don't see that
> >> logic yet.
> >
> > The point is: userspace takes "task" to be a generic term for processes
> > and tasks. Which is what is important. The term also covers process
> > groups for all that its worth. Most of userspace isn't even aware of
> > that distinction necessarily.
> >
> > fd_send_signal() makes the syscall name meaningless: what is userspace
> > signaling too? The point being that there's a lot more that you require
> > userspace to infer from fd_send_signal() than from task_send_signal()
> > where most people get the right idea right away: "signals to a process
> > or thread".
> >
> >> 
> >> >>Is there any plan to support sesssions and process groups?
> >> >
> >> > I don't see the necessity.
> >> > As I said in previous mails:
> >> > we can emulate all interesting signal syscalls with this one.
> >> 
> >> I don't know what you mean by all of the interesting signal system
> >> calls.  I do know you can not replicate kill(2).
> >
> > [1]: You cannot replicate certain aspects of kill *yet*. We have
> > established this before. If we want process group support later we do
> > have the flags argument to extend the sycall.
> 
> Then you have horrible contradiction in the API.
> 
> Either the grouping is a property of your file descriptor or the
> grouping comes from the flags argument.

See the first part of Daniel's answer in [1] answer which makes sense to
me. I won't repeat it here.
[1]: https://lore.kernel.org/lkml/CAKOZuevgPv1CbAZF-ha0njdq6rd6QkU7T8qmmERJLsL45CeVzg@mail.gmail.com/

> 
> If the grouping is specified in the flags argument then pidfd is the
> proper name for your file descriptors, and the appropriate prefix
> for your system call.
> 
> If the grouping is a property of your file descriptor it does not make
> sense to talk about using the flags argument later.
> 
> Your intention is to add the thread case to support pthreads once the
> process case is sorted out.  So this is something that needs to be made
> clear.  Did I miss how you plan to handle threads?

Yeah, maybe you missed it in the commit message [2] which is based on a
discussion with Andy [3] and Arnd [4]:

[2]: https://lore.kernel.org/lkml/20181206121858.12215-1-christian@brauner.io/
     /* taskfd_send_signal() replaces multiple pid-based syscalls */
    The taskfd_send_signal() syscall currently takes on the job of the
    following syscalls that operate on pids:
    - kill(2)
    - rt_sigqueueinfo(2)
    The syscall is defined in such a way that it can also operate on thread fds
    instead of process fds. In a future patchset I will extend it to operate on
    taskfds from /proc/<pid>/task/<tid> at which point it will additionally
    take on the job of:
    - tgkill(2)
    - rt_tgsigqueueinfo(2)
    Right now this is intentionally left out to keep this patchset as simple as
    possible (cf. [4]). If a taskfd of /proc/<pid>/task/<tid> is passed
    EOPNOTSUPP will be returned to give userspace a way to detect when I add
    support for such taskfds (cf. [10]).

[3]: https://lore.kernel.org/lkml/CALCETrUY=Hk6=BjgPuDBgj5J1T_b5Q5u1uVOWHjGWXwmHoZBEQ@mail.gmail.com/
     > Yes, I see no reason why not. My idea is to extend it - after we have a
     > basic version in - to also work with:
     > /proc/<pid>/task/<tid>
     > If I'm not mistaken this should be sufficient to get rt_tgsigqueueinfo.
     > The thread will be uniquely identified by the tid descriptor and no
     > combination of /proc/<pid> and /proc/<pid>/task/<tid> is needed. Does
     > that sound reasonable?
    
     Yes. So it would currently replace rt_gsigqueueinfo() but
     not rt_tgsigqueueinfo(), and could be extended to do both
     afterwards, without making the interface ugly in any form?
     
     I suppose we can always add more flags if needed, and you
     already ensure that flags is zero for the moment.

[4]: https://lore.kernel.org/lkml/CAK8P3a1_if7+Ak2eefU6JeZT9KW827gkLHaObX-QOsHrB5ZwfA@mail.gmail.com/
     "Is the current procfd_signal() proposal (under whichever name)
     sufficient to correctly implement both sys_rt_sigqueueinfo() and
     sys_rt_tgsigqueueinfo()?"

> 
> And this fundamentally and definitely gets into all of my concerns about
> proper handling of pid_task and PIDTYPE_TGID etc.
> 
> >> Sending signals to a process group the "kill(-pgrp)" case with kill
> >> sends the signals to an atomic snapshot of processes.  If the signal
> >> is SIGKILL then it is guaranteed that then entire process group is
> >> killed with no survivors.
> >
> > See [1].
> >
> >> 
> >> > We succeeded in doing that.
> >> 
> >> I am not certain you have.
> >
> > See [1].
> >
> >> 
> >> > No need to get more fancy.
> >> > There's currently no obvious need for more features.
> >> > Features should be implemented when someone actually needs them.
> >> 
> >> That is fair.  I don't understand what you are doing with sending
> >> signals to a thread.  That seems like one of the least useful
> >> corner cases of sending signals.
> >
> > It's what glibc and Florian care about for pthreads and their our
> > biggest user atm so they get some I'd argue they get some say in this. :)
> 
> Fair enough.  If glibc could use this then we certainly have users and a
> user case.

Florian was asking specifically in [5]:

[5]: https://lore.kernel.org/lkml/87in0g5aqo.fsf@oldenburg.str.redhat.com/
     "Looking at the rt_tgsigqueueinfo interface, is there a way to implement
     the “tg” part with the current procfd_signal interface?"

> 
> Eric
Eric W. Biederman Dec. 6, 2018, 9:46 p.m. UTC | #23
Christian Brauner <christian@brauner.io> writes:

>> Your intention is to add the thread case to support pthreads once the
>> process case is sorted out.  So this is something that needs to be made
>> clear.  Did I miss how you plan to handle threads?
>
> Yeah, maybe you missed it in the commit message [2] which is based on a
> discussion with Andy [3] and Arnd [4]:

Looking at your references I haven't missed it.  You are not deciding
anything as of yet to keep it simple.  Except you are returning
EOPNOTSUPP.  You are very much intending to do something.

Decide.  Do you use the flags parameter or is the width of the
target depending on the flags.

That is fundamental to how the system call and it's extensions work.
That is fundamental to my review.

Until that is decided.
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

There are a lot of fundamental maintenance issues and you can very easily
get them wrong if you are not clear on the job of the file descriptor
and the job of the flags argument.

I want don't want new crap that we have to abandon that has a nasty set
of bugs because no one wanted to think through the system call all of
the way and understand the corner cases.

Eric
Daniel Colascione Dec. 6, 2018, 10:01 p.m. UTC | #24
On Thu, Dec 6, 2018 at 1:47 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Christian Brauner <christian@brauner.io> writes:
>
> >> Your intention is to add the thread case to support pthreads once the
> >> process case is sorted out.  So this is something that needs to be made
> >> clear.  Did I miss how you plan to handle threads?
> >
> > Yeah, maybe you missed it in the commit message [2] which is based on a
> > discussion with Andy [3] and Arnd [4]:
>
> Looking at your references I haven't missed it.  You are not deciding
> anything as of yet to keep it simple.  Except you are returning
> EOPNOTSUPP.  You are very much intending to do something.

So what *should* happen in that case? A panic? Come on. There's
nothing wrong with returning an error pending an expansion of
capabilities later.

> Decide.  Do you use the flags parameter or is the width of the
> target depending on the flags.

Huh?

> That is fundamental to how the system call and it's extensions work.
> That is fundamental to my review.

Your review makes no sense and comes off as an increasingly nitpicky
strategy of blocking the change no matter what Christian does. On
several occasions, you've just said "no, I don't like this" without
constructively trying to suggest an alternative that allows us to make
progress. That's obstruction, and this patch should get into the tree
over your nack.

> Until that is decided.
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> There are a lot of fundamental maintenance issues and you can very easily
> get them wrong if you are not clear on the job of the file descriptor
> and the job of the flags argument.

There are no maintenance issues. Christian has bent over backwards
trying to address all the code concerns raised in this thread, and
nothing has been good enough.

> I want don't want new crap that we have to abandon that has a nasty set
> of bugs because no one wanted to think through the system call all of
> the way and understand the corner cases.

What bugs? You have identified no bugs. There is no problem with the
API signature. It signals a task. You get that from proc.
Eric W. Biederman Dec. 6, 2018, 10:22 p.m. UTC | #25
Daniel Colascione <dancol@google.com> writes:

> On Thu, Dec 6, 2018 at 12:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Christian Brauner <christian@brauner.io> writes:
>>
>> > [1]: You cannot replicate certain aspects of kill *yet*. We have
>> > established this before. If we want process group support later we do
>> > have the flags argument to extend the sycall.
>>
>> Then you have horrible contradiction in the API.
>>
>> Either the grouping is a property of your file descriptor or the
>> grouping comes from the flags argument.
>>
>> If the grouping is specified in the flags argument then pidfd is the
>> proper name for your file descriptors, and the appropriate prefix
>> for your system call.
>
> Yes and no. "taskfd" is fine, since even if we do add a
> kill-process-group capability, the general facility under discussion
> is still *about* tasks in general, so "taskfd" still tells you in a
> general sense what the thing does. "pidfd" would be wrong, and for the
> same reason that the kernel's "struct pid" is badly-named: the object
> being named is a *task*, and signaling a particular task instead of
> whatever task happens to be labeled with a particular numeric PID at
> the time of all is the whole point of this change.

No.  The object being named by struct pid is an identifier.

The same identifier can label processes, the same identifier can
label threads, the same identifier can label process groups the
same identifier can label sessions.

That is fundamental to how that identifier works.

Now the new file descriptor can either be a handle to the struct pid
the general purpose identifier, or the file descriptor can be a handle
to a process.  Other file descriptor types would be handles to different
kinds of processes.


When people tell me the new interface can handle process groups or
sessions by just adding a flag.  I hear they want to have an handle
to struct pid then general purpose identifier.  Because it doesn't make
any sense at all to add a flag to when you need to add a new file
descriptor type to keep the API consistent.

Later when I hear people say that they want an identifier for a process
and not have a handle to a general purpose identifier I think they need
to think about differnt types of file descriptors for the different
purposes.  The flags argument does not help at all there.

>> If the grouping is a property of your file descriptor it does not make
>> sense to talk about using the flags argument later.
>>
>> Your intention is to add the thread case to support pthreads once the
>> process case is sorted out.  So this is something that needs to be made
>> clear.  Did I miss how you plan to handle threads?
>>
>> And this fundamentally and definitely gets into all of my concerns about
>> proper handling of pid_task and PIDTYPE_TGID etc.
>
> To the extent that it's possible, this system call should mimic the
> behavior of a signal-send to a positive numeric PID (i.e., a specific
> task), so if we change one, we should change both.

There is a special case in kill(2) where we treat an identifier that
only identifies a thread as an identifier of a process.  We should
not copy that.

Making kill(2) only accept the pid of thread group leaders would make
all kinds of sense.  Unfortunately we need to stay bug compatible with
past versions of linux.  So kill(2) won't be changing unless it appears
clear that not a single application cares about that case.

I will eventually be changing kill(2) to not sending signals through
zombie group leaders so the permission checks are performed correctly.

All I am asking is that we don't copy that bug compatibility code into a
new system call, and that use use the current internal APIs in a fashion
that shows you are not replicating misfeatures of old system calls.

I know it takes understanding a little more kernel code but for people
designing a signal sending API it is not unreasonable to ask that they
understand the current APIs and the current code.  The fact that this
discussion has revealed some significant misconceptions of how the
current code works, and what I am asking for concerns me.

Eric
Serge E. Hallyn Dec. 6, 2018, 10:27 p.m. UTC | #26
On Thu, Dec 06, 2018 at 10:30:40AM -0800, Kees Cook wrote:
> On Thu, Dec 6, 2018 at 9:41 AM Christian Brauner <christian@brauner.io> wrote:
> > I feel changing the name around by a single persons preferences is not
> > really a nice thing to do community-wise. So I'd like to hear other
> > people chime in first before I make that change.
> 
> I don't think the name is hugely critical (but it's always the hardest
> to settle on). My preference order would be:
> 
> taskfd_send_signal()
> pidfd_send_signal()
> procfd_send_signal()
> fd_send_signal()

imo, either procfd_send_signal() or taskfd_send_signal()

It seems to me that using flags later to specify sending to pgrp vs thread
is fine:  it's specifying how to interpret the 'fd' in 'procfd_send_signal()'.

> But, agreed, I think fdkill() should not be used.
> 
> -- 
> Kees Cook
Christian Brauner Dec. 6, 2018, 10:39 p.m. UTC | #27
On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> Christian Brauner <christian@brauner.io> writes:
> 
> >> Your intention is to add the thread case to support pthreads once the
> >> process case is sorted out.  So this is something that needs to be made
> >> clear.  Did I miss how you plan to handle threads?
> >
> > Yeah, maybe you missed it in the commit message [2] which is based on a
> > discussion with Andy [3] and Arnd [4]:
> 
> Looking at your references I haven't missed it.  You are not deciding
> anything as of yet to keep it simple.  Except you are returning
> EOPNOTSUPP.  You are very much intending to do something.

That was clear all along and was pointed at every occassion in the
threads. I even went through the hazzle to give you all of the
references when there's lore.kernel.org.

> 
> Decide.  Do you use the flags parameter or is the width of the
> target depending on the flags.

I'm sorry I don't understand what you a) mean with this sentence "width
of the target" and b) what you *exactly* want from us to get this accepted.

> 
> That is fundamental to how the system call and it's extensions work.
> That is fundamental to my review.
> 
> Until that is decided.
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> There are a lot of fundamental maintenance issues and you can very easily
> get them wrong if you are not clear on the job of the file descriptor
> and the job of the flags argument.
> 
> I want don't want new crap that we have to abandon that has a nasty set
> of bugs because no one wanted to think through the system call all of
> the way and understand the corner cases.

That's why this system call is exactly kept as simple as it is. Which I
had to fight for! You kept consistenly asking for features such as:
- add a pid parameter maybe
- allow to signal into ancestor and cousin pid namespaces
I fought all of those off for the sake of keeping this patch as simple
as possible so that we can get it in.
We don't even need to decide whether we go for flags or another type of
fd right now. The point was that we can postpone this to a later
discussion. Which also has been discussed in multiple threads. That's
literally what I wrote in prior mails and nearly everyone agreed that
this is a good strategy.

I honestly have no idea what to make of your review given that it
started from a naming issue and suddenly the whole api is crap.
If you want this to be named fd_send_signal() to get in. Then seriously,
I'm happy to rename it. I honestly don't care even though I think taskfd
is a better name.

> 
> Eric
Daniel Colascione Dec. 6, 2018, 10:43 p.m. UTC | #28
On Thu, Dec 6, 2018 at 2:22 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Daniel Colascione <dancol@google.com> writes:
>
> > On Thu, Dec 6, 2018 at 12:29 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Christian Brauner <christian@brauner.io> writes:
> >>
> >> > [1]: You cannot replicate certain aspects of kill *yet*. We have
> >> > established this before. If we want process group support later we do
> >> > have the flags argument to extend the sycall.
> >>
> >> Then you have horrible contradiction in the API.
> >>
> >> Either the grouping is a property of your file descriptor or the
> >> grouping comes from the flags argument.
> >>
> >> If the grouping is specified in the flags argument then pidfd is the
> >> proper name for your file descriptors, and the appropriate prefix
> >> for your system call.
> >
> > Yes and no. "taskfd" is fine, since even if we do add a
> > kill-process-group capability, the general facility under discussion
> > is still *about* tasks in general, so "taskfd" still tells you in a
> > general sense what the thing does. "pidfd" would be wrong, and for the
> > same reason that the kernel's "struct pid" is badly-named: the object
> > being named is a *task*, and signaling a particular task instead of
> > whatever task happens to be labeled with a particular numeric PID at
> > the time of all is the whole point of this change.
>
> No.  The object being named by struct pid is an identifier.
>
> The same identifier can label processes, the same identifier can
> label threads, the same identifier can label process groups the
> same identifier can label sessions.
>
> That is fundamental to how that identifier works.
>
> Now the new file descriptor can either be a handle to the struct pid
> the general purpose identifier, or the file descriptor can be a handle
> to a process.  Other file descriptor types would be handles to different
> kinds of processes.

If you open /proc/pid, you get a file descriptor of type A. If you
pass that file descriptor to taskfd_signal, you signal the process. If
you open a file descriptor to /proc/pid/task/tid, you get a file
descriptor of type B, and if you pass that file descriptor to
taskfd_signal, you signal the thread. A and B are *already* distinct.
The only reason you'd want to require a flag for case B is to reduce
the probability of error. The flag wouldn't let you open /proc/pid,
yielding an A, and then pass it to taskfd_signal with a flag saying
"interpret as B".

I understand that it'd be inconsistent to later add a flag saying
"interpret this A as a reference to its whole process group", because
with respect to the difference between threads and processes, it's the
file descriptor type that controls the operation invoked, not the
flags. We can solve that by introducing a new FD type for a process
group or by just allowing the "casting" via flag, but we're not at the
point of having to decide that just yet, and I don't think the API is
particularly bad either way.

If you don't want to require a flag value when passing a B, then fine.
It'll work either with or without the flag. If, instead, you want the
flag to allow "casting" an A to a B or vice versa, that's fine too.

> When people tell me the new interface can handle process groups or
> sessions by just adding a flag.  I hear they want to have an handle
> to struct pid then general purpose identifier.  Because it doesn't make
> any sense at all to add a flag to when you need to add a new file
> descriptor type to keep the API consistent.
>
> Later when I hear people say that they want an identifier for a process
> and not have a handle to a general purpose identifier I think they need
> to think about differnt types of file descriptors for the different
> purposes.  The flags argument does not help at all there.
>
> >> If the grouping is a property of your file descriptor it does not make
> >> sense to talk about using the flags argument later.
> >>
> >> Your intention is to add the thread case to support pthreads once the
> >> process case is sorted out.  So this is something that needs to be made
> >> clear.  Did I miss how you plan to handle threads?
> >>
> >> And this fundamentally and definitely gets into all of my concerns about
> >> proper handling of pid_task and PIDTYPE_TGID etc.
> >
> > To the extent that it's possible, this system call should mimic the
> > behavior of a signal-send to a positive numeric PID (i.e., a specific
> > task), so if we change one, we should change both.
>
> There is a special case in kill(2) where we treat an identifier that
> only identifies a thread as an identifier of a process.  We should
> not copy that.

Sure.
Christian Brauner Dec. 6, 2018, 11:17 p.m. UTC | #29
On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > Christian Brauner <christian@brauner.io> writes:
> > 
> > >> Your intention is to add the thread case to support pthreads once the
> > >> process case is sorted out.  So this is something that needs to be made
> > >> clear.  Did I miss how you plan to handle threads?
> > >
> > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > discussion with Andy [3] and Arnd [4]:
> > 
> > Looking at your references I haven't missed it.  You are not deciding
> > anything as of yet to keep it simple.  Except you are returning
> > EOPNOTSUPP.  You are very much intending to do something.
> 
> That was clear all along and was pointed at every occassion in the
> threads. I even went through the hazzle to give you all of the
> references when there's lore.kernel.org.
> 
> > 
> > Decide.  Do you use the flags parameter or is the width of the
> > target depending on the flags.

Ok, let's try to be constructive. I understand the general concern for
the future so let's put a contract into the commit message stating that
the width of the target aka *what is signaled* will be based on a flag
parameter if we ever extend it:

taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);

with the current default being

taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);

This seems to me the cleanest solution as we only use one type of file
descriptor. Can everyone be on board with this? If so I'm going to send
out a new version of the patch.

Christian
Serge E. Hallyn Dec. 7, 2018, 12:31 a.m. UTC | #30
On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > Christian Brauner <christian@brauner.io> writes:
> > > 
> > > >> Your intention is to add the thread case to support pthreads once the
> > > >> process case is sorted out.  So this is something that needs to be made
> > > >> clear.  Did I miss how you plan to handle threads?
> > > >
> > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > discussion with Andy [3] and Arnd [4]:
> > > 
> > > Looking at your references I haven't missed it.  You are not deciding
> > > anything as of yet to keep it simple.  Except you are returning
> > > EOPNOTSUPP.  You are very much intending to do something.
> > 
> > That was clear all along and was pointed at every occassion in the
> > threads. I even went through the hazzle to give you all of the
> > references when there's lore.kernel.org.
> > 
> > > 
> > > Decide.  Do you use the flags parameter or is the width of the
> > > target depending on the flags.
> 
> Ok, let's try to be constructive. I understand the general concern for
> the future so let's put a contract into the commit message stating that
> the width of the target aka *what is signaled* will be based on a flag
> parameter if we ever extend it:
> 
> taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> 
> with the current default being
> 
> taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> 
> This seems to me the cleanest solution as we only use one type of file
> descriptor. Can everyone be on board with this? If so I'm going to send
> out a new version of the patch.
> 
> Christian

I'm on board with this, but I think you need to also clarify what exactly
the fd stands for.  I think that (a) userspace should not have to care
about the struct pid implementation, and so (b) the procfd should stand
for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
becomes implemented, then open(/proc/5) will pin all three pids, as will
open(/proc/5/task/6).

-serge
Daniel Colascione Dec. 7, 2018, 12:34 a.m. UTC | #31
On Thu, Dec 6, 2018 at 4:31 PM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> > On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > > Christian Brauner <christian@brauner.io> writes:
> > > >
> > > > >> Your intention is to add the thread case to support pthreads once the
> > > > >> process case is sorted out.  So this is something that needs to be made
> > > > >> clear.  Did I miss how you plan to handle threads?
> > > > >
> > > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > > discussion with Andy [3] and Arnd [4]:
> > > >
> > > > Looking at your references I haven't missed it.  You are not deciding
> > > > anything as of yet to keep it simple.  Except you are returning
> > > > EOPNOTSUPP.  You are very much intending to do something.
> > >
> > > That was clear all along and was pointed at every occassion in the
> > > threads. I even went through the hazzle to give you all of the
> > > references when there's lore.kernel.org.
> > >
> > > >
> > > > Decide.  Do you use the flags parameter or is the width of the
> > > > target depending on the flags.
> >
> > Ok, let's try to be constructive. I understand the general concern for
> > the future so let's put a contract into the commit message stating that
> > the width of the target aka *what is signaled* will be based on a flag
> > parameter if we ever extend it:
> >
> > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> >
> > with the current default being
> >
> > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> >
> > This seems to me the cleanest solution as we only use one type of file
> > descriptor. Can everyone be on board with this? If so I'm going to send
> > out a new version of the patch.
> >
> > Christian
>
> I'm on board with this, but I think you need to also clarify what exactly
> the fd stands for.  I think that (a) userspace should not have to care
> about the struct pid implementation, and so (b) the procfd should stand
> for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
> becomes implemented, then open(/proc/5) will pin all three pids, as will
> open(/proc/5/task/6).

This change doesn't "pin" any PID, and it makes no sense to make a
process FD stand for all its threads. What does that even mean?
Serge E. Hallyn Dec. 7, 2018, 12:59 a.m. UTC | #32
On Thu, Dec 06, 2018 at 04:34:54PM -0800, Daniel Colascione wrote:
> On Thu, Dec 6, 2018 at 4:31 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> >
> > On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> > > On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > > > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > > > Christian Brauner <christian@brauner.io> writes:
> > > > >
> > > > > >> Your intention is to add the thread case to support pthreads once the
> > > > > >> process case is sorted out.  So this is something that needs to be made
> > > > > >> clear.  Did I miss how you plan to handle threads?
> > > > > >
> > > > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > > > discussion with Andy [3] and Arnd [4]:
> > > > >
> > > > > Looking at your references I haven't missed it.  You are not deciding
> > > > > anything as of yet to keep it simple.  Except you are returning
> > > > > EOPNOTSUPP.  You are very much intending to do something.
> > > >
> > > > That was clear all along and was pointed at every occassion in the
> > > > threads. I even went through the hazzle to give you all of the
> > > > references when there's lore.kernel.org.
> > > >
> > > > >
> > > > > Decide.  Do you use the flags parameter or is the width of the
> > > > > target depending on the flags.
> > >
> > > Ok, let's try to be constructive. I understand the general concern for
> > > the future so let's put a contract into the commit message stating that
> > > the width of the target aka *what is signaled* will be based on a flag
> > > parameter if we ever extend it:
> > >
> > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> > >
> > > with the current default being
> > >
> > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> > >
> > > This seems to me the cleanest solution as we only use one type of file
> > > descriptor. Can everyone be on board with this? If so I'm going to send
> > > out a new version of the patch.
> > >
> > > Christian
> >
> > I'm on board with this, but I think you need to also clarify what exactly
> > the fd stands for.  I think that (a) userspace should not have to care
> > about the struct pid implementation, and so (b) the procfd should stand
> > for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
> > becomes implemented, then open(/proc/5) will pin all three pids, as will
> > open(/proc/5/task/6).
> 
> This change doesn't "pin" any PID, and it makes no sense to make a
> process FD stand for all its threads. What does that even mean?

Currently the patch relies on the procfd inode saving a copy to the PIDTYPE_PID
pid.  I'm not sure offhand, can it go to the PIDTYPE_PGID from that after the
task has died, or not?   I didn't think so.  If it can then great.

The point is (a) these are details which should not have to bother userspace,
and (b) how to decide who we're sending the signal to (tid/pid/pgid) should
be specified in precisely one way.  So either a flag, or comign from the type
of fd that was opened.

-serge
Daniel Colascione Dec. 7, 2018, 1:39 a.m. UTC | #33
On Thu, Dec 6, 2018 at 4:59 PM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Thu, Dec 06, 2018 at 04:34:54PM -0800, Daniel Colascione wrote:
> > On Thu, Dec 6, 2018 at 4:31 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > >
> > > On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> > > > On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > > > > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > > > > Christian Brauner <christian@brauner.io> writes:
> > > > > >
> > > > > > >> Your intention is to add the thread case to support pthreads once the
> > > > > > >> process case is sorted out.  So this is something that needs to be made
> > > > > > >> clear.  Did I miss how you plan to handle threads?
> > > > > > >
> > > > > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > > > > discussion with Andy [3] and Arnd [4]:
> > > > > >
> > > > > > Looking at your references I haven't missed it.  You are not deciding
> > > > > > anything as of yet to keep it simple.  Except you are returning
> > > > > > EOPNOTSUPP.  You are very much intending to do something.
> > > > >
> > > > > That was clear all along and was pointed at every occassion in the
> > > > > threads. I even went through the hazzle to give you all of the
> > > > > references when there's lore.kernel.org.
> > > > >
> > > > > >
> > > > > > Decide.  Do you use the flags parameter or is the width of the
> > > > > > target depending on the flags.
> > > >
> > > > Ok, let's try to be constructive. I understand the general concern for
> > > > the future so let's put a contract into the commit message stating that
> > > > the width of the target aka *what is signaled* will be based on a flag
> > > > parameter if we ever extend it:
> > > >
> > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> > > >
> > > > with the current default being
> > > >
> > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> > > >
> > > > This seems to me the cleanest solution as we only use one type of file
> > > > descriptor. Can everyone be on board with this? If so I'm going to send
> > > > out a new version of the patch.
> > > >
> > > > Christian
> > >
> > > I'm on board with this, but I think you need to also clarify what exactly
> > > the fd stands for.  I think that (a) userspace should not have to care
> > > about the struct pid implementation, and so (b) the procfd should stand
> > > for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
> > > becomes implemented, then open(/proc/5) will pin all three pids, as will
> > > open(/proc/5/task/6).
> >
> > This change doesn't "pin" any PID, and it makes no sense to make a
> > process FD stand for all its threads. What does that even mean?
>
> Currently the patch relies on the procfd inode saving a copy to the PIDTYPE_PID
> pid.

struct pid doesn't have a type field. The interpretation depends on
the caller's use of the struct pid, and in the current path, that's
PIDTYPE_PID. What, specifically, is wrong with the current approach?

> I'm not sure offhand, can it go to the PIDTYPE_PGID from that after the
> task has died, or not?   I didn't think so.  If it can then great.

You're arguing that something that does, in fact, work, is somehow
broken in some unspecified way. The kill_pid_info lookup works fine.
What, specifically, is wrong with the semantics as implemented?

> The point is (a) these are details which should not have to bother userspace,

These details *don't* bother userspace.

You're raising concerns that are either imaginary or long-since
addressed. Does the patch cause some kind of maintenance burden? No,
it doesn't, not moreso than any other piece of code. Does the
interface have unclear semantics? No, it clearly sends a signal to a
process, just like kill. Does the patch expose kernel implementation
details? No, it doesn't, because the interface is simply not defined
in terms of these details. Do we need to change how signal delivery
works? No, because if it's fine for kill, it's fine for this facility,
and if some future signal cleanup separates the cases more, that
cleanup can change this code as well.

The change is well-documented, simple, extensible, and addresses an
actual problem. Every legitimate technical criticism has now been
addressed. I don't understand where this opposition is coming from,
since the objections refer to nothing that's actually in the patch or
exposed to the user.

> and (b) how to decide who we're sending the signal to (tid/pid/pgid) should
> be specified in precisely one way.  So either a flag, or comign from the type
> of fd that was opened.

You can't send signals to a thread with the current patch. There's no
ambiguity in providing zero ways to do something.
Christian Brauner Dec. 7, 2018, 1:54 a.m. UTC | #34
On Thu, Dec 06, 2018 at 05:39:18PM -0800, Daniel Colascione wrote:
> On Thu, Dec 6, 2018 at 4:59 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> >
> > On Thu, Dec 06, 2018 at 04:34:54PM -0800, Daniel Colascione wrote:
> > > On Thu, Dec 6, 2018 at 4:31 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > > >
> > > > On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> > > > > On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > > > > > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > > > > > Christian Brauner <christian@brauner.io> writes:
> > > > > > >
> > > > > > > >> Your intention is to add the thread case to support pthreads once the
> > > > > > > >> process case is sorted out.  So this is something that needs to be made
> > > > > > > >> clear.  Did I miss how you plan to handle threads?
> > > > > > > >
> > > > > > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > > > > > discussion with Andy [3] and Arnd [4]:
> > > > > > >
> > > > > > > Looking at your references I haven't missed it.  You are not deciding
> > > > > > > anything as of yet to keep it simple.  Except you are returning
> > > > > > > EOPNOTSUPP.  You are very much intending to do something.
> > > > > >
> > > > > > That was clear all along and was pointed at every occassion in the
> > > > > > threads. I even went through the hazzle to give you all of the
> > > > > > references when there's lore.kernel.org.
> > > > > >
> > > > > > >
> > > > > > > Decide.  Do you use the flags parameter or is the width of the
> > > > > > > target depending on the flags.
> > > > >
> > > > > Ok, let's try to be constructive. I understand the general concern for
> > > > > the future so let's put a contract into the commit message stating that
> > > > > the width of the target aka *what is signaled* will be based on a flag
> > > > > parameter if we ever extend it:
> > > > >
> > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> > > > >
> > > > > with the current default being
> > > > >
> > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> > > > >
> > > > > This seems to me the cleanest solution as we only use one type of file
> > > > > descriptor. Can everyone be on board with this? If so I'm going to send
> > > > > out a new version of the patch.
> > > > >
> > > > > Christian
> > > >
> > > > I'm on board with this, but I think you need to also clarify what exactly
> > > > the fd stands for.  I think that (a) userspace should not have to care
> > > > about the struct pid implementation, and so (b) the procfd should stand
> > > > for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
> > > > becomes implemented, then open(/proc/5) will pin all three pids, as will
> > > > open(/proc/5/task/6).
> > >
> > > This change doesn't "pin" any PID, and it makes no sense to make a
> > > process FD stand for all its threads. What does that even mean?
> >
> > Currently the patch relies on the procfd inode saving a copy to the PIDTYPE_PID
> > pid.
> 
> struct pid doesn't have a type field. The interpretation depends on
> the caller's use of the struct pid, and in the current path, that's
> PIDTYPE_PID. What, specifically, is wrong with the current approach?
> 
> > I'm not sure offhand, can it go to the PIDTYPE_PGID from that after the
> > task has died, or not?   I didn't think so.  If it can then great.
> 
> You're arguing that something that does, in fact, work, is somehow
> broken in some unspecified way. The kill_pid_info lookup works fine.
> What, specifically, is wrong with the semantics as implemented?
> 
> > The point is (a) these are details which should not have to bother userspace,
> 
> These details *don't* bother userspace.
> 
> You're raising concerns that are either imaginary or long-since
> addressed. Does the patch cause some kind of maintenance burden? No,
> it doesn't, not moreso than any other piece of code. Does the
> interface have unclear semantics? No, it clearly sends a signal to a
> process, just like kill. Does the patch expose kernel implementation
> details? No, it doesn't, because the interface is simply not defined
> in terms of these details. Do we need to change how signal delivery
> works? No, because if it's fine for kill, it's fine for this facility,
> and if some future signal cleanup separates the cases more, that
> cleanup can change this code as well.
> 
> The change is well-documented, simple, extensible, and addresses an
> actual problem. Every legitimate technical criticism has now been
> addressed. I don't understand where this opposition is coming from,
> since the objections refer to nothing that's actually in the patch or
> exposed to the user.
> 
> > and (b) how to decide who we're sending the signal to (tid/pid/pgid) should
> > be specified in precisely one way.  So either a flag, or comign from the type
> > of fd that was opened.
> 
> You can't send signals to a thread with the current patch. There's no
> ambiguity in providing zero ways to do something.

So Serge's point is not about changing the current patch. What he's
basically saying is: If we are expected to state how we were to extend
this syscall in the future which Serge and I figured is currently Eric's
only remaining objection then:
- flags are a good way to go (I agree)
- there's a concrete way how to do this by stashing the relevent struct
  pids for PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID in file->private_data
  which can then be retrieved in taskfd_send_signal()

There is not intent nor requirement to do this right now. What we have
right now is fine for a start, I agree! But here's how we go forward if
we ever need to! :)

Christian
Serge E. Hallyn Dec. 7, 2018, 4:47 p.m. UTC | #35
On Thu, Dec 06, 2018 at 01:18:58PM +0100, Christian Brauner wrote:
> The kill() syscall operates on process identifiers (pid). After a process
> has exited its pid can be reused by another process. If a caller sends a
> signal to a reused pid it will end up signaling the wrong process. This
> issue has often surfaced and there has been a push [1] to address this
> problem.
> 
> This patch uses file descriptors (fd) from proc/<pid> as stable handles on
> struct pid. Even if a pid is recycled the handle will not change. The fd
> can be used to send signals to the process it refers to.
> Thus, the new syscall taskfd_send_signal() is introduced to solve this
> problem. Instead of pids it operates on process fds (taskfd).
> 
> /* prototype and argument /*
> long taskfd_send_signal(int taskfd, int sig, siginfo_t *info, unsigned int flags);
> 
> In addition to the taskfd and signal argument it takes an additional
> siginfo_t and flags argument. If the siginfo_t argument is NULL then
> taskfd_send_signal() behaves like kill(). If it is not NULL
> taskfd_send_signal() behaves like rt_sigqueueinfo().
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0. Failing to do so will cause EINVAL.
> 
> /* taskfd_send_signal() replaces multiple pid-based syscalls */
> The taskfd_send_signal() syscall currently takes on the job of the
> following syscalls that operate on pids:
> - kill(2)
> - rt_sigqueueinfo(2)
> The syscall is defined in such a way that it can also operate on thread fds
> instead of process fds. In a future patchset I will extend it to operate on
> taskfds from /proc/<pid>/task/<tid> at which point it will additionally
> take on the job of:
> - tgkill(2)
> - rt_tgsigqueueinfo(2)
> Right now this is intentionally left out to keep this patchset as simple as
> possible (cf. [4]). If a taskfd of /proc/<pid>/task/<tid> is passed
> EOPNOTSUPP will be returned to give userspace a way to detect when I add
> support for such taskfds (cf. [10]).
> 
> /* naming */
> The original prefix of the syscall was "procfd_". However, it has been
> pointed out that since this syscall will eventually operate on both
> processes and threads the name should reflect this (cf. [12]). The best
> possible candidate even from a userspace perspective seems to be "task".
> Although "task" is used internally we are alreday deviating from POSIX by
> using file descriptors to processes in the first place so it seems fine to
> use the "taskfd_" prefix.
> 
> The name taskfd_send_signal() was also chosen to reflect the fact that it
> takes on the job of multiple syscalls. It is intentional that the name is
> not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer
> because it might imply that taskfd_send_signal() is only a replacement for
> kill(2) and not the latter because it is a hazzle to remember the correct
> spelling (especially for non-native speakers) and because it is not
> descriptive enough of what the syscall actually does. The name
> "taskfd_send_signal" makes it very clear that its job is to send signals.
> 
> /* O_PATH file descriptors */
> taskfds opened as O_PATH fds cannot be used to send signals to a process
> (cf. [2]). Signaling processes through taskfds is the equivalent of writing
> to a file. Thus, this is not an operation that operates "purely at the
> file descriptor level" as required by the open(2) manpage.
> 
> /* zombies */
> Zombies can be signaled just as any other process. No special error will be
> reported since a zombie state is an unreliable state (cf. [3]).
> 
> /* cross-namespace signals */
> The patch currently enforces that the signaler and signalee either are in
> the same pid namespace or that the signaler's pid namespace is an ancestor
> of the signalee's pid namespace. This is done for the sake of simplicity
> and because it is unclear to what values certain members of struct
> siginfo_t would need to be set to (cf. [5], [6]).
> 
> /* compat syscalls */
> It became clear that we would like to avoid adding compat syscalls (cf.
> [7]). The compat syscall handling is now done in kernel/signal.c itself by
> adding __copy_siginfo_from_user_generic() which lets us avoid compat
> syscalls (cf. [8]). It should be noted that the addition of
> __copy_siginfo_from_user_any() is caused by a bug in the original
> implementation of rt_sigqueueinfo(2) (cf. 12).
> With upcoming rework for syscall handling things might improve
> significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
> any additional callers.
> 
> /* testing */
> This patch was tested on x64 and x86.
> 
> /* userspace usage */
> An asciinema recording for the basic functionality can be found under [9].
> With this patch a process can be killed via:
> 
>  #define _GNU_SOURCE
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <signal.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/stat.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>  #include <unistd.h>
> 
>  static inline int do_taskfd_send_signal(int taskfd, int sig, siginfo_t *info,
>                                          unsigned int flags)
>  {
>  #ifdef __NR_taskfd_send_signal
>          return syscall(__NR_taskfd_send_signal, taskfd, sig, info, flags);
>  #else
>          return -ENOSYS;
>  #endif
>  }
> 
>  int main(int argc, char *argv[])
>  {
>          int fd, ret, saved_errno, sig;
> 
>          if (argc < 3)
>                  exit(EXIT_FAILURE);
> 
>          fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
>          if (fd < 0) {
>                  printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
>                  exit(EXIT_FAILURE);
>          }
> 
>          sig = atoi(argv[2]);
> 
>          printf("Sending signal %d to process %s\n", sig, argv[1]);
>          ret = do_taskfd_send_signal(fd, sig, NULL, 0);
> 
>          saved_errno = errno;
>          close(fd);
>          errno = saved_errno;
> 
>          if (ret < 0) {
>                  printf("%s - Failed to send signal %d to process %s\n",
>                         strerror(errno), sig, argv[1]);
>                  exit(EXIT_FAILURE);
>          }
> 
>          exit(EXIT_SUCCESS);
>  }
> 
> [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
> [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
> [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
> [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
> [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
> [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
> [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
> [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
> [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
> [10]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
> [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
> [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
> 
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> ---
> Changelog:
> v4:
> - updated asciinema to use "taskfd_" prefix
> - s/procfd_send_signal/taskfd_send_signal/g
> - s/proc_is_tgid_procfd/tgid_taskfd_to_pid/b
> - s/proc_is_tid_procfd/tid_taskfd_to_pid/b
> - s/__copy_siginfo_from_user_generic/__copy_siginfo_from_user_any/g
> - make it clear that __copy_siginfo_from_user_any() is a workaround caused
>   by a bug in the original implementation of rt_sigqueueinfo()
> - when spoofing signals turn them into regular kill signals if si_code is
>   set to SI_USER
> - make proc_is_t{g}id_procfd() return struct pid to allow proc_pid() to
>   stay private to fs/proc/
> v3:
> - add __copy_siginfo_from_user_generic() to avoid adding compat syscalls
> - s/procfd_signal/procfd_send_signal/g
> - change type of flags argument from int to unsigned int
> - add comment about what happens to zombies
> - add proc_is_tid_procfd()
> - return EOPNOTSUPP when /proc/<pid>/task/<tid> fd is passed so userspace
>   has a way of knowing that tidfds are not supported currently.
> v2:
> - define __NR_procfd_signal in unistd.h
> - wire up compat syscall
> - s/proc_is_procfd/proc_is_tgid_procfd/g
> - provide stubs when CONFIG_PROC_FS=n
> - move proc_pid() to linux/proc_fs.h header
> - use proc_pid() to grab struct pid from /proc/<pid> fd
> v1:
> - patch introduced
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/proc/base.c                         |  20 +++-
>  include/linux/proc_fs.h                |  12 +++
>  include/linux/syscalls.h               |   3 +
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  kernel/signal.c                        | 132 +++++++++++++++++++++++--
>  7 files changed, 164 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..7efb63fd0617 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
>  384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
>  385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
>  386	i386	rseq			sys_rseq			__ia32_sys_rseq
> +387	i386	taskfd_send_signal	sys_taskfd_send_signal		__ia32_sys_taskfd_send_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..be894f4a84e9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
>  332	common	statx			__x64_sys_statx
>  333	common	io_pgetevents		__x64_sys_io_pgetevents
>  334	common	rseq			__x64_sys_rseq
> +335	common	taskfd_send_signal	__x64_sys_taskfd_send_signal
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ce3465479447..b8b88bfee455 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -716,8 +716,6 @@ static int proc_pid_permission(struct inode *inode, int mask)
>  	return generic_permission(inode, mask);
>  }
>  
> -
> -
>  static const struct inode_operations proc_def_inode_operations = {
>  	.setattr	= proc_setattr,
>  };
> @@ -3038,6 +3036,15 @@ static const struct file_operations proc_tgid_base_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +struct pid *tgid_taskfd_to_pid(const struct file *file)
> +{
> +	if (!d_is_dir(file->f_path.dentry) ||
> +	    (file->f_op != &proc_tgid_base_operations))
> +		return ERR_PTR(-EBADF);
> +
> +	return proc_pid(file_inode(file));
> +}
> +
>  static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
>  {
>  	return proc_pident_lookup(dir, dentry,
> @@ -3422,6 +3429,15 @@ static const struct file_operations proc_tid_base_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +struct pid *tid_taskfd_to_pid(const struct file *file)
> +{
> +	if (!d_is_dir(file->f_path.dentry) ||
> +	    (file->f_op != &proc_tid_base_operations))
> +		return ERR_PTR(-EBADF);
> +
> +	return proc_pid(file_inode(file));
> +}
> +
>  static const struct inode_operations proc_tid_base_inode_operations = {
>  	.lookup		= proc_tid_base_lookup,
>  	.getattr	= pid_getattr,
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index d0e1f1522a78..96817415c420 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
>  						    int (*show)(struct seq_file *, void *),
>  						    proc_write_t write,
>  						    void *data);
> +extern struct pid *tgid_taskfd_to_pid(const struct file *file);
> +extern struct pid *tid_taskfd_to_pid(const struct file *file);
>  
>  #else /* CONFIG_PROC_FS */
>  
> @@ -114,6 +116,16 @@ static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p
>  #define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
>  #define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
>  
> +static inline struct pid *tgid_taskfd_to_pid(const struct file *file)
> +{
> +	return ERR_PTR(-EBADF);
> +}
> +
> +static inline struct pid *tid_taskfd_to_pid(const struct file *file)
> +{
> +	return ERR_PTR(-EBADF);
> +}
> +
>  #endif /* CONFIG_PROC_FS */
>  
>  struct net;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2ac3d13a915b..5ffe194ef29b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -907,6 +907,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
>  			  unsigned mask, struct statx __user *buffer);
>  asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
>  			 int flags, uint32_t sig);
> +asmlinkage long sys_taskfd_send_signal(int taskfd, int sig,
> +				       siginfo_t __user *info,
> +				       unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 538546edbfbd..9343dca63fd9 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -738,9 +738,11 @@ __SYSCALL(__NR_statx,     sys_statx)
>  __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
>  #define __NR_rseq 293
>  __SYSCALL(__NR_rseq, sys_rseq)
> +#define __NR_taskfd_send_signal 294
> +__SYSCALL(__NR_taskfd_send_signal, sys_taskfd_send_signal)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 294
> +#define __NR_syscalls 295
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9a32bc2088c9..a00a4bcb7605 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -19,7 +19,9 @@
>  #include <linux/sched/task.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/sched/cputime.h>
> +#include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/proc_fs.h>
>  #include <linux/tty.h>
>  #include <linux/binfmts.h>
>  #include <linux/coredump.h>
> @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
>  }
>  #endif
>  
> +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
> +{
> +	clear_siginfo(info);
> +	info->si_signo = sig;
> +	info->si_errno = 0;
> +	info->si_code = SI_USER;
> +	info->si_pid = task_tgid_vnr(current);
> +	info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
> +}
> +
>  /**
>   *  sys_kill - send a signal to a process
>   *  @pid: the PID of the process
> @@ -3295,16 +3307,124 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
>  {
>  	struct kernel_siginfo info;
>  
> -	clear_siginfo(&info);
> -	info.si_signo = sig;
> -	info.si_errno = 0;
> -	info.si_code = SI_USER;
> -	info.si_pid = task_tgid_vnr(current);
> -	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> +	prepare_kill_siginfo(sig, &info);
>  
>  	return kill_something_info(sig, &info, pid);
>  }
>  
> +/*
> + * Verify that the signaler and signalee either are in the same pid namespace
> + * or that the signaler's pid namespace is an ancestor of the signalee's pid
> + * namespace.
> + */
> +static bool may_signal_taskfd(struct pid *pid)

may_signal_taskfd seems like a misnomer here.  It leads the reader to think it
is about permission, when it's really about accessibility.

> +{
> +	struct pid_namespace *active = task_active_pid_ns(current);
> +	struct pid_namespace *p = ns_of_pid(pid);
> +
> +	for (;;) {
> +		if (!p)
> +			return false;
> +		if (p == active)
> +			break;
> +		p = p->parent;
> +	}
> +
> +	return true;
> +}
> +
> +static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> +{
> +#ifdef CONFIG_COMPAT
> +	/*
> +	 * Avoid hooking up compat syscalls and instead handle necessary
> +	 * conversions here. Note, this is a stop-gap measure and should not be
> +	 * considered a generic solution.
> +	 */
> +	if (in_compat_syscall())
> +		return copy_siginfo_from_user32(
> +			kinfo, (struct compat_siginfo __user *)info);
> +#endif
> +	return copy_siginfo_from_user(kinfo, info);
> +}
> +
> +/**
> + *  sys_taskfd_send_signal - send a signal to a process through a task file
> + *                           descriptor
> + *  @taskfd: the file descriptor of the process
> + *  @sig:    signal to be sent
> + *  @info:   the signal info
> + *  @flags:  future flags to be passed
> + *
> + *  Return: 0 on success, negative errno on failure
> + */
> +SYSCALL_DEFINE4(taskfd_send_signal, int, taskfd, int, sig,
> +		siginfo_t __user *, info, unsigned int, flags)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	kernel_siginfo_t kinfo;
> +
> +	/* Enforce flags be set to 0 until we add an extension. */
> +	if (flags)
> +		return -EINVAL;
> +
> +	f = fdget_raw(taskfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = tid_taskfd_to_pid(f.file);
> +	if (!IS_ERR(pid)) {
> +		/*
> +		 * Give userspace a way to detect /proc/<pid>/task/<tid>
> +		 * support when we add it.
> +		 */
> +		ret = -EOPNOTSUPP;
> +		goto err;
> +	}
> +
> +	/* Is this a procfd? */
> +	pid = tgid_taskfd_to_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto err;
> +	}
> +
> +	ret = -EINVAL;
> +	if (!may_signal_taskfd(pid))
> +		goto err;
> +
> +	if (info) {
> +		ret = copy_siginfo_from_user_any(&kinfo, info);
> +		if (unlikely(ret))
> +			goto err;
> +
> +		ret = -EINVAL;
> +		if (unlikely(sig != kinfo.si_signo))
> +			goto err;
> +
> +		if ((task_pid(current) != pid) &&
> +		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> +			/* Only allow sending arbitrary signals to yourself. */
> +			ret = -EPERM;
> +			if (kinfo.si_code != SI_USER)
> +				goto err;
> +
> +			/* Turn this into a regular kill signal. */
> +			prepare_kill_siginfo(sig, &kinfo);
> +		}
> +	} else {
> +		prepare_kill_siginfo(sig, &kinfo);
> +	}
> +
> +	ret = kill_pid_info(sig, &kinfo, pid);
> +
> +err:
> +	fdput(f);
> +	return ret;
> +}
> +
>  static int
>  do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
>  {
> -- 
> 2.19.1
Serge E. Hallyn Dec. 7, 2018, 4:49 p.m. UTC | #36
On Fri, Dec 07, 2018 at 02:54:25AM +0100, Christian Brauner wrote:
> On Thu, Dec 06, 2018 at 05:39:18PM -0800, Daniel Colascione wrote:
> > On Thu, Dec 6, 2018 at 4:59 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > >
> > > On Thu, Dec 06, 2018 at 04:34:54PM -0800, Daniel Colascione wrote:
> > > > On Thu, Dec 6, 2018 at 4:31 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > > > >
> > > > > On Fri, Dec 07, 2018 at 12:17:45AM +0100, Christian Brauner wrote:
> > > > > > On Thu, Dec 06, 2018 at 11:39:48PM +0100, Christian Brauner wrote:
> > > > > > > On Thu, Dec 06, 2018 at 03:46:53PM -0600, Eric W. Biederman wrote:
> > > > > > > > Christian Brauner <christian@brauner.io> writes:
> > > > > > > >
> > > > > > > > >> Your intention is to add the thread case to support pthreads once the
> > > > > > > > >> process case is sorted out.  So this is something that needs to be made
> > > > > > > > >> clear.  Did I miss how you plan to handle threads?
> > > > > > > > >
> > > > > > > > > Yeah, maybe you missed it in the commit message [2] which is based on a
> > > > > > > > > discussion with Andy [3] and Arnd [4]:
> > > > > > > >
> > > > > > > > Looking at your references I haven't missed it.  You are not deciding
> > > > > > > > anything as of yet to keep it simple.  Except you are returning
> > > > > > > > EOPNOTSUPP.  You are very much intending to do something.
> > > > > > >
> > > > > > > That was clear all along and was pointed at every occassion in the
> > > > > > > threads. I even went through the hazzle to give you all of the
> > > > > > > references when there's lore.kernel.org.
> > > > > > >
> > > > > > > >
> > > > > > > > Decide.  Do you use the flags parameter or is the width of the
> > > > > > > > target depending on the flags.
> > > > > >
> > > > > > Ok, let's try to be constructive. I understand the general concern for
> > > > > > the future so let's put a contract into the commit message stating that
> > > > > > the width of the target aka *what is signaled* will be based on a flag
> > > > > > parameter if we ever extend it:
> > > > > >
> > > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID);
> > > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_TID);
> > > > > >
> > > > > > with the current default being
> > > > > >
> > > > > > taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PID);
> > > > > >
> > > > > > This seems to me the cleanest solution as we only use one type of file
> > > > > > descriptor. Can everyone be on board with this? If so I'm going to send
> > > > > > out a new version of the patch.
> > > > > >
> > > > > > Christian
> > > > >
> > > > > I'm on board with this, but I think you need to also clarify what exactly
> > > > > the fd stands for.  I think that (a) userspace should not have to care
> > > > > about the struct pid implementation, and so (b) the procfd should stand
> > > > > for all the pids.  So when taskfd_send_signal(fd, SIGSTOP, NULL, TASKFD_PGID)
> > > > > becomes implemented, then open(/proc/5) will pin all three pids, as will
> > > > > open(/proc/5/task/6).
> > > >
> > > > This change doesn't "pin" any PID, and it makes no sense to make a
> > > > process FD stand for all its threads. What does that even mean?
> > >
> > > Currently the patch relies on the procfd inode saving a copy to the PIDTYPE_PID
> > > pid.
> > 
> > struct pid doesn't have a type field. The interpretation depends on
> > the caller's use of the struct pid, and in the current path, that's
> > PIDTYPE_PID. What, specifically, is wrong with the current approach?
> > 
> > > I'm not sure offhand, can it go to the PIDTYPE_PGID from that after the
> > > task has died, or not?   I didn't think so.  If it can then great.
> > 
> > You're arguing that something that does, in fact, work, is somehow
> > broken in some unspecified way. The kill_pid_info lookup works fine.
> > What, specifically, is wrong with the semantics as implemented?
> > 
> > > The point is (a) these are details which should not have to bother userspace,
> > 
> > These details *don't* bother userspace.
> > 
> > You're raising concerns that are either imaginary or long-since
> > addressed. Does the patch cause some kind of maintenance burden? No,
> > it doesn't, not moreso than any other piece of code. Does the
> > interface have unclear semantics? No, it clearly sends a signal to a
> > process, just like kill. Does the patch expose kernel implementation
> > details? No, it doesn't, because the interface is simply not defined
> > in terms of these details. Do we need to change how signal delivery
> > works? No, because if it's fine for kill, it's fine for this facility,
> > and if some future signal cleanup separates the cases more, that
> > cleanup can change this code as well.
> > 
> > The change is well-documented, simple, extensible, and addresses an
> > actual problem. Every legitimate technical criticism has now been
> > addressed. I don't understand where this opposition is coming from,
> > since the objections refer to nothing that's actually in the patch or
> > exposed to the user.
> > 
> > > and (b) how to decide who we're sending the signal to (tid/pid/pgid) should
> > > be specified in precisely one way.  So either a flag, or comign from the type
> > > of fd that was opened.
> > 
> > You can't send signals to a thread with the current patch. There's no
> > ambiguity in providing zero ways to do something.
> 
> So Serge's point is not about changing the current patch. What he's

Right, I'm an ack on the patch.  As is no changes are needed.

> basically saying is: If we are expected to state how we were to extend
> this syscall in the future which Serge and I figured is currently Eric's
> only remaining objection then:
> - flags are a good way to go (I agree)
> - there's a concrete way how to do this by stashing the relevent struct
>   pids for PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID in file->private_data
>   which can then be retrieved in taskfd_send_signal()
> There is not intent nor requirement to do this right now. What we have
> right now is fine for a start, I agree! But here's how we go forward if
> we ever need to! :)
> 
> Christian
kernel test robot Dec. 8, 2018, 9:46 p.m. UTC | #37
Hi Christian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc5]
[cannot apply to next-20181207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christian-Brauner/signal-add-taskfd_send_signal-syscall/20181209-044342
config: ia64-allnoconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 8.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.1.0 make.cross ARCH=ia64 

All warnings (new ones prefixed by >>):

   <stdin>:1185:2: warning: #warning syscall perf_event_open not implemented [-Wcpp]
   <stdin>:1239:2: warning: #warning syscall seccomp not implemented [-Wcpp]
   <stdin>:1317:2: warning: #warning syscall pkey_mprotect not implemented [-Wcpp]
   <stdin>:1320:2: warning: #warning syscall pkey_alloc not implemented [-Wcpp]
   <stdin>:1323:2: warning: #warning syscall pkey_free not implemented [-Wcpp]
   <stdin>:1326:2: warning: #warning syscall statx not implemented [-Wcpp]
   <stdin>:1332:2: warning: #warning syscall io_pgetevents not implemented [-Wcpp]
   <stdin>:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]
>> <stdin>:1338:2: warning: #warning syscall taskfd_send_signal not implemented [-Wcpp]

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
diff mbox series

Patch

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..7efb63fd0617 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@ 
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+387	i386	taskfd_send_signal	sys_taskfd_send_signal		__ia32_sys_taskfd_send_signal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..be894f4a84e9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@ 
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	taskfd_send_signal	__x64_sys_taskfd_send_signal
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ce3465479447..b8b88bfee455 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -716,8 +716,6 @@  static int proc_pid_permission(struct inode *inode, int mask)
 	return generic_permission(inode, mask);
 }
 
-
-
 static const struct inode_operations proc_def_inode_operations = {
 	.setattr	= proc_setattr,
 };
@@ -3038,6 +3036,15 @@  static const struct file_operations proc_tgid_base_operations = {
 	.llseek		= generic_file_llseek,
 };
 
+struct pid *tgid_taskfd_to_pid(const struct file *file)
+{
+	if (!d_is_dir(file->f_path.dentry) ||
+	    (file->f_op != &proc_tgid_base_operations))
+		return ERR_PTR(-EBADF);
+
+	return proc_pid(file_inode(file));
+}
+
 static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
 {
 	return proc_pident_lookup(dir, dentry,
@@ -3422,6 +3429,15 @@  static const struct file_operations proc_tid_base_operations = {
 	.llseek		= generic_file_llseek,
 };
 
+struct pid *tid_taskfd_to_pid(const struct file *file)
+{
+	if (!d_is_dir(file->f_path.dentry) ||
+	    (file->f_op != &proc_tid_base_operations))
+		return ERR_PTR(-EBADF);
+
+	return proc_pid(file_inode(file));
+}
+
 static const struct inode_operations proc_tid_base_inode_operations = {
 	.lookup		= proc_tid_base_lookup,
 	.getattr	= pid_getattr,
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..96817415c420 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@  struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    int (*show)(struct seq_file *, void *),
 						    proc_write_t write,
 						    void *data);
+extern struct pid *tgid_taskfd_to_pid(const struct file *file);
+extern struct pid *tid_taskfd_to_pid(const struct file *file);
 
 #else /* CONFIG_PROC_FS */
 
@@ -114,6 +116,16 @@  static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *p
 #define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
 #define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
 
+static inline struct pid *tgid_taskfd_to_pid(const struct file *file)
+{
+	return ERR_PTR(-EBADF);
+}
+
+static inline struct pid *tid_taskfd_to_pid(const struct file *file)
+{
+	return ERR_PTR(-EBADF);
+}
+
 #endif /* CONFIG_PROC_FS */
 
 struct net;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ac3d13a915b..5ffe194ef29b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,9 @@  asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_taskfd_send_signal(int taskfd, int sig,
+				       siginfo_t __user *info,
+				       unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 538546edbfbd..9343dca63fd9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -738,9 +738,11 @@  __SYSCALL(__NR_statx,     sys_statx)
 __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 #define __NR_rseq 293
 __SYSCALL(__NR_rseq, sys_rseq)
+#define __NR_taskfd_send_signal 294
+__SYSCALL(__NR_taskfd_send_signal, sys_taskfd_send_signal)
 
 #undef __NR_syscalls
-#define __NR_syscalls 294
+#define __NR_syscalls 295
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/signal.c b/kernel/signal.c
index 9a32bc2088c9..a00a4bcb7605 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -19,7 +19,9 @@ 
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
+#include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/proc_fs.h>
 #include <linux/tty.h>
 #include <linux/binfmts.h>
 #include <linux/coredump.h>
@@ -3286,6 +3288,16 @@  COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
 }
 #endif
 
+static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
+{
+	clear_siginfo(info);
+	info->si_signo = sig;
+	info->si_errno = 0;
+	info->si_code = SI_USER;
+	info->si_pid = task_tgid_vnr(current);
+	info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
+}
+
 /**
  *  sys_kill - send a signal to a process
  *  @pid: the PID of the process
@@ -3295,16 +3307,124 @@  SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 {
 	struct kernel_siginfo info;
 
-	clear_siginfo(&info);
-	info.si_signo = sig;
-	info.si_errno = 0;
-	info.si_code = SI_USER;
-	info.si_pid = task_tgid_vnr(current);
-	info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+	prepare_kill_siginfo(sig, &info);
 
 	return kill_something_info(sig, &info, pid);
 }
 
+/*
+ * Verify that the signaler and signalee either are in the same pid namespace
+ * or that the signaler's pid namespace is an ancestor of the signalee's pid
+ * namespace.
+ */
+static bool may_signal_taskfd(struct pid *pid)
+{
+	struct pid_namespace *active = task_active_pid_ns(current);
+	struct pid_namespace *p = ns_of_pid(pid);
+
+	for (;;) {
+		if (!p)
+			return false;
+		if (p == active)
+			break;
+		p = p->parent;
+	}
+
+	return true;
+}
+
+static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
+{
+#ifdef CONFIG_COMPAT
+	/*
+	 * Avoid hooking up compat syscalls and instead handle necessary
+	 * conversions here. Note, this is a stop-gap measure and should not be
+	 * considered a generic solution.
+	 */
+	if (in_compat_syscall())
+		return copy_siginfo_from_user32(
+			kinfo, (struct compat_siginfo __user *)info);
+#endif
+	return copy_siginfo_from_user(kinfo, info);
+}
+
+/**
+ *  sys_taskfd_send_signal - send a signal to a process through a task file
+ *                           descriptor
+ *  @taskfd: the file descriptor of the process
+ *  @sig:    signal to be sent
+ *  @info:   the signal info
+ *  @flags:  future flags to be passed
+ *
+ *  Return: 0 on success, negative errno on failure
+ */
+SYSCALL_DEFINE4(taskfd_send_signal, int, taskfd, int, sig,
+		siginfo_t __user *, info, unsigned int, flags)
+{
+	int ret;
+	struct fd f;
+	struct pid *pid;
+	kernel_siginfo_t kinfo;
+
+	/* Enforce flags be set to 0 until we add an extension. */
+	if (flags)
+		return -EINVAL;
+
+	f = fdget_raw(taskfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = tid_taskfd_to_pid(f.file);
+	if (!IS_ERR(pid)) {
+		/*
+		 * Give userspace a way to detect /proc/<pid>/task/<tid>
+		 * support when we add it.
+		 */
+		ret = -EOPNOTSUPP;
+		goto err;
+	}
+
+	/* Is this a procfd? */
+	pid = tgid_taskfd_to_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto err;
+	}
+
+	ret = -EINVAL;
+	if (!may_signal_taskfd(pid))
+		goto err;
+
+	if (info) {
+		ret = copy_siginfo_from_user_any(&kinfo, info);
+		if (unlikely(ret))
+			goto err;
+
+		ret = -EINVAL;
+		if (unlikely(sig != kinfo.si_signo))
+			goto err;
+
+		if ((task_pid(current) != pid) &&
+		    (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
+			/* Only allow sending arbitrary signals to yourself. */
+			ret = -EPERM;
+			if (kinfo.si_code != SI_USER)
+				goto err;
+
+			/* Turn this into a regular kill signal. */
+			prepare_kill_siginfo(sig, &kinfo);
+		}
+	} else {
+		prepare_kill_siginfo(sig, &kinfo);
+	}
+
+	ret = kill_pid_info(sig, &kinfo, pid);
+
+err:
+	fdput(f);
+	return ret;
+}
+
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
 {