mbox series

[00/11] Introduce kernel_clone(), kill _do_fork()

Message ID 20200818173411.404104-1-christian.brauner@ubuntu.com
Headers show
Series Introduce kernel_clone(), kill _do_fork() | expand

Message

Christian Brauner Aug. 18, 2020, 5:34 p.m. UTC
Hey everyone,

This is a follow-up to the do_fork() cleanup from last cycle based on a
short discussion this was merged.
Last cycle we removed copy_thread_tls() and the associated Kconfig
option for each architecture. Now we are only left with copy_thread().
Part of this work was removing the old do_fork() legacy clone()-style
calling convention in favor of the new struct kernel_clone args calling
convention.
The only remaining function callable outside of kernel/fork.c is
_do_fork(). It doesn't really follow the naming of kernel-internal
syscall helpers as Christoph righly pointed out. Switch all callers and
references to kernel_clone() and remove _do_fork() once and for all.

For all architectures I have done a full git rebase v5.9-rc1 -x "make
-j31". There were no built failures and the changes were fairly
mechanical.

The only helpers we have left now are kernel_thread() and kernel_clone()
where kernel_thread() just calls kernel_clone().

Thanks!
Christian

Christian Brauner (11):
  fork: introduce kernel_clone()
  h8300: switch to kernel_clone()
  ia64: switch to kernel_clone()
  m68k: switch to kernel_clone()
  nios2: switch to kernel_clone()
  sparc: switch to kernel_clone()
  x86: switch to kernel_clone()
  kprobes: switch to kernel_clone()
  kgdbts: switch to kernel_clone()
  tracing: switch to kernel_clone()
  sched: remove _do_fork()

 Documentation/trace/histogram.rst             |  4 +-
 arch/h8300/kernel/process.c                   |  2 +-
 arch/ia64/kernel/process.c                    |  4 +-
 arch/m68k/kernel/process.c                    | 10 ++--
 arch/nios2/kernel/process.c                   |  2 +-
 arch/sparc/kernel/process.c                   |  6 +--
 arch/x86/kernel/sys_ia32.c                    |  2 +-
 drivers/misc/kgdbts.c                         | 48 +++++++++----------
 include/linux/sched/task.h                    |  2 +-
 kernel/fork.c                                 | 14 +++---
 samples/kprobes/kprobe_example.c              |  6 +--
 samples/kprobes/kretprobe_example.c           |  4 +-
 .../test.d/dynevent/add_remove_kprobe.tc      |  2 +-
 .../test.d/dynevent/clear_select_events.tc    |  2 +-
 .../test.d/dynevent/generic_clear_event.tc    |  2 +-
 .../test.d/ftrace/func-filter-stacktrace.tc   |  4 +-
 .../ftrace/test.d/kprobe/add_and_remove.tc    |  2 +-
 .../ftrace/test.d/kprobe/busy_check.tc        |  2 +-
 .../ftrace/test.d/kprobe/kprobe_args.tc       |  4 +-
 .../ftrace/test.d/kprobe/kprobe_args_comm.tc  |  2 +-
 .../test.d/kprobe/kprobe_args_string.tc       |  4 +-
 .../test.d/kprobe/kprobe_args_symbol.tc       | 10 ++--
 .../ftrace/test.d/kprobe/kprobe_args_type.tc  |  2 +-
 .../ftrace/test.d/kprobe/kprobe_ftrace.tc     | 14 +++---
 .../ftrace/test.d/kprobe/kprobe_multiprobe.tc |  2 +-
 .../test.d/kprobe/kprobe_syntax_errors.tc     | 12 ++---
 .../ftrace/test.d/kprobe/kretprobe_args.tc    |  4 +-
 .../selftests/ftrace/test.d/kprobe/profile.tc |  2 +-
 28 files changed, 87 insertions(+), 87 deletions(-)


base-commit: 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5

Comments

Matthew Wilcox Aug. 18, 2020, 5:44 p.m. UTC | #1
On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> The only remaining function callable outside of kernel/fork.c is
> _do_fork(). It doesn't really follow the naming of kernel-internal
> syscall helpers as Christoph righly pointed out. Switch all callers and
> references to kernel_clone() and remove _do_fork() once and for all.

My only concern is around return type.  long, int, pid_t ... can we
choose one and stick to it?  pid_t is probably the right return type
within the kernel, despite the return type of clone3().  It'll save us
some work if we ever go through the hassle of growing pid_t beyond 31-bit.
Christian Brauner Aug. 18, 2020, 5:57 p.m. UTC | #2
On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> > The only remaining function callable outside of kernel/fork.c is
> > _do_fork(). It doesn't really follow the naming of kernel-internal
> > syscall helpers as Christoph righly pointed out. Switch all callers and
> > references to kernel_clone() and remove _do_fork() once and for all.
> 
> My only concern is around return type.  long, int, pid_t ... can we
> choose one and stick to it?  pid_t is probably the right return type
> within the kernel, despite the return type of clone3().  It'll save us
> some work if we ever go through the hassle of growing pid_t beyond 31-bit.

It should be safe to switch kernel_clone() to return pid_t. (Afair, the
syscall wrappers all have "long" as return type. (I think Linus provided
some more details on that in another mail. Also see
include/linux/syscalls.h. So the return type for clone3() is really
somewhat a userspace thing, I think.)

I wonder whether I should take the opportunity and switch the advertised
flag arguments for the legacy clone() syscalls and kernel_thread() from
unsigned long to unsigned int so we can get rid of the lower
		.flags		= (lower_32_bits(clone_flags) & ~CSIGNAL),
calls I added to fix sign extension issues glibc ran into...

Christian
Peter Zijlstra Aug. 19, 2020, 7:43 a.m. UTC | #3
On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> > The only remaining function callable outside of kernel/fork.c is
> > _do_fork(). It doesn't really follow the naming of kernel-internal
> > syscall helpers as Christoph righly pointed out. Switch all callers and
> > references to kernel_clone() and remove _do_fork() once and for all.
> 
> My only concern is around return type.  long, int, pid_t ... can we
> choose one and stick to it?  pid_t is probably the right return type
> within the kernel, despite the return type of clone3().  It'll save us
> some work if we ever go through the hassle of growing pid_t beyond 31-bit.

We have at least the futex ABI restricting PID space to 30 bits.
Christian Brauner Aug. 19, 2020, 8:45 a.m. UTC | #4
On Wed, Aug 19, 2020 at 09:43:40AM +0200, peterz@infradead.org wrote:
> On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
> > On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> > > The only remaining function callable outside of kernel/fork.c is
> > > _do_fork(). It doesn't really follow the naming of kernel-internal
> > > syscall helpers as Christoph righly pointed out. Switch all callers and
> > > references to kernel_clone() and remove _do_fork() once and for all.
> > 
> > My only concern is around return type.  long, int, pid_t ... can we
> > choose one and stick to it?  pid_t is probably the right return type
> > within the kernel, despite the return type of clone3().  It'll save us
> > some work if we ever go through the hassle of growing pid_t beyond 31-bit.
> 
> We have at least the futex ABI restricting PID space to 30 bits.

Ok, looking into kernel/futex.c I see 

pid_t pid = uval & FUTEX_TID_MASK;

which is probably what this referes to and /proc/sys/kernel/threads-max
is restricted to FUTEX_TID_MASK.

Afaict, that doesn't block switching kernel_clone() to return pid_t. It
can't create anything > FUTEX_TID_MASK anyway without yelling EAGAIN at
userspace. But it means that _if_ we were to change the size of pid_t
we'd likely need a new futex API. 

Christian
Matthew Wilcox Aug. 19, 2020, 11:18 a.m. UTC | #5
On Wed, Aug 19, 2020 at 10:45:56AM +0200, Christian Brauner wrote:
> On Wed, Aug 19, 2020 at 09:43:40AM +0200, peterz@infradead.org wrote:
> > On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
> > > On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> > > > The only remaining function callable outside of kernel/fork.c is
> > > > _do_fork(). It doesn't really follow the naming of kernel-internal
> > > > syscall helpers as Christoph righly pointed out. Switch all callers and
> > > > references to kernel_clone() and remove _do_fork() once and for all.
> > > 
> > > My only concern is around return type.  long, int, pid_t ... can we
> > > choose one and stick to it?  pid_t is probably the right return type
> > > within the kernel, despite the return type of clone3().  It'll save us
> > > some work if we ever go through the hassle of growing pid_t beyond 31-bit.
> > 
> > We have at least the futex ABI restricting PID space to 30 bits.
> 
> Ok, looking into kernel/futex.c I see 
> 
> pid_t pid = uval & FUTEX_TID_MASK;
> 
> which is probably what this referes to and /proc/sys/kernel/threads-max
> is restricted to FUTEX_TID_MASK.
> 
> Afaict, that doesn't block switching kernel_clone() to return pid_t. It
> can't create anything > FUTEX_TID_MASK anyway without yelling EAGAIN at
> userspace. But it means that _if_ we were to change the size of pid_t
> we'd likely need a new futex API. 

Yes, there would be a lot of work to do to increase the size of pid_t.
I'd just like to not do anything to make that harder _now_.  Stick to
using pid_t within the kernel.
Eric W. Biederman Aug. 19, 2020, 1:32 p.m. UTC | #6
Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Aug 19, 2020 at 10:45:56AM +0200, Christian Brauner wrote:
>> On Wed, Aug 19, 2020 at 09:43:40AM +0200, peterz@infradead.org wrote:
>> > On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
>> > > On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
>> > > > The only remaining function callable outside of kernel/fork.c is
>> > > > _do_fork(). It doesn't really follow the naming of kernel-internal
>> > > > syscall helpers as Christoph righly pointed out. Switch all callers and
>> > > > references to kernel_clone() and remove _do_fork() once and for all.
>> > > 
>> > > My only concern is around return type.  long, int, pid_t ... can we
>> > > choose one and stick to it?  pid_t is probably the right return type
>> > > within the kernel, despite the return type of clone3().  It'll save us
>> > > some work if we ever go through the hassle of growing pid_t beyond 31-bit.
>> > 
>> > We have at least the futex ABI restricting PID space to 30 bits.
>> 
>> Ok, looking into kernel/futex.c I see 
>> 
>> pid_t pid = uval & FUTEX_TID_MASK;
>> 
>> which is probably what this referes to and /proc/sys/kernel/threads-max
>> is restricted to FUTEX_TID_MASK.
>> 
>> Afaict, that doesn't block switching kernel_clone() to return pid_t. It
>> can't create anything > FUTEX_TID_MASK anyway without yelling EAGAIN at
>> userspace. But it means that _if_ we were to change the size of pid_t
>> we'd likely need a new futex API. 
>
> Yes, there would be a lot of work to do to increase the size of pid_t.
> I'd just like to not do anything to make that harder _now_.  Stick to
> using pid_t within the kernel.

Just so people are aware.  If you look in include/linux/threads.h you
can see that the maximum value of PID_MAX_LIMIT limits pids to 22 bits.

Further the design decisions of pids keeps us densly using pids.  So I
expect it will be a while before we even come close to using 30 bits of
pid space.

At the same time I do agree that it makes sense to use a consistent type
in the kernel to make it easier to read and update the code.

Eric
Christian Brauner Aug. 19, 2020, 1:46 p.m. UTC | #7
On Wed, Aug 19, 2020 at 08:32:59AM -0500, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Wed, Aug 19, 2020 at 10:45:56AM +0200, Christian Brauner wrote:
> >> On Wed, Aug 19, 2020 at 09:43:40AM +0200, peterz@infradead.org wrote:
> >> > On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
> >> > > On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
> >> > > > The only remaining function callable outside of kernel/fork.c is
> >> > > > _do_fork(). It doesn't really follow the naming of kernel-internal
> >> > > > syscall helpers as Christoph righly pointed out. Switch all callers and
> >> > > > references to kernel_clone() and remove _do_fork() once and for all.
> >> > > 
> >> > > My only concern is around return type.  long, int, pid_t ... can we
> >> > > choose one and stick to it?  pid_t is probably the right return type
> >> > > within the kernel, despite the return type of clone3().  It'll save us
> >> > > some work if we ever go through the hassle of growing pid_t beyond 31-bit.
> >> > 
> >> > We have at least the futex ABI restricting PID space to 30 bits.
> >> 
> >> Ok, looking into kernel/futex.c I see 
> >> 
> >> pid_t pid = uval & FUTEX_TID_MASK;
> >> 
> >> which is probably what this referes to and /proc/sys/kernel/threads-max
> >> is restricted to FUTEX_TID_MASK.
> >> 
> >> Afaict, that doesn't block switching kernel_clone() to return pid_t. It
> >> can't create anything > FUTEX_TID_MASK anyway without yelling EAGAIN at
> >> userspace. But it means that _if_ we were to change the size of pid_t
> >> we'd likely need a new futex API. 
> >
> > Yes, there would be a lot of work to do to increase the size of pid_t.
> > I'd just like to not do anything to make that harder _now_.  Stick to
> > using pid_t within the kernel.
> 
> Just so people are aware.  If you look in include/linux/threads.h you
> can see that the maximum value of PID_MAX_LIMIT limits pids to 22 bits.
> 
> Further the design decisions of pids keeps us densly using pids.  So I
> expect it will be a while before we even come close to using 30 bits of
> pid space.

Also because it's simply annoying to have to type really large pid
numbers on the shell. Yes yes, that's a very privileged
developer-centric complaint but it matters when you have to do a quick
kill -9. Chromebook users obviously won't care about how large their
pids are for sure.

Tbf, related to discussions last year, systemd now actually raises the
default limit from ~33000 to 4194304. Which seems like an ok compromise.

Christian
Eric W. Biederman Aug. 19, 2020, 3:01 p.m. UTC | #8
Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Wed, Aug 19, 2020 at 08:32:59AM -0500, Eric W. Biederman wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> 
>> > On Wed, Aug 19, 2020 at 10:45:56AM +0200, Christian Brauner wrote:
>> >> On Wed, Aug 19, 2020 at 09:43:40AM +0200, peterz@infradead.org wrote:
>> >> > On Tue, Aug 18, 2020 at 06:44:47PM +0100, Matthew Wilcox wrote:
>> >> > > On Tue, Aug 18, 2020 at 07:34:00PM +0200, Christian Brauner wrote:
>> >> > > > The only remaining function callable outside of kernel/fork.c is
>> >> > > > _do_fork(). It doesn't really follow the naming of kernel-internal
>> >> > > > syscall helpers as Christoph righly pointed out. Switch all callers and
>> >> > > > references to kernel_clone() and remove _do_fork() once and for all.
>> >> > > 
>> >> > > My only concern is around return type.  long, int, pid_t ... can we
>> >> > > choose one and stick to it?  pid_t is probably the right return type
>> >> > > within the kernel, despite the return type of clone3().  It'll save us
>> >> > > some work if we ever go through the hassle of growing pid_t beyond 31-bit.
>> >> > 
>> >> > We have at least the futex ABI restricting PID space to 30 bits.
>> >> 
>> >> Ok, looking into kernel/futex.c I see 
>> >> 
>> >> pid_t pid = uval & FUTEX_TID_MASK;
>> >> 
>> >> which is probably what this referes to and /proc/sys/kernel/threads-max
>> >> is restricted to FUTEX_TID_MASK.
>> >> 
>> >> Afaict, that doesn't block switching kernel_clone() to return pid_t. It
>> >> can't create anything > FUTEX_TID_MASK anyway without yelling EAGAIN at
>> >> userspace. But it means that _if_ we were to change the size of pid_t
>> >> we'd likely need a new futex API. 
>> >
>> > Yes, there would be a lot of work to do to increase the size of pid_t.
>> > I'd just like to not do anything to make that harder _now_.  Stick to
>> > using pid_t within the kernel.
>> 
>> Just so people are aware.  If you look in include/linux/threads.h you
>> can see that the maximum value of PID_MAX_LIMIT limits pids to 22 bits.
>> 
>> Further the design decisions of pids keeps us densly using pids.  So I
>> expect it will be a while before we even come close to using 30 bits of
>> pid space.
>
> Also because it's simply annoying to have to type really large pid
> numbers on the shell. Yes yes, that's a very privileged
> developer-centric complaint but it matters when you have to do a quick
> kill -9. Chromebook users obviously won't care about how large their
> pids are for sure.

Actually that is one of the reasons (possibly the primary reason) that
we have chosen to keep pid numbers dense.

There may be fewer users of unix shells then their used to be, and we
may now have pidfds.  But until people stop using pids in shells it is a
very valid reason to keep them densly packed.

> Tbf, related to discussions last year, systemd now actually raises the
> default limit from ~33000 to 4194304. Which seems like an ok compromise.

Intereseting.  I had not heard of that.  That seems a strange choice
for systemd rather than a system administrator to make.  Of course any
design decision that requires manual intervention to get large systems
to work is probably a bad one. 

Eric
David Laight Aug. 19, 2020, 3:41 p.m. UTC | #9
From: Eric W. Biederman
> Sent: 19 August 2020 16:01
...
> >> Further the design decisions of pids keeps us densly using pids.  So I
> >> expect it will be a while before we even come close to using 30 bits of
> >> pid space.
> >
> > Also because it's simply annoying to have to type really large pid
> > numbers on the shell. Yes yes, that's a very privileged
> > developer-centric complaint but it matters when you have to do a quick
> > kill -9. Chromebook users obviously won't care about how large their
> > pids are for sure.
> 
> Actually that is one of the reasons (possibly the primary reason) that
> we have chosen to keep pid numbers dense.

It also helps keep the ps output under 80 cols.

> There may be fewer users of unix shells then their used to be, and we
> may now have pidfds.  But until people stop using pids in shells it is a
> very valid reason to keep them densly packed.

ISTM that the upper limit should be increased automatically
when the number of allocated pids gets large enough that they
are likely to run out (or get reused very quickly).

Does linux have an O(1) (or do I mean o(1)) pid allocator?
Or does it have to do a linear scan to find a gap??

I made the NetBSD pid allocator/lookup use pid_array[pid & mask]
then check the high bits matched (incremented on allocate).
With a FIFO free list through the unused entries.
Fairly easy to double the size and 'unzip' when getting full.
And then allocate extra high bits to keep plenty of free
values in circulation.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox Aug. 19, 2020, 3:45 p.m. UTC | #10
On Wed, Aug 19, 2020 at 03:41:48PM +0000, David Laight wrote:
> Does linux have an O(1) (or do I mean o(1)) pid allocator?
> Or does it have to do a linear scan to find a gap??

O(log(n)).  It uses the IDR allocator, so 'n' in this case is the
number of PIDs currently allocated, and it's log_64 rather than log_2
(which makes no difference to O() but does make a bit of a difference
to performance)
David Laight Aug. 19, 2020, 3:55 p.m. UTC | #11
From: Matthew Wilcox
> Sent: 19 August 2020 16:45
> 
> On Wed, Aug 19, 2020 at 03:41:48PM +0000, David Laight wrote:
> > Does linux have an O(1) (or do I mean o(1)) pid allocator?
> > Or does it have to do a linear scan to find a gap??
> 
> O(log(n)).  It uses the IDR allocator, so 'n' in this case is the
> number of PIDs currently allocated, and it's log_64 rather than log_2
> (which makes no difference to O() but does make a bit of a difference
> to performance)

Still worse that O(1) - when that is just replacing a variable
with a value read out of an array.
Made pid lookup a trivial O(1) as well.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox Aug. 19, 2020, 4:24 p.m. UTC | #12
On Wed, Aug 19, 2020 at 03:55:47PM +0000, David Laight wrote:
> From: Matthew Wilcox
> > Sent: 19 August 2020 16:45
> > 
> > On Wed, Aug 19, 2020 at 03:41:48PM +0000, David Laight wrote:
> > > Does linux have an O(1) (or do I mean o(1)) pid allocator?
> > > Or does it have to do a linear scan to find a gap??
> > 
> > O(log(n)).  It uses the IDR allocator, so 'n' in this case is the
> > number of PIDs currently allocated, and it's log_64 rather than log_2
> > (which makes no difference to O() but does make a bit of a difference
> > to performance)
> 
> Still worse that O(1) - when that is just replacing a variable
> with a value read out of an array.
> Made pid lookup a trivial O(1) as well.

You'd be surprised.  We replaced the custom PID allocator with the
generic IDR allocator a few years ago and got a pretty decent speedup.

If you think you can do better, then submit patches.  You have to support
all the existing use cases, of course.