[0/2] acct: don't allow access to internal filesystems

Message ID 20250211-work-acct-v1-0-1c16aecab8b3@kernel.org (mailing list archive)

Message

Christian Brauner Feb. 11, 2025, 5:15 p.m. UTC
In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup.

This can happen, e.g., when pointing acct(2) to /sys/power/resume. At the
point where the write to this file happens the calling task has
already exited and called exit_fs(), but an internal lookup might be
triggered through lookup_bdev(). This may trigger a NULL deref
when accessing current->fs.

This series does two things:

- Reorganize the code so that the final write happens from the
  workqueue but with the caller's credentials. This preserves the
  (strange) permission model and has almost no regression risk.

- Block access to kernel internal filesystems as well as procfs and
  sysfs in the first place.

This API should cease to exist, imho.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      acct: perform last write from workqueue
      acct: block access to kernel internal filesystems

 kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 84 insertions(+), 50 deletions(-)
---
base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
change-id: 20250211-work-acct-a6d8e92a5fe0
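
The series performs its filesystem check inside the kernel, in
kernel/acct.c. As a rough userspace analogue only (not the code from
this series), the sketch below shows how a caller could refuse procfs
and sysfs targets before handing a path to acct(2), by comparing the
filesystem magic reported by statfs(2). The helper names
fs_magic_is_blocked() and acct_target_allowed() are hypothetical. A
userspace check like this is inherently racy and cannot identify
kernel-internal mounts, so it illustrates the idea rather than
replacing the in-kernel check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/vfs.h>      /* statfs(2), struct statfs */
#include <linux/magic.h>  /* PROC_SUPER_MAGIC, SYSFS_MAGIC */

/* Hypothetical helper: true if this filesystem type should be refused. */
static bool fs_magic_is_blocked(long f_type)
{
    return f_type == PROC_SUPER_MAGIC || f_type == SYSFS_MAGIC;
}

/*
 * Hypothetical helper (assumption, not part of the series):
 * returns 1 if path may be used as an accounting file,
 *         0 if it lives on procfs or sysfs,
 *        -1 if statfs(2) failed (errno is set by statfs).
 */
static int acct_target_allowed(const char *path)
{
    struct statfs sfs;

    assert(path != NULL);
    if (statfs(path, &sfs) != 0)
        return -1;
    return fs_magic_is_blocked((long)sfs.f_type) ? 0 : 1;
}
```

On a system with sysfs mounted at /sys, acct_target_allowed() would be
expected to return 0 for /sys/power/resume, the path named in the
report.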

Comments

Jeff Layton Feb. 11, 2025, 6:56 p.m. UTC | #1
On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> In [1] it was reported that the acct(2) system call can be used to
> trigger a NULL deref in cases where it is set to write to a file that
> triggers an internal lookup.
> 
> This can happen, e.g., when pointing acct(2) to /sys/power/resume. At the
> point where the write to this file happens the calling task has
> already exited and called exit_fs(), but an internal lookup might be
> triggered through lookup_bdev(). This may trigger a NULL deref
> when accessing current->fs.
> 
> This series does two things:
> 
> - Reorganize the code so that the final write happens from the
>   workqueue but with the caller's credentials. This preserves the
>   (strange) permission model and has almost no regression risk.
> 
> - Block access to kernel internal filesystems as well as procfs and
>   sysfs in the first place.
> 
> This API should cease to exist, imho.
> 

I wonder who uses it these days, and what would we suggest they replace
it with? Maybe syscall auditing?

config BSD_PROCESS_ACCT
        bool "BSD Process Accounting"
        depends on MULTIUSER
        help
          If you say Y here, a user level program will be able to instruct the
          kernel (via a special system call) to write process accounting
          information to a file: whenever a process exits, information about
          that process will be appended to the file by the kernel.  The
          information includes things such as creation time, owning user,
          command name, memory usage, controlling terminal etc. (the complete
          list is in the struct acct in <file:include/linux/acct.h>).  It is
          up to the user level program to do useful things with this
          information.  This is generally a good idea, so say Y.

Maybe at least time to replace that last sentence and make this default
to 'n'?
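
For readers unfamiliar with what ends up in the accounting file, here
is a minimal sketch of decoding one record in userspace. It assumes
the kernel writes version 3 records (struct acct_v3 as exposed by
<sys/acct.h>); other record versions use a different layout. The
helper names decode_comp_t() and format_record() are hypothetical.
The comp_t fields are a compressed format with a 13-bit mantissa and a
3-bit base-8 exponent.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/acct.h>   /* struct acct_v3, comp_t, ACCT_COMM */

/* Version 3 records are fixed-size, so a file is a plain array of them. */
static_assert(sizeof(struct acct_v3) == 64, "unexpected acct_v3 layout");

/* Expand a comp_t: low 13 bits are the mantissa, high 3 bits a base-8 exponent. */
static uint64_t decode_comp_t(comp_t c)
{
    uint64_t mantissa = c & 0x1fff;
    unsigned int exponent = (c >> 13) & 0x7;

    return mantissa << (3 * exponent);
}

/* Hypothetical helper: render a few fields of one record into buf. */
static void format_record(const struct acct_v3 *r, char *buf, size_t len)
{
    /* Bound the command name by ACCT_COMM in case it is not NUL-terminated. */
    snprintf(buf, len, "%.*s pid=%u uid=%u utime_ticks=%llu",
             ACCT_COMM, r->ac_comm,
             (unsigned int)r->ac_pid, (unsigned int)r->ac_uid,
             (unsigned long long)decode_comp_t(r->ac_utime));
}
```

A reader loop would call fread() with sizeof(struct acct_v3) per
record and pass each one to format_record().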

Christian Brauner Feb. 12, 2025, 11:16 a.m. UTC | #2
On Tue, Feb 11, 2025 at 01:56:41PM -0500, Jeff Layton wrote:
> I wonder who uses it these days, and what would we suggest they replace
> it with? Maybe syscall auditing?

Someone pointed me to atop but that also works without it. Since this is
a privileged api I think the natural candidate to replace all of this is
bpf. I'm pretty sure that it's relatively straightforward to get a lot
more information out of it than with acct(2) and it will probably be
more performant too.

Without any limitations, as it is right now, acct(2) can quite easily
lock up the system when pointed at various things in sysfs, and
I'm sure it can be abused in other ways. So I wouldn't enable it.

> 
> config BSD_PROCESS_ACCT
>         bool "BSD Process Accounting"
>         depends on MULTIUSER
>         help
>           If you say Y here, a user level program will be able to instruct the
>           kernel (via a special system call) to write process accounting
>           information to a file: whenever a process exits, information about
>           that process will be appended to the file by the kernel.  The
>           information includes things such as creation time, owning user,
>           command name, memory usage, controlling terminal etc. (the complete
>           list is in the struct acct in <file:include/linux/acct.h>).  It is
>           up to the user level program to do useful things with this
>           information.  This is generally a good idea, so say Y.
> 
> Maybe at least time to replace that last sentence and make this default
> to 'n'?

I agree.

Christian Brauner Feb. 13, 2025, 2:56 p.m. UTC | #3
On Wed, Feb 12, 2025 at 12:16:44PM +0100, Christian Brauner wrote:
> Someone pointed me to atop but that also works without it. Since this is
> a privileged api I think the natural candidate to replace all of this is
> bpf. I'm pretty sure that it's relatively straightforward to get a lot
> more information out of it than with acct(2) and it will probably be
> more performant too.
> 
> Without any limitations, as it is right now, acct(2) can quite easily
> lock up the system when pointed at various things in sysfs, and
> I'm sure it can be abused in other ways. So I wouldn't enable it.

And I totally forgot about taskstats via Netlink:
https://www.kernel.org/doc/Documentation/accounting/taskstats.txt
include/uapi/linux/taskstats.h