mbox series

[v2,0/7] Remove in-tree usage of MAP_DENYWRITE

Message ID 20210816194840.42769-1-david@redhat.com (mailing list archive)
Headers show
Series Remove in-tree usage of MAP_DENYWRITE | expand

Message

David Hildenbrand Aug. 16, 2021, 7:48 p.m. UTC
This series removes all in-tree usage of MAP_DENYWRITE from the kernel
and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
user space applications a while ago because of the chance for DoS.
The last renaming user is binfmt binary loading during exec and
legacy library loading via uselib().

With this change, MAP_DENYWRITE is effectively ignored throughout the
kernel. Although the net change is small (well, we actually add code and
comments), I think the cleanup in mmap() is quite nice.

There are some (minor) user-visible changes with this series:
1. We no longer deny write access to shared libaries loaded via legacy
   uselib(); this behavior matches modern user space e.g., via dlopen().
2. We no longer deny write access to the elf interpreter after exec
   completed, treating it just like shared libraries (which it often is).
3. We always deny write access to the file linked via /proc/pid/exe:
   sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file
   cannot be denied, and write access to the file will remain denied
   until the link is effectivel gone (exec, termination,
   sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

There is a related problem [2] with overlayfs, that should at least partly
be tackled by this series. I don't quite understand the interaction of
overlayfs and deny_write_access()/allow_write_access() at exec time:

If we end up denying write access to the wrong file and not to the
realfile, that would be fundamentally broken. We would have to reroute
our deny_write_access()/ allow_write_access() calls for the exec file to
the realfile -- but I leave figuring out the details to overlayfs guys, as
that would be a related but different issue.

There was a lengthy discussion in [3] whether to remove deny_write_access()
completely; however, if we decide to go that way, it would ideally be done
on top, because it could be that some applications even rely on the current
behavior.

v1 -> v2:
- "kernel/fork: factor out replacing the current MM exe_file"
-- Call the function "replace_mm_exe_file()" instead
-- Add some doc, similar to set_mm_exe_file()
-- Update patch subject/description
- "kernel/fork: always deny write access to current MM exe_file"
-- Introduce dup_mm_exe_file()
-- Make set_mm_exe_file() return an error to make the code easier to
   grasp.
-- Improve comments
- Added ACKs
- Mention "sys_prctl(PR_SET_MM_MAP/EXE_FILE)" everywhere instead of
  only "sys_prctl(PR_SET_MM_EXE_FILE)".

RFC -> v1:
- "binfmt: remove in-tree usage of MAP_DENYWRITE"
-- Add a note that this should fix part of a problem with overlayfs

[1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
[2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
[3] https://lkml.kernel.org/r/20210812084348.6521-1-david@redhat.com

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Cc: Thomas Cedeno <thomascedeno@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chengguang Xu <cgxu519@mykernel.net>
Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

David Hildenbrand (7):
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via
    uselib()
  kernel/fork: factor out replacing the current MM exe_file
  kernel/fork: always deny write access to current MM exe_file
  binfmt: remove in-tree usage of MAP_DENYWRITE
  mm: remove VM_DENYWRITE
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  fs: update documentation of get_write_access() and friends

 arch/x86/ia32/ia32_aout.c      |  8 ++-
 fs/binfmt_aout.c               |  7 ++-
 fs/binfmt_elf.c                |  6 +--
 fs/binfmt_elf_fdpic.c          |  2 +-
 fs/exec.c                      |  4 +-
 fs/proc/task_mmu.c             |  1 -
 include/linux/fs.h             | 19 ++++---
 include/linux/mm.h             |  4 +-
 include/linux/mman.h           |  4 +-
 include/trace/events/mmflags.h |  1 -
 kernel/events/core.c           |  2 -
 kernel/fork.c                  | 95 ++++++++++++++++++++++++++++++----
 kernel/sys.c                   | 33 +-----------
 lib/test_printf.c              |  5 +-
 mm/mmap.c                      | 29 ++---------
 mm/nommu.c                     |  2 -
 16 files changed, 119 insertions(+), 103 deletions(-)


base-commit: 7c60610d476766e128cc4284bb6349732cbd6606

Comments

Christian König Aug. 17, 2021, 11:01 a.m. UTC | #1
Am 16.08.21 um 21:48 schrieb David Hildenbrand:
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last renaming user is binfmt binary loading during exec and
> legacy library loading via uselib().
>
> With this change, MAP_DENYWRITE is effectively ignored throughout the
> kernel. Although the net change is small (well, we actually add code and
> comments), I think the cleanup in mmap() is quite nice.
>
> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libaries loaded via legacy
>     uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
>     completed, treating it just like shared libraries (which it often is).
> 3. We always deny write access to the file linked via /proc/pid/exe:
>     sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file
>     cannot be denied, and write access to the file will remain denied
>     until the link is effectivel gone (exec, termination,
>     sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
>
> There is a related problem [2] with overlayfs, that should at least partly
> be tackled by this series. I don't quite understand the interaction of
> overlayfs and deny_write_access()/allow_write_access() at exec time:
>
> If we end up denying write access to the wrong file and not to the
> realfile, that would be fundamentally broken. We would have to reroute
> our deny_write_access()/ allow_write_access() calls for the exec file to
> the realfile -- but I leave figuring out the details to overlayfs guys, as
> that would be a related but different issue.
>
> There was a lengthy discussion in [3] whether to remove deny_write_access()
> completely; however, if we decide to go that way, it would ideally be done
> on top, because it could be that some applications even rely on the current
> behavior.
>
> v1 -> v2:
> - "kernel/fork: factor out replacing the current MM exe_file"
> -- Call the function "replace_mm_exe_file()" instead
> -- Add some doc, similar to set_mm_exe_file()
> -- Update patch subject/description
> - "kernel/fork: always deny write access to current MM exe_file"
> -- Introduce dup_mm_exe_file()
> -- Make set_mm_exe_file() return an error to make the code easier to
>     grasp.
> -- Improve comments
> - Added ACKs
> - Mention "sys_prctl(PR_SET_MM_MAP/EXE_FILE)" everywhere instead of
>    only "sys_prctl(PR_SET_MM_EXE_FILE)".
>
> RFC -> v1:
> - "binfmt: remove in-tree usage of MAP_DENYWRITE"
> -- Add a note that this should fix part of a problem with overlayfs

This is unfortunately way beyond my understanding of the fs layer to 
actually review this.

But from the ten mile high view it looks like a really nice cleanup to 
me and makes my live in the DRM subsystem much easier.

Feel free to add a Acked-by: Christian König <christian.koenig@amd.com> 
to the series.

Thanks,
Christian.

>
> [1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
> [2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
> [3] https://lkml.kernel.org/r/20210812084348.6521-1-david@redhat.com
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Greg Ungerer <gerg@linux-m68k.org>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Cc: Chinwen Chang <chinwen.chang@mediatek.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Shawn Anastasio <shawn@anastas.io>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Marco Elver <elver@google.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Cc: Thomas Cedeno <thomascedeno@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Chengguang Xu <cgxu519@mykernel.net>
> Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: linux-unionfs@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Cc: x86@kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
>
> David Hildenbrand (7):
>    binfmt: don't use MAP_DENYWRITE when loading shared libraries via
>      uselib()
>    kernel/fork: factor out replacing the current MM exe_file
>    kernel/fork: always deny write access to current MM exe_file
>    binfmt: remove in-tree usage of MAP_DENYWRITE
>    mm: remove VM_DENYWRITE
>    mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
>    fs: update documentation of get_write_access() and friends
>
>   arch/x86/ia32/ia32_aout.c      |  8 ++-
>   fs/binfmt_aout.c               |  7 ++-
>   fs/binfmt_elf.c                |  6 +--
>   fs/binfmt_elf_fdpic.c          |  2 +-
>   fs/exec.c                      |  4 +-
>   fs/proc/task_mmu.c             |  1 -
>   include/linux/fs.h             | 19 ++++---
>   include/linux/mm.h             |  4 +-
>   include/linux/mman.h           |  4 +-
>   include/trace/events/mmflags.h |  1 -
>   kernel/events/core.c           |  2 -
>   kernel/fork.c                  | 95 ++++++++++++++++++++++++++++++----
>   kernel/sys.c                   | 33 +-----------
>   lib/test_printf.c              |  5 +-
>   mm/mmap.c                      | 29 ++---------
>   mm/nommu.c                     |  2 -
>   16 files changed, 119 insertions(+), 103 deletions(-)
>
>
> base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
David Hildenbrand Sept. 3, 2021, 9:45 a.m. UTC | #2
On 16.08.21 21:48, David Hildenbrand wrote:
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last renaming user is binfmt binary loading during exec and
> legacy library loading via uselib().
> 

So, how do we want to continue with this? Pick it up for v5.15? Have it 
in -next for a while and eventually pick it up for v5.16?

I think the "remove ETXTBSY completely" and "remove the sanity mapping 
check" thingies should be done on top.
Linus Torvalds Sept. 3, 2021, 4:26 p.m. UTC | #3
On Fri, Sep 3, 2021 at 2:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> So, how do we want to continue with this? Pick it up for v5.15? Have it
> in -next for a while and eventually pick it up for v5.16?

I'm ok with the series. If you have a git tree, do the normal pull
request, and we can do it for 5.15 and see if anybody notices.

As you say, any final removal of ETXTBSY should be a separate and
later patch on top.

            Linus