[v1,0/7] Remove in-tree usage of MAP_DENYWRITE

Message ID	20210812084348.6521-1-david@redhat.com (mailing list archive)
Headers	show Return-Path: <SRS0=vrBG=ND=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 29A2A61059 From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Cc: David Hildenbrand <david@redhat.com>, Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Alexey Dobriyan <adobriyan@gmail.com>, Steven Rostedt <rostedt@goodmis.org>, Peter Zijlstra <peterz@infradead.org>, Arnaldo Carvalho de Melo <acme@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>, Petr Mladek <pmladek@suse.com>, Sergey Senozhatsky <sergey.senozhatsky@gmail.com>, Andy Shevchenko <andriy.shevchenko@linux.intel.com>, Rasmus Villemoes <linux@rasmusvillemoes.dk>, Kees Cook <keescook@chromium.org>, "Eric W. Biederman" <ebiederm@xmission.com>, Greg Ungerer <gerg@linux-m68k.org>, Geert Uytterhoeven <geert@linux-m68k.org>, Mike Rapoport <rppt@kernel.org>, Vlastimil Babka <vbabka@suse.cz>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Chinwen Chang <chinwen.chang@mediatek.com>, Michel Lespinasse <walken@google.com>, Catalin Marinas <catalin.marinas@arm.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Huang Ying <ying.huang@intel.com>, Jann Horn <jannh@google.com>, Feng Tang <feng.tang@intel.com>, Kevin Brodsky <Kevin.Brodsky@arm.com>, Michael Ellerman <mpe@ellerman.id.au>, Shawn Anastasio <shawn@anastas.io>, Steven Price <steven.price@arm.com>, Nicholas Piggin <npiggin@gmail.com>, Christian Brauner <christian.brauner@ubuntu.com>, Jens Axboe <axboe@kernel.dk>, Gabriel Krisman Bertazi <krisman@collabora.com>, Peter Xu <peterx@redhat.com>, Suren Baghdasaryan <surenb@google.com>, Shakeel Butt <shakeelb@google.com>, Marco Elver <elver@google.com>, Daniel Jordan <daniel.m.jordan@oracle.com>, Nicolas Viennot <Nicolas.Viennot@twosigma.com>, Thomas Cedeno <thomascedeno@google.com>, Collin Fijalkovich <cfijalkovich@google.com>, Michal Hocko <mhocko@suse.com>, Miklos Szeredi <miklos@szeredi.hu>, Chengguang Xu <cgxu519@mykernel.net>, =?utf-8?q?Christian_K=C3=B6nig?= <ckoenig.leichtzumerken@gmail.com>, linux-unionfs@vger.kernel.org, linux-api@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Date: Thu, 12 Aug 2021 10:43:41 +0200 Message-Id: <20210812084348.6521-1-david@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Remove in-tree usage of MAP_DENYWRITE \| expand [v1,0/7] Remove in-tree usage of MAP_DENYWRITE [v1,1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() [v1,2/7] kernel/fork: factor out atomcially replacing the current MM exe_file [v1,3/7] kernel/fork: always deny write access to current MM exe_file [v1,4/7] binfmt: remove in-tree usage of MAP_DENYWRITE [v1,5/7] mm: remove VM_DENYWRITE [v1,6/7] mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() [v1,7/7] fs: update documentation of get_write_access() and friends

David Hildenbrand Aug. 12, 2021, 8:43 a.m. UTC

This series is based on v5.14-rc5 and corresponds code-wise to the
previously sent RFC [1] (the RFC still applied cleanly).

This series removes all in-tree usage of MAP_DENYWRITE from the kernel
and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
user space applications a while ago because of the chance for DoS.
The last renaming user is binfmt binary loading during exec and
legacy library loading via uselib().

With this change, MAP_DENYWRITE is effectively ignored throughout the
kernel. Although the net change is small, I think the cleanup in mmap()
is quite nice.

There are some (minor) user-visible changes with this series:
1. We no longer deny write access to shared libaries loaded via legacy
   uselib(); this behavior matches modern user space e.g., via dlopen().
2. We no longer deny write access to the elf interpreter after exec
   completed, treating it just like shared libraries (which it often is).
3. We always deny write access to the file linked via /proc/pid/exe:
   sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
   cannot be denied, and write access to the file will remain denied
   until the link is effectivel gone (exec, termination,
   PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.

I was wondering if we really care about permanently disabling write access
to the executable, or if it would be good enough to just disable write
access while loading the new executable during exec; but I don't know
the history of that -- and it somewhat makes sense to deny write access
at least to the main executable. With modern user space -- dlopen() -- we
can effectively modify the content of shared libraries while being used.

There is a related problem [2] with overlayfs, that should at least partly
be tackled by this series. I don't quite understand the interaction of
overlayfs and deny_write_access()/allow_write_access() at exec time:

If we end up denying write access to the wrong file and not to the
realfile, that would be fundamentally broken. We would have to reroute
our deny_write_access()/ allow_write_access() calls for the exec file to
the realfile -- but I leave figuring out the details to overlayfs guys, as
that would be a related but different issue.

RFC -> v1:
- "binfmt: remove in-tree usage of MAP_DENYWRITE"
-- Add a note that this should fix part of a problem with overlayfs

[1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
[2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Cc: Thomas Cedeno <thomascedeno@google.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chengguang Xu <cgxu519@mykernel.net>
Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

David Hildenbrand (7):
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via
    uselib()
  kernel/fork: factor out atomcially replacing the current MM exe_file
  kernel/fork: always deny write access to current MM exe_file
  binfmt: remove in-tree usage of MAP_DENYWRITE
  mm: remove VM_DENYWRITE
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  fs: update documentation of get_write_access() and friends

 arch/x86/ia32/ia32_aout.c      |  8 ++--
 fs/binfmt_aout.c               |  7 ++--
 fs/binfmt_elf.c                |  6 +--
 fs/binfmt_elf_fdpic.c          |  2 +-
 fs/proc/task_mmu.c             |  1 -
 include/linux/fs.h             | 19 +++++----
 include/linux/mm.h             |  3 +-
 include/linux/mman.h           |  4 +-
 include/trace/events/mmflags.h |  1 -
 kernel/events/core.c           |  2 -
 kernel/fork.c                  | 75 ++++++++++++++++++++++++++++++----
 kernel/sys.c                   | 33 +--------------
 lib/test_printf.c              |  5 +--
 mm/mmap.c                      | 29 ++-----------
 mm/nommu.c                     |  2 -
 15 files changed, 98 insertions(+), 99 deletions(-)


base-commit: 36a21d51725af2ce0700c6ebcb6b9594aac658a6

Florian Weimer Aug. 12, 2021, 12:20 p.m. UTC | #1

* David Hildenbrand:

> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libaries loaded via legacy
>    uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
>    completed, treating it just like shared libraries (which it often is).

We have a persistent issue with people using cp (or similar tools) to
replace system libraries.  Since the file is truncated first, all
relocations and global data are replaced by file contents, result in
difficult-to-diagnose crashes.  It would be nice if we had a way to
prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
It could be something completely new, like an option that turns every
future access beyond the truncation point into a signal (rather than
getting bad data or bad code and crashing much later).

I don't know how many invalid copy operations are currently thwarted by
the current program interpreter restriction.  I doubt that lifting the
restriction matters.

> 3. We always deny write access to the file linked via /proc/pid/exe:
>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>    cannot be denied, and write access to the file will remain denied
>    until the link is effectivel gone (exec, termination,
>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>
> I was wondering if we really care about permanently disabling write access
> to the executable, or if it would be good enough to just disable write
> access while loading the new executable during exec; but I don't know
> the history of that -- and it somewhat makes sense to deny write access
> at least to the main executable. With modern user space -- dlopen() -- we
> can effectively modify the content of shared libraries while being used.

Is there a difference between ET_DYN and ET_EXEC executables?

Thanks,
Florian

David Hildenbrand Aug. 12, 2021, 12:47 p.m. UTC | #2

On 12.08.21 14:20, Florian Weimer wrote:
> * David Hildenbrand:
> 
>> There are some (minor) user-visible changes with this series:
>> 1. We no longer deny write access to shared libaries loaded via legacy
>>     uselib(); this behavior matches modern user space e.g., via dlopen().
>> 2. We no longer deny write access to the elf interpreter after exec
>>     completed, treating it just like shared libraries (which it often is).
> 
> We have a persistent issue with people using cp (or similar tools) to
> replace system libraries.  Since the file is truncated first, all
> relocations and global data are replaced by file contents, result in
> difficult-to-diagnose crashes.  It would be nice if we had a way to
> prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
> It could be something completely new, like an option that turns every
> future access beyond the truncation point into a signal (rather than
> getting bad data or bad code and crashing much later).
> 
> I don't know how many invalid copy operations are currently thwarted by
> the current program interpreter restriction.  I doubt that lifting the
> restriction matters.
> 
>> 3. We always deny write access to the file linked via /proc/pid/exe:
>>     sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>>     cannot be denied, and write access to the file will remain denied
>>     until the link is effectivel gone (exec, termination,
>>     PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>>
>> I was wondering if we really care about permanently disabling write access
>> to the executable, or if it would be good enough to just disable write
>> access while loading the new executable during exec; but I don't know
>> the history of that -- and it somewhat makes sense to deny write access
>> at least to the main executable. With modern user space -- dlopen() -- we
>> can effectively modify the content of shared libraries while being used.
> 
> Is there a difference between ET_DYN and ET_EXEC executables?

No, I don't think so. When exec'ing, the main executable will see a 
deny_write_access(file); AFAIKT, that can either be ET_DYN or ET_EXEC.

Eric W. Biederman Aug. 12, 2021, 4:17 p.m. UTC | #3

Florian Weimer <fweimer@redhat.com> writes:

> * David Hildenbrand:
>
>> There are some (minor) user-visible changes with this series:
>> 1. We no longer deny write access to shared libaries loaded via legacy
>>    uselib(); this behavior matches modern user space e.g., via dlopen().
>> 2. We no longer deny write access to the elf interpreter after exec
>>    completed, treating it just like shared libraries (which it often is).
>
> We have a persistent issue with people using cp (or similar tools) to
> replace system libraries.  Since the file is truncated first, all
> relocations and global data are replaced by file contents, result in
> difficult-to-diagnose crashes.  It would be nice if we had a way to
> prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
> It could be something completely new, like an option that turns every
> future access beyond the truncation point into a signal (rather than
> getting bad data or bad code and crashing much later).
>
> I don't know how many invalid copy operations are currently thwarted by
> the current program interpreter restriction.  I doubt that lifting the
> restriction matters.

I suspect that what should happen is that we should make shared
libraries and executables read-only on disk.

We could potentially take this a step farther and introduce a new sysctl
that causes "mmap(adr, len, PROT_EXEC, MAP_SHARED, fd, off)" but not
PROT_WRITE to fail if the file can be written by anyone.  That sysctl
could even deny chown adding write access to the file if there are
mappings open.

Given that there hasn't been enough pain for people to install shared
libraries read-only yet I suspect just installing executables and shared
libraries without write-permissions on disk is enough to prevent the
hard to track down bugs you have been talking about.

>> 3. We always deny write access to the file linked via /proc/pid/exe:
>>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>>    cannot be denied, and write access to the file will remain denied
>>    until the link is effectivel gone (exec, termination,
>>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>>
>> I was wondering if we really care about permanently disabling write access
>> to the executable, or if it would be good enough to just disable write
>> access while loading the new executable during exec; but I don't know
>> the history of that -- and it somewhat makes sense to deny write access
>> at least to the main executable. With modern user space -- dlopen() -- we
>> can effectively modify the content of shared libraries while being used.
>
> Is there a difference between ET_DYN and ET_EXEC executables?

What is being changed is how we track which files to denying write
access on.  Instead of denying write-access based on a per mapping (aka
mmap) basis, the new code is only denying access to /proc/self/exe.

Because the method of tracking is much coarser is why the interper stops
being protected.  The code doesn't care how the mappings happen, only
if the file is /proc/self/exe or not.

Eric

Eric W. Biederman Aug. 12, 2021, 5:32 p.m. UTC | #4

David Hildenbrand <david@redhat.com> writes:

> This series is based on v5.14-rc5 and corresponds code-wise to the
> previously sent RFC [1] (the RFC still applied cleanly).
>
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last renaming user is binfmt binary loading during exec and
> legacy library loading via uselib().
>
> With this change, MAP_DENYWRITE is effectively ignored throughout the
> kernel. Although the net change is small, I think the cleanup in mmap()
> is quite nice.
>
> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libaries loaded via legacy
>    uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
>    completed, treating it just like shared libraries (which it often is).
> 3. We always deny write access to the file linked via /proc/pid/exe:
>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>    cannot be denied, and write access to the file will remain denied
>    until the link is effectivel gone (exec, termination,
>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>
> I was wondering if we really care about permanently disabling write access
> to the executable, or if it would be good enough to just disable write
> access while loading the new executable during exec; but I don't know
> the history of that -- and it somewhat makes sense to deny write access
> at least to the main executable. With modern user space -- dlopen() -- we
> can effectively modify the content of shared libraries while being
> used.

So I think what we really want to do is to install executables with
and shared libraries without write permissions and immutable.  So that
upgrades/replacements of the libraries and executables are forced to
rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
aka being root ignores the writable bits when a file is opened for
write.  However CAP_DAC_OVERRIDE does not override the immutable state
of a file.

I believe that denying write access at exec mmap time is actually much
to late in the process and making the denial of writing much larger in
scope is fundamentally what we want to do.  Changing how we install the
files, avoids the denial of service problems that MAP_DENYWRITE had.
Making the denial always happen ensures that installation programs are
never fooled into thinking a non-atomic update of an executable or
shared library is ok.

Still that is non-kernel work so I don't know who would make that
change.

As this fundamentally simplifies and a design mistake with very little
functional change.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

For the entire series.


> There is a related problem [2] with overlayfs, that should at least partly
> be tackled by this series. I don't quite understand the interaction of
> overlayfs and deny_write_access()/allow_write_access() at exec time:
>
> If we end up denying write access to the wrong file and not to the
> realfile, that would be fundamentally broken. We would have to reroute
> our deny_write_access()/ allow_write_access() calls for the exec file to
> the realfile -- but I leave figuring out the details to overlayfs guys, as
> that would be a related but different issue.
>
> RFC -> v1:
> - "binfmt: remove in-tree usage of MAP_DENYWRITE"
> -- Add a note that this should fix part of a problem with overlayfs
>
> [1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
> [2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Greg Ungerer <gerg@linux-m68k.org>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Cc: Chinwen Chang <chinwen.chang@mediatek.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Shawn Anastasio <shawn@anastas.io>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Marco Elver <elver@google.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Cc: Thomas Cedeno <thomascedeno@google.com>
> Cc: Collin Fijalkovich <cfijalkovich@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Chengguang Xu <cgxu519@mykernel.net>
> Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
> Cc: linux-unionfs@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Cc: x86@kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
>
> David Hildenbrand (7):
>   binfmt: don't use MAP_DENYWRITE when loading shared libraries via
>     uselib()
>   kernel/fork: factor out atomcially replacing the current MM exe_file
>   kernel/fork: always deny write access to current MM exe_file
>   binfmt: remove in-tree usage of MAP_DENYWRITE
>   mm: remove VM_DENYWRITE
>   mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
>   fs: update documentation of get_write_access() and friends
>
>  arch/x86/ia32/ia32_aout.c      |  8 ++--
>  fs/binfmt_aout.c               |  7 ++--
>  fs/binfmt_elf.c                |  6 +--
>  fs/binfmt_elf_fdpic.c          |  2 +-
>  fs/proc/task_mmu.c             |  1 -
>  include/linux/fs.h             | 19 +++++----
>  include/linux/mm.h             |  3 +-
>  include/linux/mman.h           |  4 +-
>  include/trace/events/mmflags.h |  1 -
>  kernel/events/core.c           |  2 -
>  kernel/fork.c                  | 75 ++++++++++++++++++++++++++++++----
>  kernel/sys.c                   | 33 +--------------
>  lib/test_printf.c              |  5 +--
>  mm/mmap.c                      | 29 ++-----------
>  mm/nommu.c                     |  2 -
>  15 files changed, 98 insertions(+), 99 deletions(-)
>
>
> base-commit: 36a21d51725af2ce0700c6ebcb6b9594aac658a6

Andy Lutomirski Aug. 12, 2021, 5:35 p.m. UTC | #5

On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
> > This series is based on v5.14-rc5 and corresponds code-wise to the
> > previously sent RFC [1] (the RFC still applied cleanly).
> >
> > This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> > user space applications a while ago because of the chance for DoS.
> > The last renaming user is binfmt binary loading during exec and
> > legacy library loading via uselib().
> >
> > With this change, MAP_DENYWRITE is effectively ignored throughout the
> > kernel. Although the net change is small, I think the cleanup in mmap()
> > is quite nice.
> >
> > There are some (minor) user-visible changes with this series:
> > 1. We no longer deny write access to shared libaries loaded via legacy
> >    uselib(); this behavior matches modern user space e.g., via dlopen().
> > 2. We no longer deny write access to the elf interpreter after exec
> >    completed, treating it just like shared libraries (which it often is).
> > 3. We always deny write access to the file linked via /proc/pid/exe:
> >    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
> >    cannot be denied, and write access to the file will remain denied
> >    until the link is effectivel gone (exec, termination,
> >    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
> >
> > I was wondering if we really care about permanently disabling write access
> > to the executable, or if it would be good enough to just disable write
> > access while loading the new executable during exec; but I don't know
> > the history of that -- and it somewhat makes sense to deny write access
> > at least to the main executable. With modern user space -- dlopen() -- we
> > can effectively modify the content of shared libraries while being
> > used.
> 
> So I think what we really want to do is to install executables with
> and shared libraries without write permissions and immutable.  So that
> upgrades/replacements of the libraries and executables are forced to
> rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
> aka being root ignores the writable bits when a file is opened for
> write.  However CAP_DAC_OVERRIDE does not override the immutable state
> of a file.

If we really want to do this, I think we'd want a different flag that's more like sealed.  Non-root users should be able to do this, too.

Or we could just more gracefully handle users that overwrite running programs.

--Andy

Eric W. Biederman Aug. 12, 2021, 5:48 p.m. UTC | #6

"Andy Lutomirski" <luto@kernel.org> writes:

> On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > This series is based on v5.14-rc5 and corresponds code-wise to the
>> > previously sent RFC [1] (the RFC still applied cleanly).
>> >
>> > This series removes all in-tree usage of MAP_DENYWRITE from the kernel
>> > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
>> > user space applications a while ago because of the chance for DoS.
>> > The last renaming user is binfmt binary loading during exec and
>> > legacy library loading via uselib().
>> >
>> > With this change, MAP_DENYWRITE is effectively ignored throughout the
>> > kernel. Although the net change is small, I think the cleanup in mmap()
>> > is quite nice.
>> >
>> > There are some (minor) user-visible changes with this series:
>> > 1. We no longer deny write access to shared libaries loaded via legacy
>> >    uselib(); this behavior matches modern user space e.g., via dlopen().
>> > 2. We no longer deny write access to the elf interpreter after exec
>> >    completed, treating it just like shared libraries (which it often is).
>> > 3. We always deny write access to the file linked via /proc/pid/exe:
>> >    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>> >    cannot be denied, and write access to the file will remain denied
>> >    until the link is effectivel gone (exec, termination,
>> >    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>> >
>> > I was wondering if we really care about permanently disabling write access
>> > to the executable, or if it would be good enough to just disable write
>> > access while loading the new executable during exec; but I don't know
>> > the history of that -- and it somewhat makes sense to deny write access
>> > at least to the main executable. With modern user space -- dlopen() -- we
>> > can effectively modify the content of shared libraries while being
>> > used.
>> 
>> So I think what we really want to do is to install executables with
>> and shared libraries without write permissions and immutable.  So that
>> upgrades/replacements of the libraries and executables are forced to
>> rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
>> aka being root ignores the writable bits when a file is opened for
>> write.  However CAP_DAC_OVERRIDE does not override the immutable state
>> of a file.
>
> If we really want to do this, I think we'd want a different flag
> that's more like sealed.  Non-root users should be able to do this,
> too.
>
> Or we could just more gracefully handle users that overwrite running
> programs.

I had a blind spot, and Florian Weimer made a very reasonable request.
Apparently userspace for shared libraires uses MAP_PRIVATE.

So we almost don't care if the library is overwritten.  We loose some
efficiency and apparently there are some corner cases like the library
being extended past the end of the exiting file that are problematic.

Given that MAP_PRIVATE for shared libraries is our strategy for handling
writes to shared libraries perhaps we just need to use MAP_POPULATE or a
new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
everything mapped from the executable is guaranteed to be visible from
the time of the mmap, and any changes from the filesystem side after
that are guaranteed to cause a copy on write.

Once we get that figured out we could consider getting rid of deny-write
entirely.

Eric

Andy Lutomirski Aug. 12, 2021, 6:01 p.m. UTC | #7

On Thu, Aug 12, 2021, at 10:48 AM, Eric W. Biederman wrote:
> "Andy Lutomirski" <luto@kernel.org> writes:

> I had a blind spot, and Florian Weimer made a very reasonable request.
> Apparently userspace for shared libraires uses MAP_PRIVATE.
> 
> So we almost don't care if the library is overwritten.  We loose some
> efficiency and apparently there are some corner cases like the library
> being extended past the end of the exiting file that are problematic.
> 
> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
> everything mapped from the executable is guaranteed to be visible from
> the time of the mmap, and any changes from the filesystem side after
> that are guaranteed to cause a copy on write.
> 
> Once we get that figured out we could consider getting rid of deny-write
> entirely.

Are all of the CoW bits in good enough shape for this to work without just immediately CoWing the whole file?  In principle, write(2) to a file should be able to notice that it needs to CoW some pages, but I doubt that this actually works.

--Andy

Linus Torvalds Aug. 12, 2021, 6:10 p.m. UTC | #8

On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW)

No. That would be horrible for the usual bloated GUI libraries. It
might help some (dynamic page faults are not cheap either), but it
would hurt a lot.

This is definitely a "if you overwrite a system library while it's
being used, you get to keep both pieces" situation.

The kernel ETXTBUSY thing is purely a courtesy feature, and as people
have noticed it only really works for the main executable because of
various reasons. It's not something user space should even rely on,
it's more of a "ok, you're doing something incredibly stupid, and
we'll help you avoid shooting yourself in the foot when we notice".

Any distro should make sure their upgrade tools don't just
truncate/write to random libraries executables.

And if they do, it's really not a kernel issue.

This patch series basically takes this very historical error return,
and simplifies and clarifies the implementation, and in the process
might change some very subtle corner case (unmapping the original
executable entirely?). I hope (and think) it wouldn't matter exactly
because this is a "courtesy error" rather than anything that a sane
setup would _depend_ on, but hey, insane setups clearly exist.

               Linus

Florian Weimer Aug. 12, 2021, 6:15 p.m. UTC | #9

* Eric W. Biederman:

> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
> everything mapped from the executable is guaranteed to be visible from
> the time of the mmap, and any changes from the filesystem side after
> that are guaranteed to cause a copy on write.

I think this is called MAP_COPY:

  <https://www.gnu.org/software/hurd/glibc/mmap.html>

If we could get that functionality, we would certainly use it in the
glibc dynamic loader.  And it's not just dynamic loaders that would
benefit.

But I assume there are some rather thorny issues around semantics.  If
the changed areas of the file are in the page cache already, everything
is okay.  But if parts of the file are changed or discarded that are
not, they would have to be read in first, which is rather awkward.

That's why I suggested the signal for all future accesses.  It seems
more tractable and addresses the largest issue (the difficulty of
figuring out why some processes crash occasionally).

Thanks,
Florian

Linus Torvalds Aug. 12, 2021, 6:21 p.m. UTC | #10

On Thu, Aug 12, 2021 at 8:16 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> I think this is called MAP_COPY:
>
>   <https://www.gnu.org/software/hurd/glibc/mmap.html>

Please don't even consider the crazy notions that GNU Hurd did.

It's a fundamental design mistake. The Hurd VM was horrendous, and
MAP_COPY was a prime example of the kinds of horrors it had.

I'm not sure how much of the mis-designs were due to Hurd, and how
much of it due to Mach 3. But please don't point to Hurd VM
documentation except possibly to warn people. We want people to
_forget_ those mistakes, not repeat them.

          Linus

Eric W. Biederman Aug. 12, 2021, 6:47 p.m. UTC | #11

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Given that MAP_PRIVATE for shared libraries is our strategy for handling
>> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
>> new related flag (perhaps MAP_PRIVATE_NOW)
>
> No. That would be horrible for the usual bloated GUI libraries. It
> might help some (dynamic page faults are not cheap either), but it
> would hurt a lot.

I wasn't aiming so much at the MAP_POPULATE part but something that
would trigger cow from writes to the file.  I see code that is close but
I don't see any code in the kernel that would implement that currently.

Upon reflection I think it will always be difficult to trigger cow from
the file write side of the kernel as code that would cow the page in
the page cache would cause problems with writable mmaps.

> This is definitely a "if you overwrite a system library while it's
> being used, you get to keep both pieces" situation.
>
> The kernel ETXTBUSY thing is purely a courtesy feature, and as people
> have noticed it only really works for the main executable because of
> various reasons. It's not something user space should even rely on,
> it's more of a "ok, you're doing something incredibly stupid, and
> we'll help you avoid shooting yourself in the foot when we notice".
>
> Any distro should make sure their upgrade tools don't just
> truncate/write to random libraries executables.

Yes.  I am trying to come up with advice on how userspace
implementations can implement their tools to use other mechanisms that
solve the overwriting shared libaries and executables problem that
are not broken by design.

For a little bit the way Florian Weirmer was talking and the fact that
uselib uses MAP_PRIVATE had me thinking that somehow MAP_PRIVATE could
be part of the solution.  I have now looked into the implementation of
MAP_PRIVATE and I since we don't perform the cow on filesystem writes
MAP_PRIVATE absolutely can not be part of the solution we recommend to
userspace.

So today the best advice I can give to userspace is to mark their
executables and shared libraries as read-only and immutable.  Otherwise
a change to the executable file can change what is mapped into memory.
MAP_PRIVATE does not help.

> And if they do, it's really not a kernel issue.

What is a kernel issue is giving people good advice on how to use kernel
features to solve real world problems.  I have seen the write to a
mapped exectuable/shared lib problem, and Florian has seen it.  So while
rare the problem is real and a pain to debug.

> This patch series basically takes this very historical error return,
> and simplifies and clarifies the implementation, and in the process
> might change some very subtle corner case (unmapping the original
> executable entirely?). I hope (and think) it wouldn't matter exactly
> because this is a "courtesy error" rather than anything that a sane
> setup would _depend_ on, but hey, insane setups clearly exist.

Oh yes.

I very much agree that the design of this patchset is perfectly fine.

I also see that MAP_DENYWRITE is unfortunately broken by design.  I
vaguely remember the discussion when MAP_DENYWRITE was made a noop
because of the denial-of-service aspect of MAP_DENYWRITE.

I very much agree that we should strongly encourage userspace not
to write to mmaped files.

As I am learning with my two year old, it helps to give a constructive
suggestion of alternative behavior instead of just saying no.
Florian reported that there remains a problem in userspace. So I am
coming up with a constructive suggestion.  My apologies for going off
into the weeds for a moment.

Eric

David Hildenbrand Aug. 12, 2021, 7:24 p.m. UTC | #12

On 12.08.21 20:10, Linus Torvalds wrote:
> On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Given that MAP_PRIVATE for shared libraries is our strategy for handling
>> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
>> new related flag (perhaps MAP_PRIVATE_NOW)
> 
> No. That would be horrible for the usual bloated GUI libraries. It
> might help some (dynamic page faults are not cheap either), but it
> would hurt a lot.

Right, we most certainly don't want to waste system ram / swap space, 
memory for page tables, and degrade performance just because some 
corner-case nasty user space could harm itself.

> 
> This is definitely a "if you overwrite a system library while it's
> being used, you get to keep both pieces" situation.

Right, play stupid games, win stupid prices. I agree that if there would 
be an efficient way to detect+handle such overwrites gracefully, it 
would be great to have the kernel support that. ETXTBUSY as implemented 
with this series (but also before this series) is really only a 
minimalistic approach to help detect some issues regarding the main 
executable.

David Laight Aug. 13, 2021, 9:05 a.m. UTC | #13

From: Eric W. Biederman
> Sent: 12 August 2021 19:47
...
> So today the best advice I can give to userspace is to mark their
> executables and shared libraries as read-only and immutable.  Otherwise
> a change to the executable file can change what is mapped into memory.
> MAP_PRIVATE does not help.

While 'immutable' might be ok for files installed by distributions
it would be a PITA in development.

ETXTBUSY is a useful reminder that the file you are copying from
machine A to machine B (etc) is still running and probably ought
to be killed/stopped before you get confused.

I've never really understood why it doesn't stop shared libraries
being overwritten - but they do tend to be updated less often.

Overwriting an in-use shared library could be really confusing.
It is likely that all the code is actually in memory.
So everything carries on running as normal.
Until the kernel gets under memory pressure and discards a page.
Then a page from the new version is faulted in and random
programs start getting SEGVs.
This could be days after the borked update.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Eric W. Biederman Aug. 13, 2021, 8:17 p.m. UTC | #14

David Laight <David.Laight@ACULAB.COM> writes:

> From: Eric W. Biederman
>> Sent: 12 August 2021 19:47
> ...
>> So today the best advice I can give to userspace is to mark their
>> executables and shared libraries as read-only and immutable.  Otherwise
>> a change to the executable file can change what is mapped into memory.
>> MAP_PRIVATE does not help.
>
> While 'immutable' might be ok for files installed by distributions
> it would be a PITA in development.

For development simply making the files read-only should be sufficient.

What I think should happen is when a new binary is installed it should
always be placed in a new file and renamed to the old name, rather than
copied over the old file.  That can be added to a makefile by writing to
a temporary file name and then using "mv" to the final name.

I tried to look at which options I would need to give to install to
implement that pattern but I don't see it.  Perhaps -b for backup.

I thought I could overcome the feature of CAP_DAC_OVERRIDE where
read-only files can be over-written by adding the immutable attribute.
I just tested and reread the code and I see that using immutable has
a couple of problems.  Only root is allowed to set immutable, and not
even root is allowed to rename or delete immutable files while the
immutable attribute is set.

> ETXTBUSY is a useful reminder that the file you are copying from
> machine A to machine B (etc) is still running and probably ought
> to be killed/stopped before you get confused.

That is true.

> I've never really understood why it doesn't stop shared libraries
> being overwritten - but they do tend to be updated less often.

The problem is that MAP_DENYWRITE can be applied to any file.  Which
makes it another kind of mandatory file lock, and it allows blocking
preventing all writes to any file you can read/mmap.  Which creates all
kinds of denial-of service opportunities.

A nasty example would be using mmap MAP_DENYWRITE on /var/log/messages.
Which a hostile actor could use to hide traces of their presence on a
machine.

So far no one has pointed out how to abuse denying writes to
/proc/self/exe so we can keep the denywrite behavior there for now.

> Overwriting an in-use shared library could be really confusing.
> It is likely that all the code is actually in memory.
> So everything carries on running as normal.
> Until the kernel gets under memory pressure and discards a page.
> Then a page from the new version is faulted in and random
> programs start getting SEGVs.
> This could be days after the borked update.

This should actually happen quite quickly after a borked update.  The
pages in the page cache are cache coherent so as soon as you get a cpu
cache flush the new contents of the overwritten page will be visible.

Which gets to half of the confusion with this.  Long ago and far away
rtld in glibc was written with the assumption that something called
MAP_COPY existed (I think it exists on hurd?).  On Linux at one point it
was emulated with MAP_PRIVATE | MAP_DENYWRITE.  Then we made
MAP_DENYWRITE a noop.  This was probably 20 years ago now.

The glibc rtld implementation is still written using MAP_COPY with
MAP_COPY defined to MAP_PRIVATE | MAP_DENYWRITE on linux.

Even with all of the improvements to the linux mm subsystem since the
year 2000.  Linux the linux mm code does not have the infrastructure to
support MAP_COPY.  Semantically MAP_COPY sounds nice.  An implementation
unfortunately would require that anyone who performs a write(2) to a
file mapped MAP_COPY would require walking the file mapping data
structures and creating one copy of the page for each place that page is
mapped MAP_COPY.  Which in the case of someone overwriting rtld would
require one copy of ld.so in memory for each program running on the
system.  A 457 * 162KiB = 74MiB increase on my little system that has
been up for a while.  Where libc would require something like 457 *
1.8MiB = 822.6MiB.  The performance of creating those copies would
likely also be atrocious.

Anything short of performing copy-on-write when a file is written using
write(2) would not provide the protection for programs.

Florian Weimer, would it be possible to get glibc's ld.so implementation to use
MAP_SHARED?  Just so people reading the code know what to expect of the
kernel?  As far as I can tell there is not a practical difference
between a read-only MAP_PRIVATE and a read-only MAP_SHARED.

Michael Kerrisk, is there any change we can document that in linux for
MAP_PRIVATE mappings the mapped pages will match the underlying file
unless a value is written to a page through the mapping?

Eric

Florian Weimer Aug. 13, 2021, 8:51 p.m. UTC | #15

* Eric W. Biederman:

> Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> MAP_SHARED?  Just so people reading the code know what to expect of the
> kernel?  As far as I can tell there is not a practical difference
> between a read-only MAP_PRIVATE and a read-only MAP_SHARED.

Some applications use mprotect to change page protections behind glibc's
back.  Using MAP_SHARED would break fork pretty badly.

Most of the hard-to-diagnose crashes seem to come from global data or
relocations because they are wiped by truncation.  And we certainly
can't use MAP_SHARED for those.  Code often seems to come back unchanged
after the truncation because the overwritten file hasn't actually
changed.  File attributes don't help because the copying is an
adminstrative action in the context of the application (maybe the result
of some automation).

I think avoiding the crashes isn't the right approach.  What I'd like to
see is better diagnostics.  Writing mtime and ctime to the core file
might help.  Or adding a flag to the core file and /proc/PID/smaps that
indicates if the file has been truncated across the mapping since the
mapping was created.

A bit less conservative and even more obvious to diagnose would be a new
flag for the mapping (perhaps set via madvise) that causes any future
access to the mapping to fault with SIGBUS and a special si_code value
after the file has been truncated across the mapping.  I think we would
set that in the glibc dynamic loader.  It would make the crashes much
less weird.

Thanks,
Florian

Linus Torvalds Aug. 14, 2021, 12:31 a.m. UTC | #16

On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> MAP_SHARED?  Just so people reading the code know what to expect of the
> kernel?  As far as I can tell there is not a practical difference
> between a read-only MAP_PRIVATE and a read-only MAP_SHARED.

There's a huge difference.

For one, you actually don't necessarily want read-only. Doing COW on
library images is quite common for things like relocation etc (you'd
_hope_ everything is PC-relative, but no)

So no. Never EVER use MAP_SHARED unless you literally expect to have
two different mappings that need to be kept in sync and one writes the
other.

I'll just repeat: stop arguing about this case. If somebody writes to
a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care
at all about it apart from the "you get what you deserve".

What's next? Do you think glibc should also map every byte in the user
address space so that user programs don't get SIGSEGV when they have
wild pointers?

Again - that's a user BUG and trying to "work around" a wild pointer
is a worse fix than the problem it tries to fix.

The exact same thing is true for shared library (or executable)
mappings. Trying to work around people writing to them is *worse* than
the bug of doing so.

Stop this completely inane discussion already.

                  Linus

Andy Lutomirski Aug. 14, 2021, 12:49 a.m. UTC | #17

On Fri, Aug 13, 2021, at 5:31 PM, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> > MAP_SHARED?  Just so people reading the code know what to expect of the
> > kernel?  As far as I can tell there is not a practical difference
> > between a read-only MAP_PRIVATE and a read-only MAP_SHARED.
> 
> There's a huge difference.
> 
> For one, you actually don't necessarily want read-only. Doing COW on
> library images is quite common for things like relocation etc (you'd
> _hope_ everything is PC-relative, but no)
> 
> So no. Never EVER use MAP_SHARED unless you literally expect to have
> two different mappings that need to be kept in sync and one writes the
> other.
> 
> I'll just repeat: stop arguing about this case. If somebody writes to
> a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care
> at all about it apart from the "you get what you deserve".
> 
> What's next? Do you think glibc should also map every byte in the user
> address space so that user programs don't get SIGSEGV when they have
> wild pointers?
> 
> Again - that's a user BUG and trying to "work around" a wild pointer
> is a worse fix than the problem it tries to fix.
> 
> The exact same thing is true for shared library (or executable)
> mappings. Trying to work around people writing to them is *worse* than
> the bug of doing so.
> 
> Stop this completely inane discussion already.
> 

I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.

In my life, I’ve encountered -ETXTBUSY intermittently, and it invariably means that I somehow failed to finish killing a program fast enough for whatever random rebuild I’m doing to succeed. It’s at best erratic — it only applies for static binaries, and it has never once saved me from a problem I care about. If the program I’m recompiling crashes, I don’t care — it’s probably already part way through dying from an unrelated fatal signal.  What actually happens is that I see -ETXTBUSY, think “wait, this isn’t Windows, why are there file sharing rules,” then think “wait, Linux has *one* half baked file sharing rule,” and go on with my life. [0]

Seriously, can we deprecate and remove the whole thing?

[0] we have mandatory locks, too. Sigh.

Linus Torvalds Aug. 14, 2021, 12:54 a.m. UTC | #18

On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.

I think that would be ok, except I can see somebody relying on it.

It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.

But you are right that we have removed parts of it over time (no more
MAP_DENYWRITE, no more uselib()) so that what we have today is a
fairly weak form of what we used to do.

And nobody really complained when we weakened it, so maybe removing it
entirely might be acceptable.

              Linus

Linus Torvalds Aug. 14, 2021, 12:58 a.m. UTC | #19

On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And nobody really complained when we weakened it, so maybe removing it
> entirely might be acceptable.

I guess we could just try it and see... Worst comes to worst, we'll
have to put it back, but at least we'd know what crazy thing still
wants it..

              Linus

Al Viro Aug. 14, 2021, 1:57 a.m. UTC | #20

On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And nobody really complained when we weakened it, so maybe removing it
> > entirely might be acceptable.
> 
> I guess we could just try it and see... Worst comes to worst, we'll
> have to put it back, but at least we'd know what crazy thing still
> wants it..

Umm...  I'll need to go back and look through the thread, but I'm
fairly sure that there used to be suckers that did replacement of
binary that way (try to write, count on exclusion with execve while
it's being written to) instead of using rename.  Install scripts
of weird crap and stuff like that...

Al Viro Aug. 14, 2021, 2:02 a.m. UTC | #21

On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> > 
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
> 
> Umm...  I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename.  Install scripts
> of weird crap and stuff like that...

... and before anyone goes off - I certainly agree that using that
behaviour is not a good idea and had never been one.  All I'm saying
is that there at least used to be very random (and rarely exercised)
bits of userland relying upon that behaviour.

Matthew Wilcox Aug. 14, 2021, 3:04 a.m. UTC | #22

On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> [0] we have mandatory locks, too. Sigh.

I'd love to remove that.  Perhaps we could try persuading more of the
distros to disable the CONFIG option first.

Christian Brauner Aug. 14, 2021, 7:53 a.m. UTC | #23

On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> > 
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
> 
> Umm...  I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename.  Install scripts
> of weird crap and stuff like that...

I'm not agains trying to remove it, but I think Al has a point.

Removing the write protection will also most certainly make certain
classes of attacks _easier_. For example, the runC container breakout
from last year using privileged containers issued CVE-2019-5736 would be
easier. I'm quoting from the commit I fixed this with:

    The attack can be made when attaching to a running container or when starting a
    container running a specially crafted image.  For example, when runC attaches
    to a container the attacker can trick it into executing itself. This could be
    done by replacing the target binary inside the container with a custom binary
    pointing back at the runC binary itself. As an example, if the target binary
    was /bin/bash, this could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe (/proc/self/exec is a symbolic link created
    by the kernel for every process which points to the binary that was executed
    for that process). As such when /bin/bash is executed inside the container,
    instead the target of /proc/self/exe will be executed - which will point to the
    runc binary on the host. The attacker can then proceed to write to the target
    of /proc/self/exe to try and overwrite the runC binary on the host.

and then the write protection kicks in of course:

    However in general, this will not succeed as the kernel will not
    permit it to be overwritten whilst runC is executing.

which the attack can of course already overcome nowadays with minimal
smarts:

    To overcome this, the attacker can instead open a file descriptor to
    /proc/self/exe using the O_PATH flag and then proceed to reopen the
    binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it
    in a busy loop from a separate process. Ultimately it will succeed
    when the runC binary exits. After this the runC binary is
    compromised and can be used to attack other containers or the host
    itself.

But with write protection removed you'd allow such attacks to succeed
right away. It's not a huge deal to remove it since we need to have
other protection mechanisms in place already:

    To prevent this attack, LXC has been patched to create a temporary copy of the
    calling binary itself when it starts or attaches to containers. To do this LXC
    creates an anonymous, in-memory file using the memfd_create() system call and
    copies itself into the temporary in-memory file, which is then sealed to
    prevent further modifications. LXC then executes this sealed, in-memory file
    instead of the original on-disk binary. Any compromising write operations from
    a privileged container to the host LXC binary will then write to the temporary
    in-memory binary and not to the host binary on-disk, preserving the integrity
    of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed,
    writes to this will also fail.

    Note: memfd_create() was added to the Linux kernel in the 3.17 release.

However, I still like to pich the upgrade mask idea Aleksa and we tried
to implement when we did openat2(). If we leave write-protection in
preventing /proc/self/exe from being written to:

we can take some time and upstream the upgrade mask patchset which was
part of the initial openat2() patchset but was dropped back then (and I
had Linus remove the last remants of the idea in [1]).

The idea was to add a new field to struct open_how "upgrade_mask" that
would allow a caller to specify with what permissions an fd could be
reopened with. I still like this idea a great deal and it would be a
very welcome addition to system management programs. The upgrade mask is
of course optional, i.e. the caller would have to specify the upgrade
mask at open time to restrict reopening (lest we regress the whole
world).

But, we could make it so that an O_PATH fd gotten from opening
/proc/<pid>/exe always gets a restricted upgrade mask set and so it
can't be upgraded to a O_WRONLY fd afterwards. For this to be
meaningful, write protection for /proc/self/exe would need to be kept.

[1]: commit 5c350aa11b441b32baf3bfe4018168cb8d10cef7
     Author: Christian Brauner <christian.brauner@ubuntu.com>
     Date:   Fri May 28 11:24:15 2021 +0200
     
         fcntl: remove unused VALID_UPGRADE_FLAGS
     
         We currently do not maky use of this feature and should we implement
         something like this in the future it's trivial to add it back.
     
         Link: https://lore.kernel.org/r/20210528092417.3942079-2-brauner@kernel.org
         Cc: Christoph Hellwig <hch@lst.de>
         Cc: Aleksa Sarai <cyphar@cyphar.com>
         Cc: Al Viro <viro@zeniv.linux.org.uk>
         Cc: linux-fsdevel@vger.kernel.org
         Suggested-by: Richard Guy Briggs <rgb@redhat.com>
         Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
         Reviewed-by: Christoph Hellwig <hch@lst.de>
         Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

David Hildenbrand Aug. 14, 2021, 9:06 a.m. UTC | #24

On 14.08.21 04:02, Al Viro wrote:
> On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
>> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
>>> On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> And nobody really complained when we weakened it, so maybe removing it
>>>> entirely might be acceptable.
>>>
>>> I guess we could just try it and see... Worst comes to worst, we'll
>>> have to put it back, but at least we'd know what crazy thing still
>>> wants it..
>>
>> Umm...  I'll need to go back and look through the thread, but I'm
>> fairly sure that there used to be suckers that did replacement of
>> binary that way (try to write, count on exclusion with execve while
>> it's being written to) instead of using rename.  Install scripts
>> of weird crap and stuff like that...
> 
> ... and before anyone goes off - I certainly agree that using that
> behaviour is not a good idea and had never been one.  All I'm saying
> is that there at least used to be very random (and rarely exercised)
> bits of userland relying upon that behaviour.
> 

Removing it completely is certainly more controversial than limiting it 
to the main executable. I'm mostly happy as long as we get rid of that 
nasty per-VMA handling, because that adds real complexity at places that 
are complicated enough.

Having the remaining deny_write_access()/allow_write_access() at sane 
places now (loading a new binary, exchanging exe_file) looks certainly 
much cleaner and I still consider it a valuable, simple sanity feature 
to have around. I don't think there is any sane use case for modifying 
the main executable, and it seems to be very easy to catch.

For example, besides users that rely on this behavior, in my thinking 
(see the cover letter), especially having a binary not getting changed 
while we're loading it sounds like a very good idea (not saying we would 
expose a way to exploit the kernel if we would allow for modifications 
while in the elf parser, but also not saying we wouldn't because I 
didn't check if there would be a way; at least we already allow it in 
the legacy library loader before mapping the segments with 
MAP_DENYWRITE). And if we decide to keep the behavior while loading the 
executable, keeping it while exe_file is set isn't much added 
code/complexity IMHO.

Long story short, I'd vote for keeping it in, and if we decide to rip it 
out completely, do it a a separate, more careful step.

David Laight Aug. 14, 2021, 7:52 p.m. UTC | #25

From: Linus Torvalds
> Sent: 14 August 2021 01:55
> 
> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism
> entirely.
> 
> I think that would be ok, except I can see somebody relying on it.
> 
> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.

I think ETXTBUSY predates Linux itself.
But I can't remember whether the elf versions of sunos or svr4
implemented it for shared libraries.
I don't remember hitting it, so they may not have.

I'm actually surprised it ia an mmap() flag rather than an open() one.
Being able to open a file and guarantee it can't be changed seems a sane idea.
And not just for programs/libraries.

By the sound of it 'immutable' is no use.
You need to be able to unlink the file - otherwise you get into the
window's fiasco of not being able to update without 17 reboots.

FWIW MAP_COPY would only need to take one copy of the page - all the
users could share the same page (backed by a single page of swap).
Not that I'm suggesting it is a good idea at all.

I do wonder about /proc/self/exe though.
It gave the NetBSD Linux emulation a terrible problem.
Being able to open the inode of the program is fine.
The problem is the what readlink() returns - it is basically stale.
If a program open the link contents it could get anything at all.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Eric W. Biederman Aug. 17, 2021, 4:48 p.m. UTC | #26

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>> [0] we have mandatory locks, too. Sigh.
>
> I'd love to remove that.  Perhaps we could try persuading more of the
> distros to disable the CONFIG option first.

Yes.  The support is disabled in RHEL8.

Does anyone know the appropriate people to talk to encourage other
distro's to encourage them to disable the CONFIG_MANDATORY_FILE_LOCKING?

Either that or we can wait until the code bit-rots, but distro's
disabling and removing a feature on their own is the more responsible
path.

Given how many hoops need to be jumped through to use mandatory file
locking once it is enabled, and the fact it has never worked in
containers makes me suspect there are no more users.

Eric

David Hildenbrand Aug. 17, 2021, 4:50 p.m. UTC | #27

On 17.08.21 18:48, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>>> [0] we have mandatory locks, too. Sigh.
>>
>> I'd love to remove that.  Perhaps we could try persuading more of the
>> distros to disable the CONFIG option first.
> 
> Yes.  The support is disabled in RHEL8.

kernel-ark also seems to not set it for Fedora and ARK

redhat/configs/common/generic/CONFIG_MANDATORY_FILE_LOCKING:# 
CONFIG_MANDATORY_FILE_LOCKING is not set

Christian Brauner Aug. 18, 2021, 7:51 a.m. UTC | #28

On Sat, Aug 14, 2021 at 04:04:09AM +0100, Matthew Wilcox wrote:
> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > [0] we have mandatory locks, too. Sigh.
> 
> I'd love to remove that.  Perhaps we could try persuading more of the
> distros to disable the CONFIG option first.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940392

Rodrigo Campos Aug. 18, 2021, 9:34 a.m. UTC | #29

On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> >> [0] we have mandatory locks, too. Sigh.
> >
> > I'd love to remove that.  Perhaps we could try persuading more of the
> > distros to disable the CONFIG option first.
>
> Yes.  The support is disabled in RHEL8.

If it helps, it seems to be enabled on the just released debian stable:
    $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
    CONFIG_MANDATORY_FILE_LOCKING=y

Also the new 5.13 kernel in experimental has it too:
    $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
    CONFIG_MANDATORY_FILE_LOCKING=y

J. Bruce Fields Aug. 18, 2021, 3:42 p.m. UTC | #30

On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> I’ll bite.  How about we attack this in the opposite direction: remove
> the deny write mechanism entirely.

For what it's worth, Windows has open flags that allow denying read or
write opens.  They also made their way into the NFSv4 protocol, but
knfsd enforces them only against other NFSv4 clients.  Last I checked,
Samba attempted to emulate them using flock (and there's a comment to
that effect on the flock syscall in fs/locks.c).  I don't know what Wine
does.

Pavel Shilovsky posted flags adding O_DENY* flags years ago:

	https://lwn.net/Articles/581005/

I keep thinking I should look back at those some day but will probably
never get to it.

I've no idea how Windows applications use them, though I'm told it's
common.

--b.

Eric W. Biederman Aug. 19, 2021, 1:56 p.m. UTC | #31

bfields@fieldses.org (J. Bruce Fields) writes:

> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>> I’ll bite.  How about we attack this in the opposite direction: remove
>> the deny write mechanism entirely.
>
> For what it's worth, Windows has open flags that allow denying read or
> write opens.  They also made their way into the NFSv4 protocol, but
> knfsd enforces them only against other NFSv4 clients.  Last I checked,
> Samba attempted to emulate them using flock (and there's a comment to
> that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> does.
>
> Pavel Shilovsky posted flags adding O_DENY* flags years ago:
>
> 	https://lwn.net/Articles/581005/
>
> I keep thinking I should look back at those some day but will probably
> never get to it.
>
> I've no idea how Windows applications use them, though I'm told it's
> common.

I don't know in any detail.  I just have this memory of not being able
to open or do anything with a file on windows while any application has
it open.

We limit mandatory locks to filesystems that have the proper mount flag
and files that are sgid but are not executable.  Reusing that limit we
could probably allow such a behavior in Linux without causing chaos.

Without being very strict about which files can participate I can just
imagine someone hiding their presence by not allowing other applications
the ability to write to utmp or a log file.

In the windows world where everything evolved with those kinds of
restrictions it is probably fine (although super annoying).

Eric

J. Bruce Fields Aug. 19, 2021, 2:33 p.m. UTC | #32

On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote:
> bfields@fieldses.org (J. Bruce Fields) writes:
> 
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> >> I’ll bite.  How about we attack this in the opposite direction: remove
> >> the deny write mechanism entirely.
> >
> > For what it's worth, Windows has open flags that allow denying read or
> > write opens.  They also made their way into the NFSv4 protocol, but
> > knfsd enforces them only against other NFSv4 clients.  Last I checked,
> > Samba attempted to emulate them using flock (and there's a comment to
> > that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> > does.
> >
> > Pavel Shilovsky posted flags adding O_DENY* flags years ago:
> >
> > 	https://lwn.net/Articles/581005/
> >
> > I keep thinking I should look back at those some day but will probably
> > never get to it.
> >
> > I've no idea how Windows applications use them, though I'm told it's
> > common.
> 
> I don't know in any detail.  I just have this memory of not being able
> to open or do anything with a file on windows while any application has
> it open.
> 
> We limit mandatory locks to filesystems that have the proper mount flag
> and files that are sgid but are not executable.  Reusing that limit we
> could probably allow such a behavior in Linux without causing chaos.

I'm pretty confused about how we're using the term "mandatory locking".

The locks you're thinking of are basically ordinary posix byte-range
locks which we attempt to enforce as mandatory under certain conditions
(e.g. in fs/read_write.c:rw_verify_area).  That means we have to check
them on ordinary reads and writes, which is a pain in the butt.  (And we
don't manage to do it correctly--the code just checks for the existence
of a conflicting lock before performing IO, ignoring the obvious
time-of-check/time-of-use race.)

This has nothing to do with Windows share locks which from what I
understand are whole-file locks that are only enforced against opens.

--b.

> Without being very strict about which files can participate I can just
> imagine someone hiding their presence by not allowing other applications
> the ability to write to utmp or a log file.
> 
> In the windows world where everything evolved with those kinds of
> restrictions it is probably fine (although super annoying).
> 
> Eric

Jeff Layton Aug. 19, 2021, 6:39 p.m. UTC | #33

On Tue, 2021-08-17 at 11:48 -0500, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > [0] we have mandatory locks, too. Sigh.
> > 
> > I'd love to remove that.  Perhaps we could try persuading more of the
> > distros to disable the CONFIG option first.
> 
> Yes.  The support is disabled in RHEL8.
> 
> Does anyone know the appropriate people to talk to encourage other
> distro's to encourage them to disable the CONFIG_MANDATORY_FILE_LOCKING?
> 
> Either that or we can wait until the code bit-rots, but distro's
> disabling and removing a feature on their own is the more responsible
> path.
> 
> Given how many hoops need to be jumped through to use mandatory file
> locking once it is enabled, and the fact it has never worked in
> containers makes me suspect there are no more users.
> 

I'm all for ripping it out too. It's an insane interface anyway.

I've not heard a single complaint about this being turned off in
fedora/rhel or any other distro that has this disabled.

Cheers,

Linus Torvalds Aug. 19, 2021, 7:15 p.m. UTC | #34

On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> I'm all for ripping it out too. It's an insane interface anyway.
>
> I've not heard a single complaint about this being turned off in
> fedora/rhel or any other distro that has this disabled.

I'd love to remove it, we could absolutely test it. The fact that
several major distros have it disabled makes me think it's fine.

But as always, it would be good to check Android.

The desktop distros tend to have the same tools and programs, so if
Fedora and RHEL haven't needed it for years, then it's likely stale in
Debian too (despite being enabled).

But Android tends to be very different. Does anybody know?

            Linus

Jeff Layton Aug. 19, 2021, 7:18 p.m. UTC | #35

On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote:
> On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > 
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > [0] we have mandatory locks, too. Sigh.
> > > 
> > > I'd love to remove that.  Perhaps we could try persuading more of the
> > > distros to disable the CONFIG option first.
> > 
> > Yes.  The support is disabled in RHEL8.
> 
> If it helps, it seems to be enabled on the just released debian stable:
>     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
>     CONFIG_MANDATORY_FILE_LOCKING=y
> 
> Also the new 5.13 kernel in experimental has it too:
>     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
>     CONFIG_MANDATORY_FILE_LOCKING=y

A pity. It would have been nice if they had turned it off a while ago. I
guess I should have done more outreach at the time. Sigh...

In any case, I'm still inclined toward just ripping it out at this
point. It's hard to believe that anyone really uses it.

Eric Biggers Aug. 19, 2021, 7:55 p.m. UTC | #36

On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > I'm all for ripping it out too. It's an insane interface anyway.
> >
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.
> 
> But as always, it would be good to check Android.
> 
> The desktop distros tend to have the same tools and programs, so if
> Fedora and RHEL haven't needed it for years, then it's likely stale in
> Debian too (despite being enabled).
> 
> But Android tends to be very different. Does anybody know?
> 

As far as I know, Android never uses mandatory file locking.  While
CONFIG_MANDATORY_LOCKING=y is typically set (as it's "default y" upstream),
I can't find anywhere in the Android source tree that uses the "mand" mount
option, let alone anything actually using mandatory locks.  (I'm assuming that
Documentation/filesystems/mandatory-locking.rst is up-to-date regarding what
userspace actually has to do to use it.)

- Eric

Willy Tarreau Aug. 19, 2021, 8:03 p.m. UTC | #37

On Thu, Aug 19, 2021 at 03:18:15PM -0400, Jeff Layton wrote:
> On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote:
> > On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > 
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > 
> > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > > [0] we have mandatory locks, too. Sigh.
> > > > 
> > > > I'd love to remove that.  Perhaps we could try persuading more of the
> > > > distros to disable the CONFIG option first.
> > > 
> > > Yes.  The support is disabled in RHEL8.
> > 
> > If it helps, it seems to be enabled on the just released debian stable:
> >     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
> >     CONFIG_MANDATORY_FILE_LOCKING=y
> > 
> > Also the new 5.13 kernel in experimental has it too:
> >     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
> >     CONFIG_MANDATORY_FILE_LOCKING=y
> 
> A pity. It would have been nice if they had turned it off a while ago. I
> guess I should have done more outreach at the time. Sigh...

Would it be acceptable to add a warning when MS_MANDLOCK is passed to
mount() and backport this to stable kernels in order to get reports
of any such use in a reasonably short time ? Anyway it sounds
important to at least warn about deprecation.

Willy

Jeff Layton Aug. 19, 2021, 8:18 p.m. UTC | #38

On Thu, 2021-08-19 at 12:15 -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > I'm all for ripping it out too. It's an insane interface anyway.
> > 
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.
> 
> But as always, it would be good to check Android.
> 
> The desktop distros tend to have the same tools and programs, so if
> Fedora and RHEL haven't needed it for years, then it's likely stale in
> Debian too (despite being enabled).
> 
> But Android tends to be very different. Does anybody know?
> 

Now that I think about it a little more, I actually did get one
complaint a few years ago:

Someone had upgraded from an earlier distro that supported the -o mand
mount option to a later one that had disabled it, and they had an (old)
fstab entry that specified it. They didn't actually use mandatory
locking and weren't sure why the option was set, so they removed it and
moved on.

I would feel a lot better about it if we had gotten Debian to turn it
off several years ago too, but I agree it's unlikely anyone uses this
and the risk of removing it is low.

I've spun up a patch to just rip it out. I'll do a bit of testing with
it tomorrow and then send it out.

Cheers,

Linus Torvalds Aug. 19, 2021, 8:31 p.m. UTC | #39

On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Now that I think about it a little more, I actually did get one
> complaint a few years ago:
>
> Someone had upgraded from an earlier distro that supported the -o mand
> mount option to a later one that had disabled it, and they had an (old)
> fstab entry that specified it.

Hmm. We might be able to turn the "return -EINVAL" into just a warning.

Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
already do that

        VFS: "mand" mount option not supported

warning print, but then we fail the mount.

If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
sense to turn that warning into something bigger, but then let the
mount continue - since now that "mand" flag would be purely a legacy
thing.

And yes, if we do that, we'd want the warning to be a big ugly thing,
just to make people very aware of it happening. Right now it's a
one-liner that is easy to miss, and the "oh, the mount failed" is the
thing that hopefully informs people about the fact that they need to
enable CONFIG_MANDATORY_FILE_LOCKING.

The logic being that if you can no longer enable mandatory locking in
the kernel, the current hard failure seems overly aggressive (and
might cause boot failures and inability to fix/report things when it
possibly keeps you from using the system at all).

              Linus

Jeff Layton Aug. 19, 2021, 9:43 p.m. UTC | #40

On Thu, 2021-08-19 at 13:31 -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > Now that I think about it a little more, I actually did get one
> > complaint a few years ago:
> > 
> > Someone had upgraded from an earlier distro that supported the -o mand
> > mount option to a later one that had disabled it, and they had an (old)
> > fstab entry that specified it.
> 
> Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> 
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
> 
>         VFS: "mand" mount option not supported
> 
> warning print, but then we fail the mount.
> 
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
> 
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.
> 
> The logic being that if you can no longer enable mandatory locking in
> the kernel, the current hard failure seems overly aggressive (and
> might cause boot failures and inability to fix/report things when it
> possibly keeps you from using the system at all).
> 

What sort of big, ugly warning did you have in mind?

I'm fine with that general approach though and will plan to roll that
change into the patch I'm testing.

Thanks,

Linus Torvalds Aug. 19, 2021, 10:32 p.m. UTC | #41

On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> What sort of big, ugly warning did you have in mind?

I originally thought WARN_ON_ONCE() just to get the distro automatic
error handling involved, but it would probably be a big problem for
the people who end up having panic-on-warn or something.

So probably just a "make it a big box" thing that stands out, kind of
what lockdep etc does with

        pr_warn("======...====\n");

around the messages..

I don't know if distros have some pattern we could use that would end
up being something that gets reported to the user?

              Linus

Matthew Wilcox Aug. 20, 2021, 2:10 a.m. UTC | #42

On Thu, Aug 19, 2021 at 01:31:35PM -0700, Linus Torvalds wrote:
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
> 
>         VFS: "mand" mount option not supported
> 
> warning print, but then we fail the mount.
> 
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
> 
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.

When I ripped out the NFS "intr" mount option fourteen years ago,
I just turned it into a noop (commit 150030b78a45).  It has greatly
amused me every article I've read that's been written since then
that recommends using it.  Just shows how much tribal knowledge we
have.

I think this is a little different, though; I was essetially making the
*wanted* behaviour of 'intr' the default (and disabling the unwanted
behaviour).  With 'mand', we're losing the behaviour entirely, and it's
plausible that someone might care.  Maybe something more like the old
sys_bdflush implementation?

        if (msg_count < 5) {
                msg_count++;
                printk(KERN_INFO
                        "warning: process `%s' used the obsolete bdflush"
                        " system call\n", current->comm);
                printk(KERN_INFO "Fix your initscripts?\n");
        }

NeilBrown Aug. 20, 2021, 3:44 a.m. UTC | #43

On Thu, 19 Aug 2021, J. Bruce Fields wrote:
> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > I’ll bite.  How about we attack this in the opposite direction: remove
> > the deny write mechanism entirely.
> 
> For what it's worth, Windows has open flags that allow denying read or
> write opens.  They also made their way into the NFSv4 protocol, but
> knfsd enforces them only against other NFSv4 clients.  Last I checked,
> Samba attempted to emulate them using flock (and there's a comment to
> that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> does.
> 
> Pavel Shilovsky posted flags adding O_DENY* flags years ago:
> 
> 	https://lwn.net/Articles/581005/
> 
> I keep thinking I should look back at those some day but will probably
> never get to it.

O_DENYREAD is an insane flag.  If a process reads a file that some other
process is working on, then the only which could be hurt is the reader.
So allowing a process to ask for the open to fail if someone is writing
might make sense.  Insisting that all opens fail does not.
Any code wanting O_DENYREAD *should* use advisory locking, and any code
wanting to know about read denial should too.

O_DENYWRITE can make sense.  When combined with a O_RDONLY open it is
effectively what happens when you exec a program.  You can no longer
open that file for write - you get ETXTBSY.

It would be nice to be able to combine O_DENYWRITE with O_RDWR.  This
combination is exactly what the kernel *should* do for swap files.
Unfortunately it doesn't.  The "i_writecount" field that is used to
trigger ETXTBSY for executables doesn't work for the single-writer
model.

I'm not sure about O_DENYDELETE.  It is a lock on the name.  Unix has
traditionally used lock-files to lock a name.  The functionality makes
sense for processes with write-access to the directory...

NeilBrown

Amir Goldstein Aug. 20, 2021, 6:36 a.m. UTC | #44

On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Now that I think about it a little more, I actually did get one
> > complaint a few years ago:
> >
> > Someone had upgraded from an earlier distro that supported the -o mand
> > mount option to a later one that had disabled it, and they had an (old)
> > fstab entry that specified it.
>
> Hmm. We might be able to turn the "return -EINVAL" into just a warning.
>
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
>
>         VFS: "mand" mount option not supported
>
> warning print, but then we fail the mount.
>
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
>
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.
>
> The logic being that if you can no longer enable mandatory locking in
> the kernel, the current hard failure seems overly aggressive (and
> might cause boot failures and inability to fix/report things when it
> possibly keeps you from using the system at all).
>

Allow me to play the devil's advocate here - if fstab has '-o mand' we have
no way of knowing if any application is relying on '-o mand' and adding
more !!!!! to the warning is mostly good for clearing our conscious ;-)

Not saying we cannot resort to that and not saying there is an easy
solution, but there is one more solution to consider - force rdonly mount.
Yes, it could break some systems and possibly fail boot, but then again
an ext4 fs can already become rdonly due to errors, so it wouldn't
be the first time that sysadmins/users run into this behavior.

Thanks,
Amir.

Amir Goldstein Aug. 20, 2021, 7:14 a.m. UTC | #45

On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > Now that I think about it a little more, I actually did get one
> > > complaint a few years ago:
> > >
> > > Someone had upgraded from an earlier distro that supported the -o mand
> > > mount option to a later one that had disabled it, and they had an (old)
> > > fstab entry that specified it.
> >
> > Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> >
> > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> > already do that
> >
> >         VFS: "mand" mount option not supported
> >
> > warning print, but then we fail the mount.
> >
> > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> > sense to turn that warning into something bigger, but then let the
> > mount continue - since now that "mand" flag would be purely a legacy
> > thing.
> >
> > And yes, if we do that, we'd want the warning to be a big ugly thing,
> > just to make people very aware of it happening. Right now it's a
> > one-liner that is easy to miss, and the "oh, the mount failed" is the
> > thing that hopefully informs people about the fact that they need to
> > enable CONFIG_MANDATORY_FILE_LOCKING.
> >
> > The logic being that if you can no longer enable mandatory locking in
> > the kernel, the current hard failure seems overly aggressive (and
> > might cause boot failures and inability to fix/report things when it
> > possibly keeps you from using the system at all).
> >
>
> Allow me to play the devil's advocate here - if fstab has '-o mand' we have
> no way of knowing if any application is relying on '-o mand' and adding
> more !!!!! to the warning is mostly good for clearing our conscious ;-)
>
> Not saying we cannot resort to that and not saying there is an easy
> solution, but there is one more solution to consider - force rdonly mount.
> Yes, it could break some systems and possibly fail boot, but then again
> an ext4 fs can already become rdonly due to errors, so it wouldn't
> be the first time that sysadmins/users run into this behavior.
>

Adding an anecdote - this week I got a report from field support
engineers about failure to assemble a RAID0 array, which led to this
warning that *requires* user intervention, in the worse case for boot
device it requires changing kernel boot params:

md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting
md/raid0: please set raid.default_layout to 1 or 2

c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion.

There is no way I would have gotten this report from the field if a failure
was not involved...

The rdonly mount is only needed to get the attention of support people
to look the the kernel logs and find the warning - at this point, not too
many !!!!! are needed ;-)

So we could make 'mand' an alias to 'ro' and print a warning that says:
"'mand' mount option is deprecated, please fix your init scripts.
For caution, your filesystem was mounted rdonly, feel free to remount
rw and move on..."

Thanks,
Amir.

David Laight Aug. 20, 2021, 8:25 a.m. UTC | #46

From: NeilBrown
> Sent: 20 August 2021 04:45
...
> O_DENYREAD is an insane flag.  If a process reads a file that some other
> process is working on, then the only which could be hurt is the reader.
> So allowing a process to ask for the open to fail if someone is writing
> might make sense.  Insisting that all opens fail does not.
> Any code wanting O_DENYREAD *should* use advisory locking, and any code
> wanting to know about read denial should too.

It might make sense if O_DENYREAD | O_DENYWRITE | O_RDWR are all set.
That would be what O_EXCL ought to mean for a normal file.
So would be useful for a program that wants to update a config file.

...
> It would be nice to be able to combine O_DENYWRITE with O_RDWR.  This
> combination is exactly what the kernel *should* do for swap files.

I suspect that is a common usage - eg for updating a file that contains
a log file sequence number.

...
> I'm not sure about O_DENYDELETE.  It is a lock on the name.  Unix has
> traditionally used lock-files to lock a name.  The functionality makes
> sense for processes with write-access to the directory...

I'm not sure it makes any sense on filesystems that use inode numbers.
Which name would you protect, and how would you manage to do the test.
On windows O_DENYDELETE is pretty much the default.
Which is why software updates are such a PITA.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

David Laight Aug. 20, 2021, 8:30 a.m. UTC | #47

From: Linus Torvalds
> Sent: 19 August 2021 23:33
> 
> On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > What sort of big, ugly warning did you have in mind?
> 
> I originally thought WARN_ON_ONCE() just to get the distro automatic
> error handling involved, but it would probably be a big problem for
> the people who end up having panic-on-warn or something.

Even panic-on-oops is a PITA.
Took us weeks to realise that a customer system that was randomly
rebooting was 'just' having a boring NULL pointer access.
 
> So probably just a "make it a big box" thing that stands out, kind of
> what lockdep etc does with
> 
>         pr_warn("======...====\n");
> 
> around the messages..
> 
> I don't know if distros have some pattern we could use that would end
> up being something that gets reported to the user?

Will users even see it?
A lot of recent distro installs try very hard to hide all the kernel
messages.
OTOH I guess '-o mand' is unlikely to be set on any of those systems.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Jeff Layton Aug. 20, 2021, 12:27 p.m. UTC | #48

On Fri, 2021-08-20 at 10:14 +0300, Amir Goldstein wrote:
> On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > 
> > > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > Now that I think about it a little more, I actually did get one
> > > > complaint a few years ago:
> > > > 
> > > > Someone had upgraded from an earlier distro that supported the -o mand
> > > > mount option to a later one that had disabled it, and they had an (old)
> > > > fstab entry that specified it.
> > > 
> > > Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> > > 
> > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> > > already do that
> > > 
> > >         VFS: "mand" mount option not supported
> > > 
> > > warning print, but then we fail the mount.
> > > 
> > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> > > sense to turn that warning into something bigger, but then let the
> > > mount continue - since now that "mand" flag would be purely a legacy
> > > thing.
> > > 
> > > And yes, if we do that, we'd want the warning to be a big ugly thing,
> > > just to make people very aware of it happening. Right now it's a
> > > one-liner that is easy to miss, and the "oh, the mount failed" is the
> > > thing that hopefully informs people about the fact that they need to
> > > enable CONFIG_MANDATORY_FILE_LOCKING.
> > > 
> > > The logic being that if you can no longer enable mandatory locking in
> > > the kernel, the current hard failure seems overly aggressive (and
> > > might cause boot failures and inability to fix/report things when it
> > > possibly keeps you from using the system at all).
> > > 
> > 
> > Allow me to play the devil's advocate here - if fstab has '-o mand' we have
> > no way of knowing if any application is relying on '-o mand' and adding
> > more !!!!! to the warning is mostly good for clearing our conscious ;-)
> > 
> > Not saying we cannot resort to that and not saying there is an easy
> > solution, but there is one more solution to consider - force rdonly mount.
> > Yes, it could break some systems and possibly fail boot, but then again
> > an ext4 fs can already become rdonly due to errors, so it wouldn't
> > be the first time that sysadmins/users run into this behavior.
> > 
> 
> Adding an anecdote - this week I got a report from field support
> engineers about failure to assemble a RAID0 array, which led to this
> warning that *requires* user intervention, in the worse case for boot
> device it requires changing kernel boot params:
> 
> md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting
> md/raid0: please set raid.default_layout to 1 or 2
> 
> c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion.
> 
> There is no way I would have gotten this report from the field if a failure
> was not involved...
> 
> The rdonly mount is only needed to get the attention of support people
> to look the the kernel logs and find the warning - at this point, not too
> many !!!!! are needed ;-)
> 
> So we could make 'mand' an alias to 'ro' and print a warning that says:
> "'mand' mount option is deprecated, please fix your init scripts.
> For caution, your filesystem was mounted rdonly, feel free to remount
> rw and move on..."

That is a possibility, but I'm not sure it's any better than just
failing the mount. We could also just keep the code around and throw a
big, scary warning about its impending removal for a few releases before
ripping it out completely (like Willy T. was suggesting).

I'm fine with any of these approaches if the consensus is that it's too
risky to just remove it. OTOH, I've yet to ever hear of any application
that uses this feature, even in a historical sense. You have to jump
through so many hoops that nothing can rely on having it available.

Willy Tarreau Aug. 20, 2021, 12:38 p.m. UTC | #49

On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote:
> I'm fine with any of these approaches if the consensus is that it's too
> risky to just remove it. OTOH, I've yet to ever hear of any application
> that uses this feature, even in a historical sense.

Honestly, I agree. Some have fun of me because I'm often using old
stuff, but I don't even remember having used an application that
made use of mandatory locking. I remember having enabled it myself in
my kernels long ago after discovering its existence in the man pages,
just to test it. It doesn't rule out the possibility that it exists
somewhere though, but I think that the immediate removal combined
with the big fat warning in previous branches should be largely
enough to avoid the last minute surprise.

Willy

Jeff Layton Aug. 20, 2021, 12:54 p.m. UTC | #50

On Thu, 2021-08-19 at 10:33 -0400, J. Bruce Fields wrote:
> On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote:
> > bfields@fieldses.org (J. Bruce Fields) writes:
> > 
> > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > I’ll bite.  How about we attack this in the opposite direction: remove
> > > > the deny write mechanism entirely.
> > > 
> > > For what it's worth, Windows has open flags that allow denying read or
> > > write opens.  They also made their way into the NFSv4 protocol, but
> > > knfsd enforces them only against other NFSv4 clients.  Last I checked,
> > > Samba attempted to emulate them using flock (and there's a comment to
> > > that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> > > does.
> > > 
> > > Pavel Shilovsky posted flags adding O_DENY* flags years ago:
> > > 
> > > 	https://lwn.net/Articles/581005/
> > > 
> > > I keep thinking I should look back at those some day but will probably
> > > never get to it.
> > > 
> > > I've no idea how Windows applications use them, though I'm told it's
> > > common.
> > 
> > I don't know in any detail.  I just have this memory of not being able
> > to open or do anything with a file on windows while any application has
> > it open.
> > 
> > We limit mandatory locks to filesystems that have the proper mount flag
> > and files that are sgid but are not executable.  Reusing that limit we
> > could probably allow such a behavior in Linux without causing chaos.
> 
> I'm pretty confused about how we're using the term "mandatory locking".
> 
> The locks you're thinking of are basically ordinary posix byte-range
> locks which we attempt to enforce as mandatory under certain conditions
> (e.g. in fs/read_write.c:rw_verify_area).  That means we have to check
> them on ordinary reads and writes, which is a pain in the butt.  (And we
> don't manage to do it correctly--the code just checks for the existence
> of a conflicting lock before performing IO, ignoring the obvious
> time-of-check/time-of-use race.)
> 

Yeah, the locks we're talking about are the locks described in:

    Documentation/filesystems/mandatory-locking.rst

They've always been racy. You have to mount the fs with '-o mand' and
set a special mode on the file (setgid bit set, with group execute bit
cleared). It's a crazypants interface.

> This has nothing to do with Windows share locks which from what I
> understand are whole-file locks that are only enforced against opens.
> 

Yep. Those are different.

Confusingly, there is also LOCK_MAND|LOCK_READ|LOCK_WRITE for flock(),
which are purported to be for emulating Windows share modes. They aren't
really mandatory though.

> --b.
> 
> > Without being very strict about which files can participate I can just
> > imagine someone hiding their presence by not allowing other applications
> > the ability to write to utmp or a log file.
> > 
> > In the windows world where everything evolved with those kinds of
> > restrictions it is probably fine (although super annoying).
> > 
> > Eric

Jeff Layton Aug. 20, 2021, 1:03 p.m. UTC | #51

On Fri, 2021-08-20 at 14:38 +0200, Willy Tarreau wrote:
> On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote:
> > I'm fine with any of these approaches if the consensus is that it's too
> > risky to just remove it. OTOH, I've yet to ever hear of any application
> > that uses this feature, even in a historical sense.
> 
> Honestly, I agree. Some have fun of me because I'm often using old
> stuff, but I don't even remember having used an application that
> made use of mandatory locking. I remember having enabled it myself in
> my kernels long ago after discovering its existence in the man pages,
> just to test it. It doesn't rule out the possibility that it exists
> somewhere though, but I think that the immediate removal combined
> with the big fat warning in previous branches should be largely
> enough to avoid the last minute surprise.
> 

Good point. It wouldn't hurt to push such a warning into stable kernels
at the same time. There always is a lag when we do something like this
before some downstream user notices.

Willy Tarreau Aug. 20, 2021, 1:11 p.m. UTC | #52

On Fri, Aug 20, 2021 at 09:03:16AM -0400, Jeff Layton wrote:
> Good point. It wouldn't hurt to push such a warning into stable kernels
> at the same time. There always is a lag when we do something like this
> before some downstream user notices.

Yes, that's why I proposed this. A warning can be backported into
stable without big consequences except warning future victims...

Willy

Steven Rostedt Aug. 20, 2021, 1:43 p.m. UTC | #53

On Thu, 19 Aug 2021 15:32:31 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I originally thought WARN_ON_ONCE() just to get the distro automatic
> error handling involved, but it would probably be a big problem for
> the people who end up having panic-on-warn or something.
> 
> So probably just a "make it a big box" thing that stands out, kind of
> what lockdep etc does with
> 
>         pr_warn("======...====\n");
> 
> around the messages..
> 
> I don't know if distros have some pattern we could use that would end
> up being something that gets reported to the user?

People have started using my trace-printk notice message, that seems to
be big enough to get noticed.


 **********************************************************
 **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
 **                                                      **
 ** trace_printk() being used. Allocating extra memory.  **
 **                                                      **
 ** This means that this is a DEBUG kernel and it is     **
 ** unsafe for production use.                           **
 **                                                      **
 ** If you see this message and you are not debugging    **
 ** the kernel, report this immediately to your vendor!  **
 **                                                      **
 **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
 **********************************************************

There's been some talk about making that a more "generic" warning
message too.

-- Steve

Linus Torvalds Aug. 20, 2021, 4:06 p.m. UTC | #54

On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 19 Aug 2021 15:32:31 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > I don't know if distros have some pattern we could use that would end
> > up being something that gets reported to the user?
>
> People have started using my trace-printk notice message, that seems to
> be big enough to get noticed.

Well, I think people who use ftrace are m,ore likely to look at kernel
messages than most...

So what would be more interesting is if there's some distro support
for showing kernel notifications..

I see new notifications for calendar events, for devices that got
mounted, for a lot of things - so I'm really wondering if somebody
already perhaps had something for specially formatted kernel
messages..

               Linus

Kees Cook Aug. 20, 2021, 4:30 p.m. UTC | #55

On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > I'm all for ripping it out too. It's an insane interface anyway.
> >
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.

FWIW, it is now disabled in Ubuntu too:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b

> But as always, it would be good to check Android.

It looks like it's enabled (checking the Pixel 4 kernel image), but it's
not specifically mentioned in any of the build configs that are used to
construct the image, so I think this is just catching the "default y". I
expect it'd be fine to turn this off.

I will ask around to see if it's actually used.

H. Peter Anvin Aug. 20, 2021, 7:17 p.m. UTC | #56

I thought the main user was Samba and/or otherwise providing file service for M$ systems?

On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote:
>On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
>> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
>> >
>> > I'm all for ripping it out too. It's an insane interface anyway.
>> >
>> > I've not heard a single complaint about this being turned off in
>> > fedora/rhel or any other distro that has this disabled.
>> 
>> I'd love to remove it, we could absolutely test it. The fact that
>> several major distros have it disabled makes me think it's fine.
>
>FWIW, it is now disabled in Ubuntu too:
>
>https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b
>
>> But as always, it would be good to check Android.
>
>It looks like it's enabled (checking the Pixel 4 kernel image), but it's
>not specifically mentioned in any of the build configs that are used to
>construct the image, so I think this is just catching the "default y". I
>expect it'd be fine to turn this off.
>
>I will ask around to see if it's actually used.
>

Jeff Layton Aug. 20, 2021, 9:29 p.m. UTC | #57

No, Windows has deny-mode locking at open time, but the kernel's
mandatory locks are enforced during read/write (which is why they are
such a pain). Samba will not miss these at all.

If we want something to provide windows-like semantics, we'd probably
want to start with something like Pavel Shilovsky's O_DENY_* patches.

-- Jeff

On Fri, 2021-08-20 at 12:17 -0700, H. Peter Anvin wrote:
> I thought the main user was Samba and/or otherwise providing file service for M$ systems?
> 
> On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> > > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > I'm all for ripping it out too. It's an insane interface anyway.
> > > > 
> > > > I've not heard a single complaint about this being turned off in
> > > > fedora/rhel or any other distro that has this disabled.
> > > 
> > > I'd love to remove it, we could absolutely test it. The fact that
> > > several major distros have it disabled makes me think it's fine.
> > 
> > FWIW, it is now disabled in Ubuntu too:
> > 
> > https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b
> > 
> > > But as always, it would be good to check Android.
> > 
> > It looks like it's enabled (checking the Pixel 4 kernel image), but it's
> > not specifically mentioned in any of the build configs that are used to
> > construct the image, so I think this is just catching the "default y". I
> > expect it'd be fine to turn this off.
> > 
> > I will ask around to see if it's actually used.
> > 
>

Matthew Wilcox Aug. 20, 2021, 10:31 p.m. UTC | #58

On Fri, Aug 20, 2021 at 12:17:49PM -0700, H. Peter Anvin wrote:
> I thought the main user was Samba and/or otherwise providing file service for M$ systems?

When I asked around about this in ~2001, the only example anyoe was able
to come up with was some database that I no longer remember the name of.

Jeff Layton Aug. 21, 2021, 12:45 p.m. UTC | #59

On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote:
> No, Windows has deny-mode locking at open time, but the kernel's
> mandatory locks are enforced during read/write (which is why they are
> such a pain). Samba will not miss these at all.
> 
> If we want something to provide windows-like semantics, we'd probably
> want to start with something like Pavel Shilovsky's O_DENY_* patches.
> 
> -- Jeff
> 

Doh! It completely slipped my mind about byte-range locks on windows...

Those are mandatory and they do block read and write activity to the
ranges locked. They have weird semantics vs. POSIX locks (they stack
instead of splitting/merging, etc.).

Samba emulates these with (advisory) POSIX locks in most cases. Using
mandatory locks is probably possible, but I think it would add more
potential for deadlock and security issues.

Geert Uytterhoeven Aug. 23, 2021, 7:55 a.m. UTC | #60

On Fri, Aug 20, 2021 at 10:30 AM David Laight <David.Laight@aculab.com> wrote:
> From: Linus Torvalds
> > Sent: 19 August 2021 23:33
> >
> > On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > What sort of big, ugly warning did you have in mind?
> >
> > I originally thought WARN_ON_ONCE() just to get the distro automatic
> > error handling involved, but it would probably be a big problem for
> > the people who end up having panic-on-warn or something.
>
> Even panic-on-oops is a PITA.
> Took us weeks to realise that a customer system that was randomly
> rebooting was 'just' having a boring NULL pointer access.
>
> > So probably just a "make it a big box" thing that stands out, kind of
> > what lockdep etc does with
> >
> >         pr_warn("======...====\n");
> >
> > around the messages..

Do we really need more of these?
They take time to print (especially on serial
consoles) and increase kernel size.

What's wrong with using an appropriate KERN_*, and letting userspace
make sure the admin/user will see the message (see below)?

> >
> > I don't know if distros have some pattern we could use that would end
> > up being something that gets reported to the user?
>
> Will users even see it?
> A lot of recent distro installs try very hard to hide all the kernel
> messages.

Exactly.  E.g. Ubuntu doesn't show any kernel output during normal
operation.

On Fri, Aug 20, 2021 at 6:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Thu, 19 Aug 2021 15:32:31 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > I don't know if distros have some pattern we could use that would end
> > > up being something that gets reported to the user?

> So what would be more interesting is if there's some distro support
> for showing kernel notifications..
>
> I see new notifications for calendar events, for devices that got
> mounted, for a lot of things - so I'm really wondering if somebody
> already perhaps had something for specially formatted kernel
> messages..

Isn't that what the old syslog and the new systemd are supposed to
handle in userspace?

Gr{oetje,eeting}s,

                        Geert

David Laight Aug. 23, 2021, 8:14 a.m. UTC | #61

From: Geert Uytterhoeven
> Sent: 23 August 2021 08:56
...
> Exactly.  E.g. Ubuntu doesn't show any kernel output during normal
> operation.

Current ubuntu (x86) is getting to be a PITA.
It even runs the graphical login on tty0 - which is where
the kernel messages would end up.
It is almost impossible to get an actual VGA console.

I need to find a different distro that has less bloat in it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

J. Bruce Fields Aug. 23, 2021, 10:15 p.m. UTC | #62

On Sat, Aug 21, 2021 at 08:45:54AM -0400, Jeff Layton wrote:
> On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote:
> > No, Windows has deny-mode locking at open time, but the kernel's
> > mandatory locks are enforced during read/write (which is why they are
> > such a pain). Samba will not miss these at all.
> > 
> > If we want something to provide windows-like semantics, we'd probably
> > want to start with something like Pavel Shilovsky's O_DENY_* patches.
> > 
> > -- Jeff
> > 
> 
> Doh! It completely slipped my mind about byte-range locks on windows...
> 
> Those are mandatory and they do block read and write activity to the
> ranges locked. They have weird semantics vs. POSIX locks (they stack
> instead of splitting/merging, etc.).
> 
> Samba emulates these with (advisory) POSIX locks in most cases. Using
> mandatory locks is probably possible, but I think it would add more
> potential for deadlock and security issues.

Right, so Windows byte-range locks are different from Windows open deny
modes.

But even if somebody wanted to implement them, I doubt they'd start with
the mandatory locking code you're removing here, so I think they're
irrelevant to this discussion.

--b.

Andy Lutomirski Aug. 26, 2021, 5:48 p.m. UTC | #63

On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
> 
> I think that would be ok, except I can see somebody relying on it.
> 
> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.

Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.

Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:

fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
write(fd, some stuff);

<--- problem is here

close(fd);
execve("some exefile");

Another thread does:

fork();
execve("something else");

In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:

https://github.com/golang/go/issues/22315

I propose we get rid of deny_write_access() completely to solve this.

Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.

(OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)

> 
> But you are right that we have removed parts of it over time (no more
> MAP_DENYWRITE, no more uselib()) so that what we have today is a
> fairly weak form of what we used to do.
> 
> And nobody really complained when we weakened it, so maybe removing it
> entirely might be acceptable.
> 
>               Linus
>

David Hildenbrand Aug. 26, 2021, 9:47 p.m. UTC | #64

On 26.08.21 19:48, Andy Lutomirski wrote:
> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>
>> I think that would be ok, except I can see somebody relying on it.
>>
>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
> 
> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
> 
> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
> 
> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
> write(fd, some stuff);
> 
> <--- problem is here
> 
> close(fd);
> execve("some exefile");
> 
> Another thread does:
> 
> fork();
> execve("something else");
> 
> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
> 
> https://github.com/golang/go/issues/22315
> 
> I propose we get rid of deny_write_access() completely to solve this.
> 
> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
> 
> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
> 

It's not like this issue is new (^2017) or relevant in practice. So no 
need to hurry IMHO. One step at a time: it might make perfect sense to 
remove ETXTBSY, but we have to be careful to not break other user space 
that actually cares about the current behavior in practice.

Eric W. Biederman Aug. 26, 2021, 10:13 p.m. UTC | #65

David Hildenbrand <david@redhat.com> writes:

> On 26.08.21 19:48, Andy Lutomirski wrote:
>> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>>
>>> I think that would be ok, except I can see somebody relying on it.
>>>
>>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
>>
>> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
>>
>> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
>>
>> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
>> write(fd, some stuff);
>>
>> <--- problem is here
>>
>> close(fd);
>> execve("some exefile");
>>
>> Another thread does:
>>
>> fork();
>> execve("something else");
>>
>> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
>>
>> https://github.com/golang/go/issues/22315
>>
>> I propose we get rid of deny_write_access() completely to solve this.
>>
>> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
>>
>> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
>>
>
> It's not like this issue is new (^2017) or relevant in practice. So no
> need to hurry IMHO. One step at a time: it might make perfect sense to
> remove ETXTBSY, but we have to be careful to not break other user
> space that actually cares about the current behavior in practice.

It is an old enough issue that I agree there is no need to hurry.

I also ran into this issue not too long ago when I refactored the
usermode_driver code.  My challenge was not being in userspace
the delayed fput was not happening in my kernel thread.  Which meant
that writing the file, then closing the file, then execing the file
consistently reported -ETXTBSY.

The kernel code wound up doing:
	/* Flush delayed fput so exec can open the file read-only */
	flush_delayed_fput();
	task_work_run();

As I read the code the delay for userspace file descriptors is
always done with task_work_add, so userspace should not hit
that kind of silliness, and should be able to actually close
the file descriptor before the exec.


On the flip side, I don't know how anything can depend upon getting an
-ETXTBSY.  So I don't think there is any real risk of breaking userspace
if we remove it.

Eric

David Laight Aug. 27, 2021, 8:22 a.m. UTC | #66

From: Eric W. Biederman
> Sent: 26 August 2021 23:14
...
> I also ran into this issue not too long ago when I refactored the
> usermode_driver code.  My challenge was not being in userspace
> the delayed fput was not happening in my kernel thread.  Which meant
> that writing the file, then closing the file, then execing the file
> consistently reported -ETXTBSY.
> 
> The kernel code wound up doing:
> 	/* Flush delayed fput so exec can open the file read-only */
> 	flush_delayed_fput();
> 	task_work_run();
> 
> As I read the code the delay for userspace file descriptors is
> always done with task_work_add, so userspace should not hit
> that kind of silliness, and should be able to actually close
> the file descriptor before the exec.

If task_work_add ends up adding it to a task that is already
running on a different cpu, and that cpu takes a hardware
interrupt that takes some time and/or schedules the softint
code to run immediately the hardware interrupt completes
then it may well be possible for userspace to have 'issues'.

Any flags associated with O_DENY_WRITE would need to be cleared
synchronously in the close() rather then in any delayed fput().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Christian Brauner Aug. 27, 2021, 10:18 a.m. UTC | #67

On Thu, Aug 26, 2021 at 11:47:07PM +0200, David Hildenbrand wrote:
> On 26.08.21 19:48, Andy Lutomirski wrote:
> > On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
> > > On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > 
> > > > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
> > > 
> > > I think that would be ok, except I can see somebody relying on it.
> > > 
> > > It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
> > 
> > Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
> > 
> > Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
> > 
> > fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
> > write(fd, some stuff);
> > 
> > <--- problem is here
> > 
> > close(fd);
> > execve("some exefile");
> > 
> > Another thread does:
> > 
> > fork();
> > execve("something else");
> > 
> > In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
> > 
> > https://github.com/golang/go/issues/22315
> > 
> > I propose we get rid of deny_write_access() completely to solve this.
> > 
> > Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
> > 
> > (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
> > 
> 
> It's not like this issue is new (^2017) or relevant in practice. So no need
> to hurry IMHO. One step at a time: it might make perfect sense to remove
> ETXTBSY, but we have to be careful to not break other user space that
> actually cares about the current behavior in practice.

I agree. As I at least tried to show, removing write-protection can make
some exploits easier. I'm all for trying to remove this if it simplifies
things but for sure this shouldn't be part of this patchset and we
should be careful about it.

The removal of a (misguided or only partially functioning) protection
mechanism doesn't introduce but removes a failure point.
And I don't think removal and addition of a failure point usually have
the same consequences. Introducing a new failure point will often mean
userspace quickly detects regressions. Such regressions are pretty
common due to security fixes we introduce. Recent examples include [1].
Right after this was merged the regression was reported.

But when allowing behavior that used to fail like ETXTBSY it can be
difficult for userspace to detect such regressions. The reason for that
is quite often that userspace applications don't tend to do something
that they know upfront will fail. Attackers however might.

[1]: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")

Christian

Eric W. Biederman Aug. 27, 2021, 3:58 p.m. UTC | #68

David Laight <David.Laight@ACULAB.COM> writes:

> From: Eric W. Biederman
>> Sent: 26 August 2021 23:14
> ...
>> I also ran into this issue not too long ago when I refactored the
>> usermode_driver code.  My challenge was not being in userspace
>> the delayed fput was not happening in my kernel thread.  Which meant
>> that writing the file, then closing the file, then execing the file
>> consistently reported -ETXTBSY.
>> 
>> The kernel code wound up doing:
>> 	/* Flush delayed fput so exec can open the file read-only */
>> 	flush_delayed_fput();
>> 	task_work_run();
>> 
>> As I read the code the delay for userspace file descriptors is
>> always done with task_work_add, so userspace should not hit
>> that kind of silliness, and should be able to actually close
>> the file descriptor before the exec.
>
> If task_work_add ends up adding it to a task that is already
> running on a different cpu, and that cpu takes a hardware
> interrupt that takes some time and/or schedules the softint
> code to run immediately the hardware interrupt completes
> then it may well be possible for userspace to have 'issues'.

It it task_work_add(current).  Which punts the work to the return to
userspace.

> Any flags associated with O_DENY_WRITE would need to be cleared
> synchronously in the close() rather then in any delayed fput().

Eric

David Hildenbrand Sept. 1, 2021, 8:28 a.m. UTC | #69

On 27.08.21 00:13, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 26.08.21 19:48, Andy Lutomirski wrote:
>>> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>>>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>
>>>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>>>
>>>> I think that would be ok, except I can see somebody relying on it.
>>>>
>>>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
>>>
>>> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
>>>
>>> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
>>>
>>> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
>>> write(fd, some stuff);
>>>
>>> <--- problem is here
>>>
>>> close(fd);
>>> execve("some exefile");
>>>
>>> Another thread does:
>>>
>>> fork();
>>> execve("something else");
>>>
>>> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
>>>
>>> https://github.com/golang/go/issues/22315
>>>
>>> I propose we get rid of deny_write_access() completely to solve this.
>>>
>>> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
>>>
>>> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
>>>
>>
>> It's not like this issue is new (^2017) or relevant in practice. So no
>> need to hurry IMHO. One step at a time: it might make perfect sense to
>> remove ETXTBSY, but we have to be careful to not break other user
>> space that actually cares about the current behavior in practice.
> 
> It is an old enough issue that I agree there is no need to hurry.
> 
> I also ran into this issue not too long ago when I refactored the
> usermode_driver code.  My challenge was not being in userspace
> the delayed fput was not happening in my kernel thread.  Which meant
> that writing the file, then closing the file, then execing the file
> consistently reported -ETXTBSY.
> 
> The kernel code wound up doing:
> 	/* Flush delayed fput so exec can open the file read-only */
> 	flush_delayed_fput();
> 	task_work_run();
> 
> As I read the code the delay for userspace file descriptors is
> always done with task_work_add, so userspace should not hit
> that kind of silliness, and should be able to actually close
> the file descriptor before the exec.
> 
> 
> On the flip side, I don't know how anything can depend upon getting an
> -ETXTBSY.  So I don't think there is any real risk of breaking userspace
> if we remove it.

At least in LTP, we have two test cases testing exactly that behavior:

testcases/kernel/syscalls/creat/creat07.c
testcases/kernel/syscalls/execve/execve04.c

[v1,0/7] Remove in-tree usage of MAP_DENYWRITE

Message

Comments