Message ID | 20210812084348.6521-1-david@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | Remove in-tree usage of MAP_DENYWRITE | expand |
* David Hildenbrand: > There are some (minor) user-visible changes with this series: > 1. We no longer deny write access to shared libaries loaded via legacy > uselib(); this behavior matches modern user space e.g., via dlopen(). > 2. We no longer deny write access to the elf interpreter after exec > completed, treating it just like shared libraries (which it often is). We have a persistent issue with people using cp (or similar tools) to replace system libraries. Since the file is truncated first, all relocations and global data are replaced by file contents, result in difficult-to-diagnose crashes. It would be nice if we had a way to prevent this mistake. It doesn't have to be MAP_DENYWRITE or MAP_COPY. It could be something completely new, like an option that turns every future access beyond the truncation point into a signal (rather than getting bad data or bad code and crashing much later). I don't know how many invalid copy operations are currently thwarted by the current program interpreter restriction. I doubt that lifting the restriction matters. > 3. We always deny write access to the file linked via /proc/pid/exe: > sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file > cannot be denied, and write access to the file will remain denied > until the link is effectivel gone (exec, termination, > PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. > > I was wondering if we really care about permanently disabling write access > to the executable, or if it would be good enough to just disable write > access while loading the new executable during exec; but I don't know > the history of that -- and it somewhat makes sense to deny write access > at least to the main executable. With modern user space -- dlopen() -- we > can effectively modify the content of shared libraries while being used. Is there a difference between ET_DYN and ET_EXEC executables? Thanks, Florian
On 12.08.21 14:20, Florian Weimer wrote: > * David Hildenbrand: > >> There are some (minor) user-visible changes with this series: >> 1. We no longer deny write access to shared libaries loaded via legacy >> uselib(); this behavior matches modern user space e.g., via dlopen(). >> 2. We no longer deny write access to the elf interpreter after exec >> completed, treating it just like shared libraries (which it often is). > > We have a persistent issue with people using cp (or similar tools) to > replace system libraries. Since the file is truncated first, all > relocations and global data are replaced by file contents, result in > difficult-to-diagnose crashes. It would be nice if we had a way to > prevent this mistake. It doesn't have to be MAP_DENYWRITE or MAP_COPY. > It could be something completely new, like an option that turns every > future access beyond the truncation point into a signal (rather than > getting bad data or bad code and crashing much later). > > I don't know how many invalid copy operations are currently thwarted by > the current program interpreter restriction. I doubt that lifting the > restriction matters. > >> 3. We always deny write access to the file linked via /proc/pid/exe: >> sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file >> cannot be denied, and write access to the file will remain denied >> until the link is effectivel gone (exec, termination, >> PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. >> >> I was wondering if we really care about permanently disabling write access >> to the executable, or if it would be good enough to just disable write >> access while loading the new executable during exec; but I don't know >> the history of that -- and it somewhat makes sense to deny write access >> at least to the main executable. With modern user space -- dlopen() -- we >> can effectively modify the content of shared libraries while being used. > > Is there a difference between ET_DYN and ET_EXEC executables? No, I don't think so. When exec'ing, the main executable will see a deny_write_access(file); AFAIKT, that can either be ET_DYN or ET_EXEC.
Florian Weimer <fweimer@redhat.com> writes: > * David Hildenbrand: > >> There are some (minor) user-visible changes with this series: >> 1. We no longer deny write access to shared libaries loaded via legacy >> uselib(); this behavior matches modern user space e.g., via dlopen(). >> 2. We no longer deny write access to the elf interpreter after exec >> completed, treating it just like shared libraries (which it often is). > > We have a persistent issue with people using cp (or similar tools) to > replace system libraries. Since the file is truncated first, all > relocations and global data are replaced by file contents, result in > difficult-to-diagnose crashes. It would be nice if we had a way to > prevent this mistake. It doesn't have to be MAP_DENYWRITE or MAP_COPY. > It could be something completely new, like an option that turns every > future access beyond the truncation point into a signal (rather than > getting bad data or bad code and crashing much later). > > I don't know how many invalid copy operations are currently thwarted by > the current program interpreter restriction. I doubt that lifting the > restriction matters. I suspect that what should happen is that we should make shared libraries and executables read-only on disk. We could potentially take this a step farther and introduce a new sysctl that causes "mmap(adr, len, PROT_EXEC, MAP_SHARED, fd, off)" but not PROT_WRITE to fail if the file can be written by anyone. That sysctl could even deny chown adding write access to the file if there are mappings open. Given that there hasn't been enough pain for people to install shared libraries read-only yet I suspect just installing executables and shared libraries without write-permissions on disk is enough to prevent the hard to track down bugs you have been talking about. >> 3. We always deny write access to the file linked via /proc/pid/exe: >> sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file >> cannot be denied, and write access to the file will remain denied >> until the link is effectivel gone (exec, termination, >> PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. >> >> I was wondering if we really care about permanently disabling write access >> to the executable, or if it would be good enough to just disable write >> access while loading the new executable during exec; but I don't know >> the history of that -- and it somewhat makes sense to deny write access >> at least to the main executable. With modern user space -- dlopen() -- we >> can effectively modify the content of shared libraries while being used. > > Is there a difference between ET_DYN and ET_EXEC executables? What is being changed is how we track which files to denying write access on. Instead of denying write-access based on a per mapping (aka mmap) basis, the new code is only denying access to /proc/self/exe. Because the method of tracking is much coarser is why the interper stops being protected. The code doesn't care how the mappings happen, only if the file is /proc/self/exe or not. Eric
David Hildenbrand <david@redhat.com> writes: > This series is based on v5.14-rc5 and corresponds code-wise to the > previously sent RFC [1] (the RFC still applied cleanly). > > This series removes all in-tree usage of MAP_DENYWRITE from the kernel > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for > user space applications a while ago because of the chance for DoS. > The last renaming user is binfmt binary loading during exec and > legacy library loading via uselib(). > > With this change, MAP_DENYWRITE is effectively ignored throughout the > kernel. Although the net change is small, I think the cleanup in mmap() > is quite nice. > > There are some (minor) user-visible changes with this series: > 1. We no longer deny write access to shared libaries loaded via legacy > uselib(); this behavior matches modern user space e.g., via dlopen(). > 2. We no longer deny write access to the elf interpreter after exec > completed, treating it just like shared libraries (which it often is). > 3. We always deny write access to the file linked via /proc/pid/exe: > sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file > cannot be denied, and write access to the file will remain denied > until the link is effectivel gone (exec, termination, > PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. > > I was wondering if we really care about permanently disabling write access > to the executable, or if it would be good enough to just disable write > access while loading the new executable during exec; but I don't know > the history of that -- and it somewhat makes sense to deny write access > at least to the main executable. With modern user space -- dlopen() -- we > can effectively modify the content of shared libraries while being > used. So I think what we really want to do is to install executables with and shared libraries without write permissions and immutable. So that upgrades/replacements of the libraries and executables are forced to rename or unlink them. We need the immutable bit as CAP_DAC_OVERRIDE aka being root ignores the writable bits when a file is opened for write. However CAP_DAC_OVERRIDE does not override the immutable state of a file. I believe that denying write access at exec mmap time is actually much to late in the process and making the denial of writing much larger in scope is fundamentally what we want to do. Changing how we install the files, avoids the denial of service problems that MAP_DENYWRITE had. Making the denial always happen ensures that installation programs are never fooled into thinking a non-atomic update of an executable or shared library is ok. Still that is non-kernel work so I don't know who would make that change. As this fundamentally simplifies and a design mistake with very little functional change. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> For the entire series. > There is a related problem [2] with overlayfs, that should at least partly > be tackled by this series. I don't quite understand the interaction of > overlayfs and deny_write_access()/allow_write_access() at exec time: > > If we end up denying write access to the wrong file and not to the > realfile, that would be fundamentally broken. We would have to reroute > our deny_write_access()/ allow_write_access() calls for the exec file to > the realfile -- but I leave figuring out the details to overlayfs guys, as > that would be a related but different issue. > > RFC -> v1: > - "binfmt: remove in-tree usage of MAP_DENYWRITE" > -- Add a note that this should fix part of a problem with overlayfs > > [1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/ > [2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/ > > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: "H. Peter Anvin" <hpa@zytor.com> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: Alexey Dobriyan <adobriyan@gmail.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> > Cc: Jiri Olsa <jolsa@redhat.com> > Cc: Namhyung Kim <namhyung@kernel.org> > Cc: Petr Mladek <pmladek@suse.com> > Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> > Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> > Cc: Kees Cook <keescook@chromium.org> > Cc: "Eric W. Biederman" <ebiederm@xmission.com> > Cc: Greg Ungerer <gerg@linux-m68k.org> > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > Cc: Mike Rapoport <rppt@kernel.org> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> > Cc: Chinwen Chang <chinwen.chang@mediatek.com> > Cc: Michel Lespinasse <walken@google.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> > Cc: Huang Ying <ying.huang@intel.com> > Cc: Jann Horn <jannh@google.com> > Cc: Feng Tang <feng.tang@intel.com> > Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Shawn Anastasio <shawn@anastas.io> > Cc: Steven Price <steven.price@arm.com> > Cc: Nicholas Piggin <npiggin@gmail.com> > Cc: Christian Brauner <christian.brauner@ubuntu.com> > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Gabriel Krisman Bertazi <krisman@collabora.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Suren Baghdasaryan <surenb@google.com> > Cc: Shakeel Butt <shakeelb@google.com> > Cc: Marco Elver <elver@google.com> > Cc: Daniel Jordan <daniel.m.jordan@oracle.com> > Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com> > Cc: Thomas Cedeno <thomascedeno@google.com> > Cc: Collin Fijalkovich <cfijalkovich@google.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Chengguang Xu <cgxu519@mykernel.net> > Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com> > Cc: linux-unionfs@vger.kernel.org > Cc: linux-api@vger.kernel.org > Cc: x86@kernel.org > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-mm@kvack.org > > David Hildenbrand (7): > binfmt: don't use MAP_DENYWRITE when loading shared libraries via > uselib() > kernel/fork: factor out atomcially replacing the current MM exe_file > kernel/fork: always deny write access to current MM exe_file > binfmt: remove in-tree usage of MAP_DENYWRITE > mm: remove VM_DENYWRITE > mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() > fs: update documentation of get_write_access() and friends > > arch/x86/ia32/ia32_aout.c | 8 ++-- > fs/binfmt_aout.c | 7 ++-- > fs/binfmt_elf.c | 6 +-- > fs/binfmt_elf_fdpic.c | 2 +- > fs/proc/task_mmu.c | 1 - > include/linux/fs.h | 19 +++++---- > include/linux/mm.h | 3 +- > include/linux/mman.h | 4 +- > include/trace/events/mmflags.h | 1 - > kernel/events/core.c | 2 - > kernel/fork.c | 75 ++++++++++++++++++++++++++++++---- > kernel/sys.c | 33 +-------------- > lib/test_printf.c | 5 +-- > mm/mmap.c | 29 ++----------- > mm/nommu.c | 2 - > 15 files changed, 98 insertions(+), 99 deletions(-) > > > base-commit: 36a21d51725af2ce0700c6ebcb6b9594aac658a6
On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote: > David Hildenbrand <david@redhat.com> writes: > > > This series is based on v5.14-rc5 and corresponds code-wise to the > > previously sent RFC [1] (the RFC still applied cleanly). > > > > This series removes all in-tree usage of MAP_DENYWRITE from the kernel > > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for > > user space applications a while ago because of the chance for DoS. > > The last renaming user is binfmt binary loading during exec and > > legacy library loading via uselib(). > > > > With this change, MAP_DENYWRITE is effectively ignored throughout the > > kernel. Although the net change is small, I think the cleanup in mmap() > > is quite nice. > > > > There are some (minor) user-visible changes with this series: > > 1. We no longer deny write access to shared libaries loaded via legacy > > uselib(); this behavior matches modern user space e.g., via dlopen(). > > 2. We no longer deny write access to the elf interpreter after exec > > completed, treating it just like shared libraries (which it often is). > > 3. We always deny write access to the file linked via /proc/pid/exe: > > sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file > > cannot be denied, and write access to the file will remain denied > > until the link is effectivel gone (exec, termination, > > PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. > > > > I was wondering if we really care about permanently disabling write access > > to the executable, or if it would be good enough to just disable write > > access while loading the new executable during exec; but I don't know > > the history of that -- and it somewhat makes sense to deny write access > > at least to the main executable. With modern user space -- dlopen() -- we > > can effectively modify the content of shared libraries while being > > used. > > So I think what we really want to do is to install executables with > and shared libraries without write permissions and immutable. So that > upgrades/replacements of the libraries and executables are forced to > rename or unlink them. We need the immutable bit as CAP_DAC_OVERRIDE > aka being root ignores the writable bits when a file is opened for > write. However CAP_DAC_OVERRIDE does not override the immutable state > of a file. If we really want to do this, I think we'd want a different flag that's more like sealed. Non-root users should be able to do this, too. Or we could just more gracefully handle users that overwrite running programs. --Andy
"Andy Lutomirski" <luto@kernel.org> writes: > On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote: >> David Hildenbrand <david@redhat.com> writes: >> >> > This series is based on v5.14-rc5 and corresponds code-wise to the >> > previously sent RFC [1] (the RFC still applied cleanly). >> > >> > This series removes all in-tree usage of MAP_DENYWRITE from the kernel >> > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for >> > user space applications a while ago because of the chance for DoS. >> > The last renaming user is binfmt binary loading during exec and >> > legacy library loading via uselib(). >> > >> > With this change, MAP_DENYWRITE is effectively ignored throughout the >> > kernel. Although the net change is small, I think the cleanup in mmap() >> > is quite nice. >> > >> > There are some (minor) user-visible changes with this series: >> > 1. We no longer deny write access to shared libaries loaded via legacy >> > uselib(); this behavior matches modern user space e.g., via dlopen(). >> > 2. We no longer deny write access to the elf interpreter after exec >> > completed, treating it just like shared libraries (which it often is). >> > 3. We always deny write access to the file linked via /proc/pid/exe: >> > sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file >> > cannot be denied, and write access to the file will remain denied >> > until the link is effectivel gone (exec, termination, >> > PR_SET_MM_EXE_FILE) -- just as if exec'ing the file. >> > >> > I was wondering if we really care about permanently disabling write access >> > to the executable, or if it would be good enough to just disable write >> > access while loading the new executable during exec; but I don't know >> > the history of that -- and it somewhat makes sense to deny write access >> > at least to the main executable. With modern user space -- dlopen() -- we >> > can effectively modify the content of shared libraries while being >> > used. >> >> So I think what we really want to do is to install executables with >> and shared libraries without write permissions and immutable. So that >> upgrades/replacements of the libraries and executables are forced to >> rename or unlink them. We need the immutable bit as CAP_DAC_OVERRIDE >> aka being root ignores the writable bits when a file is opened for >> write. However CAP_DAC_OVERRIDE does not override the immutable state >> of a file. > > If we really want to do this, I think we'd want a different flag > that's more like sealed. Non-root users should be able to do this, > too. > > Or we could just more gracefully handle users that overwrite running > programs. I had a blind spot, and Florian Weimer made a very reasonable request. Apparently userspace for shared libraires uses MAP_PRIVATE. So we almost don't care if the library is overwritten. We loose some efficiency and apparently there are some corner cases like the library being extended past the end of the exiting file that are problematic. Given that MAP_PRIVATE for shared libraries is our strategy for handling writes to shared libraries perhaps we just need to use MAP_POPULATE or a new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that everything mapped from the executable is guaranteed to be visible from the time of the mmap, and any changes from the filesystem side after that are guaranteed to cause a copy on write. Once we get that figured out we could consider getting rid of deny-write entirely. Eric
On Thu, Aug 12, 2021, at 10:48 AM, Eric W. Biederman wrote: > "Andy Lutomirski" <luto@kernel.org> writes: > I had a blind spot, and Florian Weimer made a very reasonable request. > Apparently userspace for shared libraires uses MAP_PRIVATE. > > So we almost don't care if the library is overwritten. We loose some > efficiency and apparently there are some corner cases like the library > being extended past the end of the exiting file that are problematic. > > Given that MAP_PRIVATE for shared libraries is our strategy for handling > writes to shared libraries perhaps we just need to use MAP_POPULATE or a > new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that > everything mapped from the executable is guaranteed to be visible from > the time of the mmap, and any changes from the filesystem side after > that are guaranteed to cause a copy on write. > > Once we get that figured out we could consider getting rid of deny-write > entirely. Are all of the CoW bits in good enough shape for this to work without just immediately CoWing the whole file? In principle, write(2) to a file should be able to notice that it needs to CoW some pages, but I doubt that this actually works. --Andy
On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Given that MAP_PRIVATE for shared libraries is our strategy for handling > writes to shared libraries perhaps we just need to use MAP_POPULATE or a > new related flag (perhaps MAP_PRIVATE_NOW) No. That would be horrible for the usual bloated GUI libraries. It might help some (dynamic page faults are not cheap either), but it would hurt a lot. This is definitely a "if you overwrite a system library while it's being used, you get to keep both pieces" situation. The kernel ETXTBUSY thing is purely a courtesy feature, and as people have noticed it only really works for the main executable because of various reasons. It's not something user space should even rely on, it's more of a "ok, you're doing something incredibly stupid, and we'll help you avoid shooting yourself in the foot when we notice". Any distro should make sure their upgrade tools don't just truncate/write to random libraries executables. And if they do, it's really not a kernel issue. This patch series basically takes this very historical error return, and simplifies and clarifies the implementation, and in the process might change some very subtle corner case (unmapping the original executable entirely?). I hope (and think) it wouldn't matter exactly because this is a "courtesy error" rather than anything that a sane setup would _depend_ on, but hey, insane setups clearly exist. Linus
* Eric W. Biederman: > Given that MAP_PRIVATE for shared libraries is our strategy for handling > writes to shared libraries perhaps we just need to use MAP_POPULATE or a > new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that > everything mapped from the executable is guaranteed to be visible from > the time of the mmap, and any changes from the filesystem side after > that are guaranteed to cause a copy on write. I think this is called MAP_COPY: <https://www.gnu.org/software/hurd/glibc/mmap.html> If we could get that functionality, we would certainly use it in the glibc dynamic loader. And it's not just dynamic loaders that would benefit. But I assume there are some rather thorny issues around semantics. If the changed areas of the file are in the page cache already, everything is okay. But if parts of the file are changed or discarded that are not, they would have to be read in first, which is rather awkward. That's why I suggested the signal for all future accesses. It seems more tractable and addresses the largest issue (the difficulty of figuring out why some processes crash occasionally). Thanks, Florian
On Thu, Aug 12, 2021 at 8:16 AM Florian Weimer <fweimer@redhat.com> wrote: > > I think this is called MAP_COPY: > > <https://www.gnu.org/software/hurd/glibc/mmap.html> Please don't even consider the crazy notions that GNU Hurd did. It's a fundamental design mistake. The Hurd VM was horrendous, and MAP_COPY was a prime example of the kinds of horrors it had. I'm not sure how much of the mis-designs were due to Hurd, and how much of it due to Mach 3. But please don't point to Hurd VM documentation except possibly to warn people. We want people to _forget_ those mistakes, not repeat them. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Given that MAP_PRIVATE for shared libraries is our strategy for handling >> writes to shared libraries perhaps we just need to use MAP_POPULATE or a >> new related flag (perhaps MAP_PRIVATE_NOW) > > No. That would be horrible for the usual bloated GUI libraries. It > might help some (dynamic page faults are not cheap either), but it > would hurt a lot. I wasn't aiming so much at the MAP_POPULATE part but something that would trigger cow from writes to the file. I see code that is close but I don't see any code in the kernel that would implement that currently. Upon reflection I think it will always be difficult to trigger cow from the file write side of the kernel as code that would cow the page in the page cache would cause problems with writable mmaps. > This is definitely a "if you overwrite a system library while it's > being used, you get to keep both pieces" situation. > > The kernel ETXTBUSY thing is purely a courtesy feature, and as people > have noticed it only really works for the main executable because of > various reasons. It's not something user space should even rely on, > it's more of a "ok, you're doing something incredibly stupid, and > we'll help you avoid shooting yourself in the foot when we notice". > > Any distro should make sure their upgrade tools don't just > truncate/write to random libraries executables. Yes. I am trying to come up with advice on how userspace implementations can implement their tools to use other mechanisms that solve the overwriting shared libaries and executables problem that are not broken by design. For a little bit the way Florian Weirmer was talking and the fact that uselib uses MAP_PRIVATE had me thinking that somehow MAP_PRIVATE could be part of the solution. I have now looked into the implementation of MAP_PRIVATE and I since we don't perform the cow on filesystem writes MAP_PRIVATE absolutely can not be part of the solution we recommend to userspace. So today the best advice I can give to userspace is to mark their executables and shared libraries as read-only and immutable. Otherwise a change to the executable file can change what is mapped into memory. MAP_PRIVATE does not help. > And if they do, it's really not a kernel issue. What is a kernel issue is giving people good advice on how to use kernel features to solve real world problems. I have seen the write to a mapped exectuable/shared lib problem, and Florian has seen it. So while rare the problem is real and a pain to debug. > This patch series basically takes this very historical error return, > and simplifies and clarifies the implementation, and in the process > might change some very subtle corner case (unmapping the original > executable entirely?). I hope (and think) it wouldn't matter exactly > because this is a "courtesy error" rather than anything that a sane > setup would _depend_ on, but hey, insane setups clearly exist. Oh yes. I very much agree that the design of this patchset is perfectly fine. I also see that MAP_DENYWRITE is unfortunately broken by design. I vaguely remember the discussion when MAP_DENYWRITE was made a noop because of the denial-of-service aspect of MAP_DENYWRITE. I very much agree that we should strongly encourage userspace not to write to mmaped files. As I am learning with my two year old, it helps to give a constructive suggestion of alternative behavior instead of just saying no. Florian reported that there remains a problem in userspace. So I am coming up with a constructive suggestion. My apologies for going off into the weeds for a moment. Eric
On 12.08.21 20:10, Linus Torvalds wrote: > On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Given that MAP_PRIVATE for shared libraries is our strategy for handling >> writes to shared libraries perhaps we just need to use MAP_POPULATE or a >> new related flag (perhaps MAP_PRIVATE_NOW) > > No. That would be horrible for the usual bloated GUI libraries. It > might help some (dynamic page faults are not cheap either), but it > would hurt a lot. Right, we most certainly don't want to waste system ram / swap space, memory for page tables, and degrade performance just because some corner-case nasty user space could harm itself. > > This is definitely a "if you overwrite a system library while it's > being used, you get to keep both pieces" situation. Right, play stupid games, win stupid prices. I agree that if there would be an efficient way to detect+handle such overwrites gracefully, it would be great to have the kernel support that. ETXTBUSY as implemented with this series (but also before this series) is really only a minimalistic approach to help detect some issues regarding the main executable.
From: Eric W. Biederman > Sent: 12 August 2021 19:47 ... > So today the best advice I can give to userspace is to mark their > executables and shared libraries as read-only and immutable. Otherwise > a change to the executable file can change what is mapped into memory. > MAP_PRIVATE does not help. While 'immutable' might be ok for files installed by distributions it would be a PITA in development. ETXTBUSY is a useful reminder that the file you are copying from machine A to machine B (etc) is still running and probably ought to be killed/stopped before you get confused. I've never really understood why it doesn't stop shared libraries being overwritten - but they do tend to be updated less often. Overwriting an in-use shared library could be really confusing. It is likely that all the code is actually in memory. So everything carries on running as normal. Until the kernel gets under memory pressure and discards a page. Then a page from the new version is faulted in and random programs start getting SEGVs. This could be days after the borked update. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
* Eric W. Biederman: > Florian Weimer, would it be possible to get glibc's ld.so implementation to use > MAP_SHARED? Just so people reading the code know what to expect of the > kernel? As far as I can tell there is not a practical difference > between a read-only MAP_PRIVATE and a read-only MAP_SHARED. Some applications use mprotect to change page protections behind glibc's back. Using MAP_SHARED would break fork pretty badly. Most of the hard-to-diagnose crashes seem to come from global data or relocations because they are wiped by truncation. And we certainly can't use MAP_SHARED for those. Code often seems to come back unchanged after the truncation because the overwritten file hasn't actually changed. File attributes don't help because the copying is an adminstrative action in the context of the application (maybe the result of some automation). I think avoiding the crashes isn't the right approach. What I'd like to see is better diagnostics. Writing mtime and ctime to the core file might help. Or adding a flag to the core file and /proc/PID/smaps that indicates if the file has been truncated across the mapping since the mapping was created. A bit less conservative and even more obvious to diagnose would be a new flag for the mapping (perhaps set via madvise) that causes any future access to the mapping to fault with SIGBUS and a special si_code value after the file has been truncated across the mapping. I think we would set that in the glibc dynamic loader. It would make the crashes much less weird. Thanks, Florian
On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Florian Weimer, would it be possible to get glibc's ld.so implementation to use > MAP_SHARED? Just so people reading the code know what to expect of the > kernel? As far as I can tell there is not a practical difference > between a read-only MAP_PRIVATE and a read-only MAP_SHARED. There's a huge difference. For one, you actually don't necessarily want read-only. Doing COW on library images is quite common for things like relocation etc (you'd _hope_ everything is PC-relative, but no) So no. Never EVER use MAP_SHARED unless you literally expect to have two different mappings that need to be kept in sync and one writes the other. I'll just repeat: stop arguing about this case. If somebody writes to a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care at all about it apart from the "you get what you deserve". What's next? Do you think glibc should also map every byte in the user address space so that user programs don't get SIGSEGV when they have wild pointers? Again - that's a user BUG and trying to "work around" a wild pointer is a worse fix than the problem it tries to fix. The exact same thing is true for shared library (or executable) mappings. Trying to work around people writing to them is *worse* than the bug of doing so. Stop this completely inane discussion already. Linus
On Fri, Aug 13, 2021, at 5:31 PM, Linus Torvalds wrote: > On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman > <ebiederm@xmission.com> wrote: > > > > Florian Weimer, would it be possible to get glibc's ld.so implementation to use > > MAP_SHARED? Just so people reading the code know what to expect of the > > kernel? As far as I can tell there is not a practical difference > > between a read-only MAP_PRIVATE and a read-only MAP_SHARED. > > There's a huge difference. > > For one, you actually don't necessarily want read-only. Doing COW on > library images is quite common for things like relocation etc (you'd > _hope_ everything is PC-relative, but no) > > So no. Never EVER use MAP_SHARED unless you literally expect to have > two different mappings that need to be kept in sync and one writes the > other. > > I'll just repeat: stop arguing about this case. If somebody writes to > a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care > at all about it apart from the "you get what you deserve". > > What's next? Do you think glibc should also map every byte in the user > address space so that user programs don't get SIGSEGV when they have > wild pointers? > > Again - that's a user BUG and trying to "work around" a wild pointer > is a worse fix than the problem it tries to fix. > > The exact same thing is true for shared library (or executable) > mappings. Trying to work around people writing to them is *worse* than > the bug of doing so. > > Stop this completely inane discussion already. > I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. In my life, I’ve encountered -ETXTBUSY intermittently, and it invariably means that I somehow failed to finish killing a program fast enough for whatever random rebuild I’m doing to succeed. It’s at best erratic — it only applies for static binaries, and it has never once saved me from a problem I care about. If the program I’m recompiling crashes, I don’t care — it’s probably already part way through dying from an unrelated fatal signal. What actually happens is that I see -ETXTBUSY, think “wait, this isn’t Windows, why are there file sharing rules,” then think “wait, Linux has *one* half baked file sharing rule,” and go on with my life. [0] Seriously, can we deprecate and remove the whole thing? [0] we have mandatory locks, too. Sigh.
On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: > > I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. I think that would be ok, except I can see somebody relying on it. It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. But you are right that we have removed parts of it over time (no more MAP_DENYWRITE, no more uselib()) so that what we have today is a fairly weak form of what we used to do. And nobody really complained when we weakened it, so maybe removing it entirely might be acceptable. Linus
On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And nobody really complained when we weakened it, so maybe removing it > entirely might be acceptable. I guess we could just try it and see... Worst comes to worst, we'll have to put it back, but at least we'd know what crazy thing still wants it.. Linus
On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote: > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > And nobody really complained when we weakened it, so maybe removing it > > entirely might be acceptable. > > I guess we could just try it and see... Worst comes to worst, we'll > have to put it back, but at least we'd know what crazy thing still > wants it.. Umm... I'll need to go back and look through the thread, but I'm fairly sure that there used to be suckers that did replacement of binary that way (try to write, count on exclusion with execve while it's being written to) instead of using rename. Install scripts of weird crap and stuff like that...
On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote: > On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote: > > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > And nobody really complained when we weakened it, so maybe removing it > > > entirely might be acceptable. > > > > I guess we could just try it and see... Worst comes to worst, we'll > > have to put it back, but at least we'd know what crazy thing still > > wants it.. > > Umm... I'll need to go back and look through the thread, but I'm > fairly sure that there used to be suckers that did replacement of > binary that way (try to write, count on exclusion with execve while > it's being written to) instead of using rename. Install scripts > of weird crap and stuff like that... ... and before anyone goes off - I certainly agree that using that behaviour is not a good idea and had never been one. All I'm saying is that there at least used to be very random (and rarely exercised) bits of userland relying upon that behaviour.
On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> [0] we have mandatory locks, too. Sigh.
I'd love to remove that. Perhaps we could try persuading more of the
distros to disable the CONFIG option first.
On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote: > On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote: > > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > And nobody really complained when we weakened it, so maybe removing it > > > entirely might be acceptable. > > > > I guess we could just try it and see... Worst comes to worst, we'll > > have to put it back, but at least we'd know what crazy thing still > > wants it.. > > Umm... I'll need to go back and look through the thread, but I'm > fairly sure that there used to be suckers that did replacement of > binary that way (try to write, count on exclusion with execve while > it's being written to) instead of using rename. Install scripts > of weird crap and stuff like that... I'm not agains trying to remove it, but I think Al has a point. Removing the write protection will also most certainly make certain classes of attacks _easier_. For example, the runC container breakout from last year using privileged containers issued CVE-2019-5736 would be easier. I'm quoting from the commit I fixed this with: The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe (/proc/self/exec is a symbolic link created by the kernel for every process which points to the binary that was executed for that process). As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed - which will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. and then the write protection kicks in of course: However in general, this will not succeed as the kernel will not permit it to be overwritten whilst runC is executing. which the attack can of course already overcome nowadays with minimal smarts: To overcome this, the attacker can instead open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. But with write protection removed you'd allow such attacks to succeed right away. It's not a huge deal to remove it since we need to have other protection mechanisms in place already: To prevent this attack, LXC has been patched to create a temporary copy of the calling binary itself when it starts or attaches to containers. To do this LXC creates an anonymous, in-memory file using the memfd_create() system call and copies itself into the temporary in-memory file, which is then sealed to prevent further modifications. LXC then executes this sealed, in-memory file instead of the original on-disk binary. Any compromising write operations from a privileged container to the host LXC binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed, writes to this will also fail. Note: memfd_create() was added to the Linux kernel in the 3.17 release. However, I still like to pich the upgrade mask idea Aleksa and we tried to implement when we did openat2(). If we leave write-protection in preventing /proc/self/exe from being written to: we can take some time and upstream the upgrade mask patchset which was part of the initial openat2() patchset but was dropped back then (and I had Linus remove the last remants of the idea in [1]). The idea was to add a new field to struct open_how "upgrade_mask" that would allow a caller to specify with what permissions an fd could be reopened with. I still like this idea a great deal and it would be a very welcome addition to system management programs. The upgrade mask is of course optional, i.e. the caller would have to specify the upgrade mask at open time to restrict reopening (lest we regress the whole world). But, we could make it so that an O_PATH fd gotten from opening /proc/<pid>/exe always gets a restricted upgrade mask set and so it can't be upgraded to a O_WRONLY fd afterwards. For this to be meaningful, write protection for /proc/self/exe would need to be kept. [1]: commit 5c350aa11b441b32baf3bfe4018168cb8d10cef7 Author: Christian Brauner <christian.brauner@ubuntu.com> Date: Fri May 28 11:24:15 2021 +0200 fcntl: remove unused VALID_UPGRADE_FLAGS We currently do not maky use of this feature and should we implement something like this in the future it's trivial to add it back. Link: https://lore.kernel.org/r/20210528092417.3942079-2-brauner@kernel.org Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
On 14.08.21 04:02, Al Viro wrote: > On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote: >> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote: >>> On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds >>> <torvalds@linux-foundation.org> wrote: >>>> >>>> And nobody really complained when we weakened it, so maybe removing it >>>> entirely might be acceptable. >>> >>> I guess we could just try it and see... Worst comes to worst, we'll >>> have to put it back, but at least we'd know what crazy thing still >>> wants it.. >> >> Umm... I'll need to go back and look through the thread, but I'm >> fairly sure that there used to be suckers that did replacement of >> binary that way (try to write, count on exclusion with execve while >> it's being written to) instead of using rename. Install scripts >> of weird crap and stuff like that... > > ... and before anyone goes off - I certainly agree that using that > behaviour is not a good idea and had never been one. All I'm saying > is that there at least used to be very random (and rarely exercised) > bits of userland relying upon that behaviour. > Removing it completely is certainly more controversial than limiting it to the main executable. I'm mostly happy as long as we get rid of that nasty per-VMA handling, because that adds real complexity at places that are complicated enough. Having the remaining deny_write_access()/allow_write_access() at sane places now (loading a new binary, exchanging exe_file) looks certainly much cleaner and I still consider it a valuable, simple sanity feature to have around. I don't think there is any sane use case for modifying the main executable, and it seems to be very easy to catch. For example, besides users that rely on this behavior, in my thinking (see the cover letter), especially having a binary not getting changed while we're loading it sounds like a very good idea (not saying we would expose a way to exploit the kernel if we would allow for modifications while in the elf parser, but also not saying we wouldn't because I didn't check if there would be a way; at least we already allow it in the legacy library loader before mapping the segments with MAP_DENYWRITE). And if we decide to keep the behavior while loading the executable, keeping it while exe_file is set isn't much added code/complexity IMHO. Long story short, I'd vote for keeping it in, and if we decide to rip it out completely, do it a a separate, more careful step.
From: Linus Torvalds > Sent: 14 August 2021 01:55 > > On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism > entirely. > > I think that would be ok, except I can see somebody relying on it. > > It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. I think ETXTBUSY predates Linux itself. But I can't remember whether the elf versions of sunos or svr4 implemented it for shared libraries. I don't remember hitting it, so they may not have. I'm actually surprised it ia an mmap() flag rather than an open() one. Being able to open a file and guarantee it can't be changed seems a sane idea. And not just for programs/libraries. By the sound of it 'immutable' is no use. You need to be able to unlink the file - otherwise you get into the window's fiasco of not being able to update without 17 reboots. FWIW MAP_COPY would only need to take one copy of the page - all the users could share the same page (backed by a single page of swap). Not that I'm suggesting it is a good idea at all. I do wonder about /proc/self/exe though. It gave the NetBSD Linux emulation a terrible problem. Being able to open the inode of the program is fine. The problem is the what readlink() returns - it is basically stale. If a program open the link contents it could get anything at all. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Matthew Wilcox <willy@infradead.org> writes: > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: >> [0] we have mandatory locks, too. Sigh. > > I'd love to remove that. Perhaps we could try persuading more of the > distros to disable the CONFIG option first. Yes. The support is disabled in RHEL8. Does anyone know the appropriate people to talk to encourage other distro's to encourage them to disable the CONFIG_MANDATORY_FILE_LOCKING? Either that or we can wait until the code bit-rots, but distro's disabling and removing a feature on their own is the more responsible path. Given how many hoops need to be jumped through to use mandatory file locking once it is enabled, and the fact it has never worked in containers makes me suspect there are no more users. Eric
On 17.08.21 18:48, Eric W. Biederman wrote: > Matthew Wilcox <willy@infradead.org> writes: > >> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: >>> [0] we have mandatory locks, too. Sigh. >> >> I'd love to remove that. Perhaps we could try persuading more of the >> distros to disable the CONFIG option first. > > Yes. The support is disabled in RHEL8. kernel-ark also seems to not set it for Fedora and ARK redhat/configs/common/generic/CONFIG_MANDATORY_FILE_LOCKING:# CONFIG_MANDATORY_FILE_LOCKING is not set
On Sat, Aug 14, 2021 at 04:04:09AM +0100, Matthew Wilcox wrote: > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > > [0] we have mandatory locks, too. Sigh. > > I'd love to remove that. Perhaps we could try persuading more of the > distros to disable the CONFIG option first. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940392
On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Matthew Wilcox <willy@infradead.org> writes: > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > >> [0] we have mandatory locks, too. Sigh. > > > > I'd love to remove that. Perhaps we could try persuading more of the > > distros to disable the CONFIG option first. > > Yes. The support is disabled in RHEL8. If it helps, it seems to be enabled on the just released debian stable: $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64 CONFIG_MANDATORY_FILE_LOCKING=y Also the new 5.13 kernel in experimental has it too: $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64 CONFIG_MANDATORY_FILE_LOCKING=y
On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > I’ll bite. How about we attack this in the opposite direction: remove > the deny write mechanism entirely. For what it's worth, Windows has open flags that allow denying read or write opens. They also made their way into the NFSv4 protocol, but knfsd enforces them only against other NFSv4 clients. Last I checked, Samba attempted to emulate them using flock (and there's a comment to that effect on the flock syscall in fs/locks.c). I don't know what Wine does. Pavel Shilovsky posted flags adding O_DENY* flags years ago: https://lwn.net/Articles/581005/ I keep thinking I should look back at those some day but will probably never get to it. I've no idea how Windows applications use them, though I'm told it's common. --b.
bfields@fieldses.org (J. Bruce Fields) writes: > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: >> I’ll bite. How about we attack this in the opposite direction: remove >> the deny write mechanism entirely. > > For what it's worth, Windows has open flags that allow denying read or > write opens. They also made their way into the NFSv4 protocol, but > knfsd enforces them only against other NFSv4 clients. Last I checked, > Samba attempted to emulate them using flock (and there's a comment to > that effect on the flock syscall in fs/locks.c). I don't know what Wine > does. > > Pavel Shilovsky posted flags adding O_DENY* flags years ago: > > https://lwn.net/Articles/581005/ > > I keep thinking I should look back at those some day but will probably > never get to it. > > I've no idea how Windows applications use them, though I'm told it's > common. I don't know in any detail. I just have this memory of not being able to open or do anything with a file on windows while any application has it open. We limit mandatory locks to filesystems that have the proper mount flag and files that are sgid but are not executable. Reusing that limit we could probably allow such a behavior in Linux without causing chaos. Without being very strict about which files can participate I can just imagine someone hiding their presence by not allowing other applications the ability to write to utmp or a log file. In the windows world where everything evolved with those kinds of restrictions it is probably fine (although super annoying). Eric
On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote: > bfields@fieldses.org (J. Bruce Fields) writes: > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > >> I’ll bite. How about we attack this in the opposite direction: remove > >> the deny write mechanism entirely. > > > > For what it's worth, Windows has open flags that allow denying read or > > write opens. They also made their way into the NFSv4 protocol, but > > knfsd enforces them only against other NFSv4 clients. Last I checked, > > Samba attempted to emulate them using flock (and there's a comment to > > that effect on the flock syscall in fs/locks.c). I don't know what Wine > > does. > > > > Pavel Shilovsky posted flags adding O_DENY* flags years ago: > > > > https://lwn.net/Articles/581005/ > > > > I keep thinking I should look back at those some day but will probably > > never get to it. > > > > I've no idea how Windows applications use them, though I'm told it's > > common. > > I don't know in any detail. I just have this memory of not being able > to open or do anything with a file on windows while any application has > it open. > > We limit mandatory locks to filesystems that have the proper mount flag > and files that are sgid but are not executable. Reusing that limit we > could probably allow such a behavior in Linux without causing chaos. I'm pretty confused about how we're using the term "mandatory locking". The locks you're thinking of are basically ordinary posix byte-range locks which we attempt to enforce as mandatory under certain conditions (e.g. in fs/read_write.c:rw_verify_area). That means we have to check them on ordinary reads and writes, which is a pain in the butt. (And we don't manage to do it correctly--the code just checks for the existence of a conflicting lock before performing IO, ignoring the obvious time-of-check/time-of-use race.) This has nothing to do with Windows share locks which from what I understand are whole-file locks that are only enforced against opens. --b. > Without being very strict about which files can participate I can just > imagine someone hiding their presence by not allowing other applications > the ability to write to utmp or a log file. > > In the windows world where everything evolved with those kinds of > restrictions it is probably fine (although super annoying). > > Eric
On Tue, 2021-08-17 at 11:48 -0500, Eric W. Biederman wrote: > Matthew Wilcox <willy@infradead.org> writes: > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > > > [0] we have mandatory locks, too. Sigh. > > > > I'd love to remove that. Perhaps we could try persuading more of the > > distros to disable the CONFIG option first. > > Yes. The support is disabled in RHEL8. > > Does anyone know the appropriate people to talk to encourage other > distro's to encourage them to disable the CONFIG_MANDATORY_FILE_LOCKING? > > Either that or we can wait until the code bit-rots, but distro's > disabling and removing a feature on their own is the more responsible > path. > > Given how many hoops need to be jumped through to use mandatory file > locking once it is enabled, and the fact it has never worked in > containers makes me suspect there are no more users. > I'm all for ripping it out too. It's an insane interface anyway. I've not heard a single complaint about this being turned off in fedora/rhel or any other distro that has this disabled. Cheers,
On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: > > I'm all for ripping it out too. It's an insane interface anyway. > > I've not heard a single complaint about this being turned off in > fedora/rhel or any other distro that has this disabled. I'd love to remove it, we could absolutely test it. The fact that several major distros have it disabled makes me think it's fine. But as always, it would be good to check Android. The desktop distros tend to have the same tools and programs, so if Fedora and RHEL haven't needed it for years, then it's likely stale in Debian too (despite being enabled). But Android tends to be very different. Does anybody know? Linus
On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote: > On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Matthew Wilcox <willy@infradead.org> writes: > > > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > > > > [0] we have mandatory locks, too. Sigh. > > > > > > I'd love to remove that. Perhaps we could try persuading more of the > > > distros to disable the CONFIG option first. > > > > Yes. The support is disabled in RHEL8. > > If it helps, it seems to be enabled on the just released debian stable: > $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64 > CONFIG_MANDATORY_FILE_LOCKING=y > > Also the new 5.13 kernel in experimental has it too: > $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64 > CONFIG_MANDATORY_FILE_LOCKING=y A pity. It would have been nice if they had turned it off a while ago. I guess I should have done more outreach at the time. Sigh... In any case, I'm still inclined toward just ripping it out at this point. It's hard to believe that anyone really uses it.
On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote: > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > I'm all for ripping it out too. It's an insane interface anyway. > > > > I've not heard a single complaint about this being turned off in > > fedora/rhel or any other distro that has this disabled. > > I'd love to remove it, we could absolutely test it. The fact that > several major distros have it disabled makes me think it's fine. > > But as always, it would be good to check Android. > > The desktop distros tend to have the same tools and programs, so if > Fedora and RHEL haven't needed it for years, then it's likely stale in > Debian too (despite being enabled). > > But Android tends to be very different. Does anybody know? > As far as I know, Android never uses mandatory file locking. While CONFIG_MANDATORY_LOCKING=y is typically set (as it's "default y" upstream), I can't find anywhere in the Android source tree that uses the "mand" mount option, let alone anything actually using mandatory locks. (I'm assuming that Documentation/filesystems/mandatory-locking.rst is up-to-date regarding what userspace actually has to do to use it.) - Eric
On Thu, Aug 19, 2021 at 03:18:15PM -0400, Jeff Layton wrote: > On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote: > > On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > > > Matthew Wilcox <willy@infradead.org> writes: > > > > > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > > > > > [0] we have mandatory locks, too. Sigh. > > > > > > > > I'd love to remove that. Perhaps we could try persuading more of the > > > > distros to disable the CONFIG option first. > > > > > > Yes. The support is disabled in RHEL8. > > > > If it helps, it seems to be enabled on the just released debian stable: > > $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64 > > CONFIG_MANDATORY_FILE_LOCKING=y > > > > Also the new 5.13 kernel in experimental has it too: > > $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64 > > CONFIG_MANDATORY_FILE_LOCKING=y > > A pity. It would have been nice if they had turned it off a while ago. I > guess I should have done more outreach at the time. Sigh... Would it be acceptable to add a warning when MS_MANDLOCK is passed to mount() and backport this to stable kernels in order to get reports of any such use in a reasonably short time ? Anyway it sounds important to at least warn about deprecation. Willy
On Thu, 2021-08-19 at 12:15 -0700, Linus Torvalds wrote: > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > I'm all for ripping it out too. It's an insane interface anyway. > > > > I've not heard a single complaint about this being turned off in > > fedora/rhel or any other distro that has this disabled. > > I'd love to remove it, we could absolutely test it. The fact that > several major distros have it disabled makes me think it's fine. > > But as always, it would be good to check Android. > > The desktop distros tend to have the same tools and programs, so if > Fedora and RHEL haven't needed it for years, then it's likely stale in > Debian too (despite being enabled). > > But Android tends to be very different. Does anybody know? > Now that I think about it a little more, I actually did get one complaint a few years ago: Someone had upgraded from an earlier distro that supported the -o mand mount option to a later one that had disabled it, and they had an (old) fstab entry that specified it. They didn't actually use mandatory locking and weren't sure why the option was set, so they removed it and moved on. I would feel a lot better about it if we had gotten Debian to turn it off several years ago too, but I agree it's unlikely anyone uses this and the risk of removing it is low. I've spun up a patch to just rip it out. I'll do a bit of testing with it tomorrow and then send it out. Cheers,
On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > Now that I think about it a little more, I actually did get one > complaint a few years ago: > > Someone had upgraded from an earlier distro that supported the -o mand > mount option to a later one that had disabled it, and they had an (old) > fstab entry that specified it. Hmm. We might be able to turn the "return -EINVAL" into just a warning. Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we already do that VFS: "mand" mount option not supported warning print, but then we fail the mount. If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make sense to turn that warning into something bigger, but then let the mount continue - since now that "mand" flag would be purely a legacy thing. And yes, if we do that, we'd want the warning to be a big ugly thing, just to make people very aware of it happening. Right now it's a one-liner that is easy to miss, and the "oh, the mount failed" is the thing that hopefully informs people about the fact that they need to enable CONFIG_MANDATORY_FILE_LOCKING. The logic being that if you can no longer enable mandatory locking in the kernel, the current hard failure seems overly aggressive (and might cause boot failures and inability to fix/report things when it possibly keeps you from using the system at all). Linus
On Thu, 2021-08-19 at 13:31 -0700, Linus Torvalds wrote: > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > Now that I think about it a little more, I actually did get one > > complaint a few years ago: > > > > Someone had upgraded from an earlier distro that supported the -o mand > > mount option to a later one that had disabled it, and they had an (old) > > fstab entry that specified it. > > Hmm. We might be able to turn the "return -EINVAL" into just a warning. > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we > already do that > > VFS: "mand" mount option not supported > > warning print, but then we fail the mount. > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make > sense to turn that warning into something bigger, but then let the > mount continue - since now that "mand" flag would be purely a legacy > thing. > > And yes, if we do that, we'd want the warning to be a big ugly thing, > just to make people very aware of it happening. Right now it's a > one-liner that is easy to miss, and the "oh, the mount failed" is the > thing that hopefully informs people about the fact that they need to > enable CONFIG_MANDATORY_FILE_LOCKING. > > The logic being that if you can no longer enable mandatory locking in > the kernel, the current hard failure seems overly aggressive (and > might cause boot failures and inability to fix/report things when it > possibly keeps you from using the system at all). > What sort of big, ugly warning did you have in mind? I'm fine with that general approach though and will plan to roll that change into the patch I'm testing. Thanks,
On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote: > > What sort of big, ugly warning did you have in mind? I originally thought WARN_ON_ONCE() just to get the distro automatic error handling involved, but it would probably be a big problem for the people who end up having panic-on-warn or something. So probably just a "make it a big box" thing that stands out, kind of what lockdep etc does with pr_warn("======...====\n"); around the messages.. I don't know if distros have some pattern we could use that would end up being something that gets reported to the user? Linus
On Thu, Aug 19, 2021 at 01:31:35PM -0700, Linus Torvalds wrote: > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we > already do that > > VFS: "mand" mount option not supported > > warning print, but then we fail the mount. > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make > sense to turn that warning into something bigger, but then let the > mount continue - since now that "mand" flag would be purely a legacy > thing. > > And yes, if we do that, we'd want the warning to be a big ugly thing, > just to make people very aware of it happening. Right now it's a > one-liner that is easy to miss, and the "oh, the mount failed" is the > thing that hopefully informs people about the fact that they need to > enable CONFIG_MANDATORY_FILE_LOCKING. When I ripped out the NFS "intr" mount option fourteen years ago, I just turned it into a noop (commit 150030b78a45). It has greatly amused me every article I've read that's been written since then that recommends using it. Just shows how much tribal knowledge we have. I think this is a little different, though; I was essetially making the *wanted* behaviour of 'intr' the default (and disabling the unwanted behaviour). With 'mand', we're losing the behaviour entirely, and it's plausible that someone might care. Maybe something more like the old sys_bdflush implementation? if (msg_count < 5) { msg_count++; printk(KERN_INFO "warning: process `%s' used the obsolete bdflush" " system call\n", current->comm); printk(KERN_INFO "Fix your initscripts?\n"); }
On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > Now that I think about it a little more, I actually did get one > > complaint a few years ago: > > > > Someone had upgraded from an earlier distro that supported the -o mand > > mount option to a later one that had disabled it, and they had an (old) > > fstab entry that specified it. > > Hmm. We might be able to turn the "return -EINVAL" into just a warning. > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we > already do that > > VFS: "mand" mount option not supported > > warning print, but then we fail the mount. > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make > sense to turn that warning into something bigger, but then let the > mount continue - since now that "mand" flag would be purely a legacy > thing. > > And yes, if we do that, we'd want the warning to be a big ugly thing, > just to make people very aware of it happening. Right now it's a > one-liner that is easy to miss, and the "oh, the mount failed" is the > thing that hopefully informs people about the fact that they need to > enable CONFIG_MANDATORY_FILE_LOCKING. > > The logic being that if you can no longer enable mandatory locking in > the kernel, the current hard failure seems overly aggressive (and > might cause boot failures and inability to fix/report things when it > possibly keeps you from using the system at all). > Allow me to play the devil's advocate here - if fstab has '-o mand' we have no way of knowing if any application is relying on '-o mand' and adding more !!!!! to the warning is mostly good for clearing our conscious ;-) Not saying we cannot resort to that and not saying there is an easy solution, but there is one more solution to consider - force rdonly mount. Yes, it could break some systems and possibly fail boot, but then again an ext4 fs can already become rdonly due to errors, so it wouldn't be the first time that sysadmins/users run into this behavior. Thanks, Amir.
On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > Now that I think about it a little more, I actually did get one > > > complaint a few years ago: > > > > > > Someone had upgraded from an earlier distro that supported the -o mand > > > mount option to a later one that had disabled it, and they had an (old) > > > fstab entry that specified it. > > > > Hmm. We might be able to turn the "return -EINVAL" into just a warning. > > > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we > > already do that > > > > VFS: "mand" mount option not supported > > > > warning print, but then we fail the mount. > > > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make > > sense to turn that warning into something bigger, but then let the > > mount continue - since now that "mand" flag would be purely a legacy > > thing. > > > > And yes, if we do that, we'd want the warning to be a big ugly thing, > > just to make people very aware of it happening. Right now it's a > > one-liner that is easy to miss, and the "oh, the mount failed" is the > > thing that hopefully informs people about the fact that they need to > > enable CONFIG_MANDATORY_FILE_LOCKING. > > > > The logic being that if you can no longer enable mandatory locking in > > the kernel, the current hard failure seems overly aggressive (and > > might cause boot failures and inability to fix/report things when it > > possibly keeps you from using the system at all). > > > > Allow me to play the devil's advocate here - if fstab has '-o mand' we have > no way of knowing if any application is relying on '-o mand' and adding > more !!!!! to the warning is mostly good for clearing our conscious ;-) > > Not saying we cannot resort to that and not saying there is an easy > solution, but there is one more solution to consider - force rdonly mount. > Yes, it could break some systems and possibly fail boot, but then again > an ext4 fs can already become rdonly due to errors, so it wouldn't > be the first time that sysadmins/users run into this behavior. > Adding an anecdote - this week I got a report from field support engineers about failure to assemble a RAID0 array, which led to this warning that *requires* user intervention, in the worse case for boot device it requires changing kernel boot params: md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting md/raid0: please set raid.default_layout to 1 or 2 c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion. There is no way I would have gotten this report from the field if a failure was not involved... The rdonly mount is only needed to get the attention of support people to look the the kernel logs and find the warning - at this point, not too many !!!!! are needed ;-) So we could make 'mand' an alias to 'ro' and print a warning that says: "'mand' mount option is deprecated, please fix your init scripts. For caution, your filesystem was mounted rdonly, feel free to remount rw and move on..." Thanks, Amir.
From: NeilBrown > Sent: 20 August 2021 04:45 ... > O_DENYREAD is an insane flag. If a process reads a file that some other > process is working on, then the only which could be hurt is the reader. > So allowing a process to ask for the open to fail if someone is writing > might make sense. Insisting that all opens fail does not. > Any code wanting O_DENYREAD *should* use advisory locking, and any code > wanting to know about read denial should too. It might make sense if O_DENYREAD | O_DENYWRITE | O_RDWR are all set. That would be what O_EXCL ought to mean for a normal file. So would be useful for a program that wants to update a config file. ... > It would be nice to be able to combine O_DENYWRITE with O_RDWR. This > combination is exactly what the kernel *should* do for swap files. I suspect that is a common usage - eg for updating a file that contains a log file sequence number. ... > I'm not sure about O_DENYDELETE. It is a lock on the name. Unix has > traditionally used lock-files to lock a name. The functionality makes > sense for processes with write-access to the directory... I'm not sure it makes any sense on filesystems that use inode numbers. Which name would you protect, and how would you manage to do the test. On windows O_DENYDELETE is pretty much the default. Which is why software updates are such a PITA. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
From: Linus Torvalds > Sent: 19 August 2021 23:33 > > On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > What sort of big, ugly warning did you have in mind? > > I originally thought WARN_ON_ONCE() just to get the distro automatic > error handling involved, but it would probably be a big problem for > the people who end up having panic-on-warn or something. Even panic-on-oops is a PITA. Took us weeks to realise that a customer system that was randomly rebooting was 'just' having a boring NULL pointer access. > So probably just a "make it a big box" thing that stands out, kind of > what lockdep etc does with > > pr_warn("======...====\n"); > > around the messages.. > > I don't know if distros have some pattern we could use that would end > up being something that gets reported to the user? Will users even see it? A lot of recent distro installs try very hard to hide all the kernel messages. OTOH I guess '-o mand' is unlikely to be set on any of those systems. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, 2021-08-20 at 10:14 +0300, Amir Goldstein wrote: > On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > Now that I think about it a little more, I actually did get one > > > > complaint a few years ago: > > > > > > > > Someone had upgraded from an earlier distro that supported the -o mand > > > > mount option to a later one that had disabled it, and they had an (old) > > > > fstab entry that specified it. > > > > > > Hmm. We might be able to turn the "return -EINVAL" into just a warning. > > > > > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we > > > already do that > > > > > > VFS: "mand" mount option not supported > > > > > > warning print, but then we fail the mount. > > > > > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make > > > sense to turn that warning into something bigger, but then let the > > > mount continue - since now that "mand" flag would be purely a legacy > > > thing. > > > > > > And yes, if we do that, we'd want the warning to be a big ugly thing, > > > just to make people very aware of it happening. Right now it's a > > > one-liner that is easy to miss, and the "oh, the mount failed" is the > > > thing that hopefully informs people about the fact that they need to > > > enable CONFIG_MANDATORY_FILE_LOCKING. > > > > > > The logic being that if you can no longer enable mandatory locking in > > > the kernel, the current hard failure seems overly aggressive (and > > > might cause boot failures and inability to fix/report things when it > > > possibly keeps you from using the system at all). > > > > > > > Allow me to play the devil's advocate here - if fstab has '-o mand' we have > > no way of knowing if any application is relying on '-o mand' and adding > > more !!!!! to the warning is mostly good for clearing our conscious ;-) > > > > Not saying we cannot resort to that and not saying there is an easy > > solution, but there is one more solution to consider - force rdonly mount. > > Yes, it could break some systems and possibly fail boot, but then again > > an ext4 fs can already become rdonly due to errors, so it wouldn't > > be the first time that sysadmins/users run into this behavior. > > > > Adding an anecdote - this week I got a report from field support > engineers about failure to assemble a RAID0 array, which led to this > warning that *requires* user intervention, in the worse case for boot > device it requires changing kernel boot params: > > md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting > md/raid0: please set raid.default_layout to 1 or 2 > > c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion. > > There is no way I would have gotten this report from the field if a failure > was not involved... > > The rdonly mount is only needed to get the attention of support people > to look the the kernel logs and find the warning - at this point, not too > many !!!!! are needed ;-) > > So we could make 'mand' an alias to 'ro' and print a warning that says: > "'mand' mount option is deprecated, please fix your init scripts. > For caution, your filesystem was mounted rdonly, feel free to remount > rw and move on..." That is a possibility, but I'm not sure it's any better than just failing the mount. We could also just keep the code around and throw a big, scary warning about its impending removal for a few releases before ripping it out completely (like Willy T. was suggesting). I'm fine with any of these approaches if the consensus is that it's too risky to just remove it. OTOH, I've yet to ever hear of any application that uses this feature, even in a historical sense. You have to jump through so many hoops that nothing can rely on having it available.
On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote: > I'm fine with any of these approaches if the consensus is that it's too > risky to just remove it. OTOH, I've yet to ever hear of any application > that uses this feature, even in a historical sense. Honestly, I agree. Some have fun of me because I'm often using old stuff, but I don't even remember having used an application that made use of mandatory locking. I remember having enabled it myself in my kernels long ago after discovering its existence in the man pages, just to test it. It doesn't rule out the possibility that it exists somewhere though, but I think that the immediate removal combined with the big fat warning in previous branches should be largely enough to avoid the last minute surprise. Willy
On Thu, 2021-08-19 at 10:33 -0400, J. Bruce Fields wrote: > On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote: > > bfields@fieldses.org (J. Bruce Fields) writes: > > > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote: > > > > I’ll bite. How about we attack this in the opposite direction: remove > > > > the deny write mechanism entirely. > > > > > > For what it's worth, Windows has open flags that allow denying read or > > > write opens. They also made their way into the NFSv4 protocol, but > > > knfsd enforces them only against other NFSv4 clients. Last I checked, > > > Samba attempted to emulate them using flock (and there's a comment to > > > that effect on the flock syscall in fs/locks.c). I don't know what Wine > > > does. > > > > > > Pavel Shilovsky posted flags adding O_DENY* flags years ago: > > > > > > https://lwn.net/Articles/581005/ > > > > > > I keep thinking I should look back at those some day but will probably > > > never get to it. > > > > > > I've no idea how Windows applications use them, though I'm told it's > > > common. > > > > I don't know in any detail. I just have this memory of not being able > > to open or do anything with a file on windows while any application has > > it open. > > > > We limit mandatory locks to filesystems that have the proper mount flag > > and files that are sgid but are not executable. Reusing that limit we > > could probably allow such a behavior in Linux without causing chaos. > > I'm pretty confused about how we're using the term "mandatory locking". > > The locks you're thinking of are basically ordinary posix byte-range > locks which we attempt to enforce as mandatory under certain conditions > (e.g. in fs/read_write.c:rw_verify_area). That means we have to check > them on ordinary reads and writes, which is a pain in the butt. (And we > don't manage to do it correctly--the code just checks for the existence > of a conflicting lock before performing IO, ignoring the obvious > time-of-check/time-of-use race.) > Yeah, the locks we're talking about are the locks described in: Documentation/filesystems/mandatory-locking.rst They've always been racy. You have to mount the fs with '-o mand' and set a special mode on the file (setgid bit set, with group execute bit cleared). It's a crazypants interface. > This has nothing to do with Windows share locks which from what I > understand are whole-file locks that are only enforced against opens. > Yep. Those are different. Confusingly, there is also LOCK_MAND|LOCK_READ|LOCK_WRITE for flock(), which are purported to be for emulating Windows share modes. They aren't really mandatory though. > --b. > > > Without being very strict about which files can participate I can just > > imagine someone hiding their presence by not allowing other applications > > the ability to write to utmp or a log file. > > > > In the windows world where everything evolved with those kinds of > > restrictions it is probably fine (although super annoying). > > > > Eric
On Fri, 2021-08-20 at 14:38 +0200, Willy Tarreau wrote: > On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote: > > I'm fine with any of these approaches if the consensus is that it's too > > risky to just remove it. OTOH, I've yet to ever hear of any application > > that uses this feature, even in a historical sense. > > Honestly, I agree. Some have fun of me because I'm often using old > stuff, but I don't even remember having used an application that > made use of mandatory locking. I remember having enabled it myself in > my kernels long ago after discovering its existence in the man pages, > just to test it. It doesn't rule out the possibility that it exists > somewhere though, but I think that the immediate removal combined > with the big fat warning in previous branches should be largely > enough to avoid the last minute surprise. > Good point. It wouldn't hurt to push such a warning into stable kernels at the same time. There always is a lag when we do something like this before some downstream user notices.
On Fri, Aug 20, 2021 at 09:03:16AM -0400, Jeff Layton wrote: > Good point. It wouldn't hurt to push such a warning into stable kernels > at the same time. There always is a lag when we do something like this > before some downstream user notices. Yes, that's why I proposed this. A warning can be backported into stable without big consequences except warning future victims... Willy
On Thu, 19 Aug 2021 15:32:31 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > I originally thought WARN_ON_ONCE() just to get the distro automatic > error handling involved, but it would probably be a big problem for > the people who end up having panic-on-warn or something. > > So probably just a "make it a big box" thing that stands out, kind of > what lockdep etc does with > > pr_warn("======...====\n"); > > around the messages.. > > I don't know if distros have some pattern we could use that would end > up being something that gets reported to the user? People have started using my trace-printk notice message, that seems to be big enough to get noticed. ********************************************************** ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** ** ** ** trace_printk() being used. Allocating extra memory. ** ** ** ** This means that this is a DEBUG kernel and it is ** ** unsafe for production use. ** ** ** ** If you see this message and you are not debugging ** ** the kernel, report this immediately to your vendor! ** ** ** ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** ********************************************************** There's been some talk about making that a more "generic" warning message too. -- Steve
On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 19 Aug 2021 15:32:31 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > I don't know if distros have some pattern we could use that would end > > up being something that gets reported to the user? > > People have started using my trace-printk notice message, that seems to > be big enough to get noticed. Well, I think people who use ftrace are m,ore likely to look at kernel messages than most... So what would be more interesting is if there's some distro support for showing kernel notifications.. I see new notifications for calendar events, for devices that got mounted, for a lot of things - so I'm really wondering if somebody already perhaps had something for specially formatted kernel messages.. Linus
On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote: > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > I'm all for ripping it out too. It's an insane interface anyway. > > > > I've not heard a single complaint about this being turned off in > > fedora/rhel or any other distro that has this disabled. > > I'd love to remove it, we could absolutely test it. The fact that > several major distros have it disabled makes me think it's fine. FWIW, it is now disabled in Ubuntu too: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b > But as always, it would be good to check Android. It looks like it's enabled (checking the Pixel 4 kernel image), but it's not specifically mentioned in any of the build configs that are used to construct the image, so I think this is just catching the "default y". I expect it'd be fine to turn this off. I will ask around to see if it's actually used.
I thought the main user was Samba and/or otherwise providing file service for M$ systems? On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote: >On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote: >> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: >> > >> > I'm all for ripping it out too. It's an insane interface anyway. >> > >> > I've not heard a single complaint about this being turned off in >> > fedora/rhel or any other distro that has this disabled. >> >> I'd love to remove it, we could absolutely test it. The fact that >> several major distros have it disabled makes me think it's fine. > >FWIW, it is now disabled in Ubuntu too: > >https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b > >> But as always, it would be good to check Android. > >It looks like it's enabled (checking the Pixel 4 kernel image), but it's >not specifically mentioned in any of the build configs that are used to >construct the image, so I think this is just catching the "default y". I >expect it'd be fine to turn this off. > >I will ask around to see if it's actually used. >
No, Windows has deny-mode locking at open time, but the kernel's mandatory locks are enforced during read/write (which is why they are such a pain). Samba will not miss these at all. If we want something to provide windows-like semantics, we'd probably want to start with something like Pavel Shilovsky's O_DENY_* patches. -- Jeff On Fri, 2021-08-20 at 12:17 -0700, H. Peter Anvin wrote: > I thought the main user was Samba and/or otherwise providing file service for M$ systems? > > On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote: > > On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote: > > > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > I'm all for ripping it out too. It's an insane interface anyway. > > > > > > > > I've not heard a single complaint about this being turned off in > > > > fedora/rhel or any other distro that has this disabled. > > > > > > I'd love to remove it, we could absolutely test it. The fact that > > > several major distros have it disabled makes me think it's fine. > > > > FWIW, it is now disabled in Ubuntu too: > > > > https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b > > > > > But as always, it would be good to check Android. > > > > It looks like it's enabled (checking the Pixel 4 kernel image), but it's > > not specifically mentioned in any of the build configs that are used to > > construct the image, so I think this is just catching the "default y". I > > expect it'd be fine to turn this off. > > > > I will ask around to see if it's actually used. > > >
On Fri, Aug 20, 2021 at 12:17:49PM -0700, H. Peter Anvin wrote:
> I thought the main user was Samba and/or otherwise providing file service for M$ systems?
When I asked around about this in ~2001, the only example anyoe was able
to come up with was some database that I no longer remember the name of.
On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote: > No, Windows has deny-mode locking at open time, but the kernel's > mandatory locks are enforced during read/write (which is why they are > such a pain). Samba will not miss these at all. > > If we want something to provide windows-like semantics, we'd probably > want to start with something like Pavel Shilovsky's O_DENY_* patches. > > -- Jeff > Doh! It completely slipped my mind about byte-range locks on windows... Those are mandatory and they do block read and write activity to the ranges locked. They have weird semantics vs. POSIX locks (they stack instead of splitting/merging, etc.). Samba emulates these with (advisory) POSIX locks in most cases. Using mandatory locks is probably possible, but I think it would add more potential for deadlock and security issues.
On Fri, Aug 20, 2021 at 10:30 AM David Laight <David.Laight@aculab.com> wrote: > From: Linus Torvalds > > Sent: 19 August 2021 23:33 > > > > On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > What sort of big, ugly warning did you have in mind? > > > > I originally thought WARN_ON_ONCE() just to get the distro automatic > > error handling involved, but it would probably be a big problem for > > the people who end up having panic-on-warn or something. > > Even panic-on-oops is a PITA. > Took us weeks to realise that a customer system that was randomly > rebooting was 'just' having a boring NULL pointer access. > > > So probably just a "make it a big box" thing that stands out, kind of > > what lockdep etc does with > > > > pr_warn("======...====\n"); > > > > around the messages.. Do we really need more of these? They take time to print (especially on serial consoles) and increase kernel size. What's wrong with using an appropriate KERN_*, and letting userspace make sure the admin/user will see the message (see below)? > > > > I don't know if distros have some pattern we could use that would end > > up being something that gets reported to the user? > > Will users even see it? > A lot of recent distro installs try very hard to hide all the kernel > messages. Exactly. E.g. Ubuntu doesn't show any kernel output during normal operation. On Fri, Aug 20, 2021 at 6:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 19 Aug 2021 15:32:31 -0700 > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > I don't know if distros have some pattern we could use that would end > > > up being something that gets reported to the user? > So what would be more interesting is if there's some distro support > for showing kernel notifications.. > > I see new notifications for calendar events, for devices that got > mounted, for a lot of things - so I'm really wondering if somebody > already perhaps had something for specially formatted kernel > messages.. Isn't that what the old syslog and the new systemd are supposed to handle in userspace? Gr{oetje,eeting}s, Geert
From: Geert Uytterhoeven > Sent: 23 August 2021 08:56 ... > Exactly. E.g. Ubuntu doesn't show any kernel output during normal > operation. Current ubuntu (x86) is getting to be a PITA. It even runs the graphical login on tty0 - which is where the kernel messages would end up. It is almost impossible to get an actual VGA console. I need to find a different distro that has less bloat in it. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, Aug 21, 2021 at 08:45:54AM -0400, Jeff Layton wrote: > On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote: > > No, Windows has deny-mode locking at open time, but the kernel's > > mandatory locks are enforced during read/write (which is why they are > > such a pain). Samba will not miss these at all. > > > > If we want something to provide windows-like semantics, we'd probably > > want to start with something like Pavel Shilovsky's O_DENY_* patches. > > > > -- Jeff > > > > Doh! It completely slipped my mind about byte-range locks on windows... > > Those are mandatory and they do block read and write activity to the > ranges locked. They have weird semantics vs. POSIX locks (they stack > instead of splitting/merging, etc.). > > Samba emulates these with (advisory) POSIX locks in most cases. Using > mandatory locks is probably possible, but I think it would add more > potential for deadlock and security issues. Right, so Windows byte-range locks are different from Windows open deny modes. But even if somebody wanted to implement them, I doubt they'd start with the mandatory locking code you're removing here, so I think they're irrelevant to this discussion. --b.
On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote: > On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. > > I think that would be ok, except I can see somebody relying on it. > > It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it. Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does: fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC); write(fd, some stuff); <--- problem is here close(fd); execve("some exefile"); Another thread does: fork(); execve("something else"); In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example: https://github.com/golang/go/issues/22315 I propose we get rid of deny_write_access() completely to solve this. Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons. (OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?) > > But you are right that we have removed parts of it over time (no more > MAP_DENYWRITE, no more uselib()) so that what we have today is a > fairly weak form of what we used to do. > > And nobody really complained when we weakened it, so maybe removing it > entirely might be acceptable. > > Linus >
On 26.08.21 19:48, Andy Lutomirski wrote: > On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote: >> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: >>> >>> I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. >> >> I think that would be ok, except I can see somebody relying on it. >> >> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. > > Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it. > > Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does: > > fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC); > write(fd, some stuff); > > <--- problem is here > > close(fd); > execve("some exefile"); > > Another thread does: > > fork(); > execve("something else"); > > In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example: > > https://github.com/golang/go/issues/22315 > > I propose we get rid of deny_write_access() completely to solve this. > > Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons. > > (OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?) > It's not like this issue is new (^2017) or relevant in practice. So no need to hurry IMHO. One step at a time: it might make perfect sense to remove ETXTBSY, but we have to be careful to not break other user space that actually cares about the current behavior in practice.
David Hildenbrand <david@redhat.com> writes: > On 26.08.21 19:48, Andy Lutomirski wrote: >> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote: >>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: >>>> >>>> I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. >>> >>> I think that would be ok, except I can see somebody relying on it. >>> >>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. >> >> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it. >> >> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does: >> >> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC); >> write(fd, some stuff); >> >> <--- problem is here >> >> close(fd); >> execve("some exefile"); >> >> Another thread does: >> >> fork(); >> execve("something else"); >> >> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example: >> >> https://github.com/golang/go/issues/22315 >> >> I propose we get rid of deny_write_access() completely to solve this. >> >> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons. >> >> (OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?) >> > > It's not like this issue is new (^2017) or relevant in practice. So no > need to hurry IMHO. One step at a time: it might make perfect sense to > remove ETXTBSY, but we have to be careful to not break other user > space that actually cares about the current behavior in practice. It is an old enough issue that I agree there is no need to hurry. I also ran into this issue not too long ago when I refactored the usermode_driver code. My challenge was not being in userspace the delayed fput was not happening in my kernel thread. Which meant that writing the file, then closing the file, then execing the file consistently reported -ETXTBSY. The kernel code wound up doing: /* Flush delayed fput so exec can open the file read-only */ flush_delayed_fput(); task_work_run(); As I read the code the delay for userspace file descriptors is always done with task_work_add, so userspace should not hit that kind of silliness, and should be able to actually close the file descriptor before the exec. On the flip side, I don't know how anything can depend upon getting an -ETXTBSY. So I don't think there is any real risk of breaking userspace if we remove it. Eric
From: Eric W. Biederman > Sent: 26 August 2021 23:14 ... > I also ran into this issue not too long ago when I refactored the > usermode_driver code. My challenge was not being in userspace > the delayed fput was not happening in my kernel thread. Which meant > that writing the file, then closing the file, then execing the file > consistently reported -ETXTBSY. > > The kernel code wound up doing: > /* Flush delayed fput so exec can open the file read-only */ > flush_delayed_fput(); > task_work_run(); > > As I read the code the delay for userspace file descriptors is > always done with task_work_add, so userspace should not hit > that kind of silliness, and should be able to actually close > the file descriptor before the exec. If task_work_add ends up adding it to a task that is already running on a different cpu, and that cpu takes a hardware interrupt that takes some time and/or schedules the softint code to run immediately the hardware interrupt completes then it may well be possible for userspace to have 'issues'. Any flags associated with O_DENY_WRITE would need to be cleared synchronously in the close() rather then in any delayed fput(). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, Aug 26, 2021 at 11:47:07PM +0200, David Hildenbrand wrote: > On 26.08.21 19:48, Andy Lutomirski wrote: > > On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote: > > > On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. > > > > > > I think that would be ok, except I can see somebody relying on it. > > > > > > It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. > > > > Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it. > > > > Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does: > > > > fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC); > > write(fd, some stuff); > > > > <--- problem is here > > > > close(fd); > > execve("some exefile"); > > > > Another thread does: > > > > fork(); > > execve("something else"); > > > > In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example: > > > > https://github.com/golang/go/issues/22315 > > > > I propose we get rid of deny_write_access() completely to solve this. > > > > Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons. > > > > (OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?) > > > > It's not like this issue is new (^2017) or relevant in practice. So no need > to hurry IMHO. One step at a time: it might make perfect sense to remove > ETXTBSY, but we have to be careful to not break other user space that > actually cares about the current behavior in practice. I agree. As I at least tried to show, removing write-protection can make some exploits easier. I'm all for trying to remove this if it simplifies things but for sure this shouldn't be part of this patchset and we should be careful about it. The removal of a (misguided or only partially functioning) protection mechanism doesn't introduce but removes a failure point. And I don't think removal and addition of a failure point usually have the same consequences. Introducing a new failure point will often mean userspace quickly detects regressions. Such regressions are pretty common due to security fixes we introduce. Recent examples include [1]. Right after this was merged the regression was reported. But when allowing behavior that used to fail like ETXTBSY it can be difficult for userspace to detect such regressions. The reason for that is quite often that userspace applications don't tend to do something that they know upfront will fail. Attackers however might. [1]: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener") Christian
David Laight <David.Laight@ACULAB.COM> writes: > From: Eric W. Biederman >> Sent: 26 August 2021 23:14 > ... >> I also ran into this issue not too long ago when I refactored the >> usermode_driver code. My challenge was not being in userspace >> the delayed fput was not happening in my kernel thread. Which meant >> that writing the file, then closing the file, then execing the file >> consistently reported -ETXTBSY. >> >> The kernel code wound up doing: >> /* Flush delayed fput so exec can open the file read-only */ >> flush_delayed_fput(); >> task_work_run(); >> >> As I read the code the delay for userspace file descriptors is >> always done with task_work_add, so userspace should not hit >> that kind of silliness, and should be able to actually close >> the file descriptor before the exec. > > If task_work_add ends up adding it to a task that is already > running on a different cpu, and that cpu takes a hardware > interrupt that takes some time and/or schedules the softint > code to run immediately the hardware interrupt completes > then it may well be possible for userspace to have 'issues'. It it task_work_add(current). Which punts the work to the return to userspace. > Any flags associated with O_DENY_WRITE would need to be cleared > synchronously in the close() rather then in any delayed fput(). Eric
On 27.08.21 00:13, Eric W. Biederman wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 26.08.21 19:48, Andy Lutomirski wrote: >>> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote: >>>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote: >>>>> >>>>> I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely. >>>> >>>> I think that would be ok, except I can see somebody relying on it. >>>> >>>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. >>> >>> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it. >>> >>> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does: >>> >>> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC); >>> write(fd, some stuff); >>> >>> <--- problem is here >>> >>> close(fd); >>> execve("some exefile"); >>> >>> Another thread does: >>> >>> fork(); >>> execve("something else"); >>> >>> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example: >>> >>> https://github.com/golang/go/issues/22315 >>> >>> I propose we get rid of deny_write_access() completely to solve this. >>> >>> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons. >>> >>> (OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?) >>> >> >> It's not like this issue is new (^2017) or relevant in practice. So no >> need to hurry IMHO. One step at a time: it might make perfect sense to >> remove ETXTBSY, but we have to be careful to not break other user >> space that actually cares about the current behavior in practice. > > It is an old enough issue that I agree there is no need to hurry. > > I also ran into this issue not too long ago when I refactored the > usermode_driver code. My challenge was not being in userspace > the delayed fput was not happening in my kernel thread. Which meant > that writing the file, then closing the file, then execing the file > consistently reported -ETXTBSY. > > The kernel code wound up doing: > /* Flush delayed fput so exec can open the file read-only */ > flush_delayed_fput(); > task_work_run(); > > As I read the code the delay for userspace file descriptors is > always done with task_work_add, so userspace should not hit > that kind of silliness, and should be able to actually close > the file descriptor before the exec. > > > On the flip side, I don't know how anything can depend upon getting an > -ETXTBSY. So I don't think there is any real risk of breaking userspace > if we remove it. At least in LTP, we have two test cases testing exactly that behavior: testcases/kernel/syscalls/creat/creat07.c testcases/kernel/syscalls/execve/execve04.c