diff mbox series

c/r: prctl: Remove PR_SET_MM_EXE_FILE old file mapping restriction

Message ID 20211215062342.GA1548576@juliacomputing.com (mailing list archive)
State New
Headers show
Series c/r: prctl: Remove PR_SET_MM_EXE_FILE old file mapping restriction | expand

Commit Message

Keno Fischer Dec. 15, 2021, 6:23 a.m. UTC
* What this patch does

This patch changes the behavior of replace_mm_exe_file to remove the
restriction that the current mm may not have any mappings of the
previous exe_file. This restriction has a bit of a complicated history,
which I will summarize below, but the upshot is that whatever value it
may have had when originally introduced (and it is worth pointing out
that the history does not suggest it was ever seen as a security
feature) - in its current state, the restriction is essentially useless
and merely forces userspace into awkward contusions (and extra system
calls) to be able to use it.

* Context/History

The /proc/<pid>/exe symlink provides access to the file that was used
to execve <pid>. It is used for example by gdb to find the on-disk location
of the executed binary and read its debug information.

Originally, the /proc/<pid>/exe symlink was immutable, set by the
kernel upon execve and never changed again. However, in b32dfe377
("c/r: prctl: add ability to set new mm_struct::exe_file"), `prctl`
gained the ability to modify this symlink for use by c/r, under a couple
of restrictions:

1. The process contains no mappings marked as VM_EXECUTABLE (i.e. mappings
   created in execve or by splitting mappings thereof).
2. The new file has appropriate access permissions
3. The call may only be made once
4. The calling process has CAP_SYS_RESOURCE

The restriction we're considering here is point 1. For completeness, I
will note that restriction 3 was subsequently dropped, and restriction 4
was expanded to also allow the local user namespace's root to perform
the operation (as long as this was done using `PR_SET_MM_MAP`).

On restriction 1, the original commit notes that:

    Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
    vmas present for current process, simply because this feature is a special
    to C/R and mm::num_exe_file_vmas become meaningless after that.

The `num_exe_file_vmas` counter was a refcount for the number of mapped VMAs
with the VM_EXECUTABLE flag set. It was used to drop the reference of
/proc/<pid>/exe to the execve'd file if all mappings to it created in
sys_execve were subsequently removed. Thus, as best I can tell, this restriction
was simply a convenience to avoid the additional complexity of correctly handling
non-zero `num_exe_file_vmas` while updating the exe_file.

However, `num_exe_file_vmas` was removed a few months later in e9714acf8c
("mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas") with the
justification that nobody depended on it and the functionality could
be replaced by an appropriate use of PR_SET_MM_EXE_FILE.

Because of this change, the restriction was updated in bafb282d ("c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal") to
not allow any mappings present in the memory map of the process other
than the *new* exe_file (a more strict restriction than the original
restriction) and then again in 4229fb1dc ("c/r: prctl: less paranoid
prctl_set_mm_exe_file()") to disallow any mappings of the old exe_file
(still a more strict restriction than the original restriction on
VM_EXECUTABLE mappings, as now any mapping with the same path would
be forbidden not just those created in execve).

It is worth noting that at this point the check for mappings of the
original file and the modification to mm->exe_file were still protected
by the mm's mmap_sem and thus atomic with respect to other modifications
of the mm. However, this too was changed in 6e399cd14 ("prctl: avoid using mmap_sem
for exe_file serialization") and the prctl now separately acquires the
mm's read sema just for the purpose of enforcing the restriction (but
does not enforce any sort of atomicity with respect to the update of
the exe_file).

Except for minor refactorings, this is essentially the state of the
restriction in today's kernel. It appears to me that this was originally
a technical restriction to avoid additional complexity from the interaction
with VM_EXECTUABLE, but when this was removed the question of whether the
restriction was still sensible was not revisited. I searched around for
any additional justifications for this restriction, but could not
find any, and given the lack of enforced atomicity, it does not seem
that any guarantees are actually provided in practice.

* The use case for dropping the restriction

Apart from a general dislike for executing unnecessary code, there are
some practical reasons to want to drop the restriction. In particular,
it is currently awkward to call PR_SET_MM_MAP from an executable itself.
In the original c/r usecase, the restorer was a ptracer and the original
exe_file merely a stub that essentially did nothing, so it was no trouble
to unmap it completely, However, there are a few usecases where
PR_SET_MM_MAP would be useful that are not ptracers.

One such use case are preloaders that run before ld.so and the main
executable in order to control the memory layout. Wine has such a
preloader, but they are also useful to control memory layout for
debugging purposes. Another use case are non-ptrace checkpoint/restore
systems (ptrace is powerful, but not particularly performant, so
c/r systems that are ok with some state changing can gain
performance by not using it).

It is of course possible to use PR_SET_MM_MAP in these contexts
by relocating the executable to private memory and unmapping the
original, but this introduces additional unnecessary complexity for
what appears to be no good reason.

* Summary

As far as I can tell, the restriction against mappings of the old
exe_file in PR_SET_MM_EXE_FILE/PR_SET_MM_MAP exists for no good
reason, but is simply an artifact of its development process.
Because it makes it hard to use this APIs in legitimate contexts
I propose that the restriction be dropped.

Signed-off-by: Keno Fischer <keno@juliacomputing.com>
---
 kernel/fork.c | 18 ------------------
 1 file changed, 18 deletions(-)
diff mbox series

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..11e01dae8bbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1203,27 +1203,9 @@  int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
  */
 int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 {
-	struct vm_area_struct *vma;
 	struct file *old_exe_file;
 	int ret = 0;
 
-	/* Forbid mm->exe_file change if old file still mapped. */
-	old_exe_file = get_mm_exe_file(mm);
-	if (old_exe_file) {
-		mmap_read_lock(mm);
-		for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
-			if (!vma->vm_file)
-				continue;
-			if (path_equal(&vma->vm_file->f_path,
-				       &old_exe_file->f_path))
-				ret = -EBUSY;
-		}
-		mmap_read_unlock(mm);
-		fput(old_exe_file);
-		if (ret)
-			return ret;
-	}
-
 	/* set the new file, lockless */
 	ret = deny_write_access(new_exe_file);
 	if (ret)