[v2,7/8] exec: Generic execfd support
diff mbox series

Message ID 87y2poyd91.fsf_-_@x220.int.ebiederm.org
State New
Headers show
Series
  • exec: Control flow simplifications
Related show

Commit Message

Eric W. Biederman May 19, 2020, 12:33 a.m. UTC
Most of the support for passing the file descriptor of an executable
to an interpreter already lives in the generic code and in binfmt_elf.
Rework the fields in binfmt_elf that deal with executable file
descriptor passing to make executable file descriptor passing a first
class concept.

Move the fd_install from binfmt_misc into begin_new_exec after the new
creds have been installed.  This means that accessing the file through
/proc/<pid>/fd/N is able to see the creds for the new executable
before allowing access to the new executables files.

Performing the install of the executables file descriptor after
the point of no return also means that nothing special needs to
be done on error.  The exiting of the process will close all
of it's open files.

Move the would_dump from binfmt_misc into begin_new_exec right
after would_dump is called on the bprm->file.  This makes it
obvious this case exists and that no nesting of bprm->file is
currently supported.

In binfmt_misc the movement of fd_install into generic code means
that it's special error exit path is no longer needed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/binfmt_elf.c         |  4 ++--
 fs/binfmt_elf_fdpic.c   |  4 ++--
 fs/binfmt_misc.c        | 40 ++++++++--------------------------------
 fs/exec.c               | 15 +++++++++++++++
 include/linux/binfmts.h | 10 +++++-----
 5 files changed, 32 insertions(+), 41 deletions(-)

Comments

Kees Cook May 19, 2020, 7:46 p.m. UTC | #1
On Mon, May 18, 2020 at 07:33:46PM -0500, Eric W. Biederman wrote:
> 
> Most of the support for passing the file descriptor of an executable
> to an interpreter already lives in the generic code and in binfmt_elf.
> Rework the fields in binfmt_elf that deal with executable file
> descriptor passing to make executable file descriptor passing a first
> class concept.
> 
> Move the fd_install from binfmt_misc into begin_new_exec after the new
> creds have been installed.  This means that accessing the file through
> /proc/<pid>/fd/N is able to see the creds for the new executable
> before allowing access to the new executables files.
> 
> Performing the install of the executables file descriptor after
> the point of no return also means that nothing special needs to
> be done on error.  The exiting of the process will close all
> of it's open files.
> 
> Move the would_dump from binfmt_misc into begin_new_exec right
> after would_dump is called on the bprm->file.  This makes it
> obvious this case exists and that no nesting of bprm->file is
> currently supported.
> 
> In binfmt_misc the movement of fd_install into generic code means
> that it's special error exit path is no longer needed.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Yes, this is so much nicer. :) My head did spin a little between changing
the management of bprm->executable between this patch and the next,
but I'm okay now. ;)

Reviewed-by: Kees Cook <keescook@chromium.org>

nits/thoughts below...

> [...]
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 8c7779d6bf19..653508b25815 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> [...]
> @@ -48,6 +51,7 @@ struct linux_binprm {
>  	unsigned int taso:1;
>  #endif
>  	unsigned int recursion_depth; /* only for search_binary_handler() */
> +	struct file * executable; /* Executable to pass to the interpreter */
>  	struct file * file;
>  	struct cred *cred;	/* new credentials */

nit: can we fix the "* " stuff here? This should be *file and *executable.

> [...]
> @@ -69,10 +73,6 @@ struct linux_binprm {
>  #define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
>  #define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
>  
> -/* fd of the binary should be passed to the interpreter */
> -#define BINPRM_FLAGS_EXECFD_BIT 1
> -#define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT)
> -
>  /* filename of the binary will be inaccessible after exec */
>  #define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2
>  #define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT)

nit: may as well renumber BINPRM_FLAGS_PATH_INACCESSIBLE_BIT to 1,
they're not UAPI. And, actually, nothing uses the *_BIT defines, so
probably the entire chunk of code could just be reduced to:

/* either interpreter or executable was unreadable */
#define BINPRM_FLAGS_ENFORCE_NONDUMP    BIT(0)
/* filename of the binary will be inaccessible after exec */
#define BINPRM_FLAGS_PATH_INACCESSIBLE  BIT(1)

Though frankly, I wonder if interp_flags could just be removed in favor
of two new bit members, especially since interp_data is gone:

+               /* Either interpreter or executable was unreadable. */
+               nondumpable:1;
+               /* Filename of the binary will be inaccessible after exec. */
+               path_inaccessible:1;
...
-       unsigned interp_flags;
...etc
Linus Torvalds May 19, 2020, 7:54 p.m. UTC | #2
On Tue, May 19, 2020 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
>
> Though frankly, I wonder if interp_flags could just be removed in favor
> of two new bit members, especially since interp_data is gone:

Yeah, I think that might be a good cleanup - but please keep it as a
separate thing at the end of the series (or maybe the beginning)

                Linus
Eric W. Biederman May 19, 2020, 8:20 p.m. UTC | #3
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, May 19, 2020 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
>>
>> Though frankly, I wonder if interp_flags could just be removed in favor
>> of two new bit members, especially since interp_data is gone:
>
> Yeah, I think that might be a good cleanup - but please keep it as a
> separate thing at the end of the series (or maybe the beginning)

I will.

With a little care we can replace setting BINPRM_FLAGS_ENFORCE_NONDUMP
and clearing bprm->mm->dumpable.

Which is the direction I have been looking.

Now that I think about it I believe that the loop in exec_binprm should
be clearing BINPRM_FLAGS_PATH_INACCESSIBLE as it is only relevant to
fexec/execveat with a close on exec file descriptor.

Eric
Rob Landley May 19, 2020, 9:59 p.m. UTC | #4
On 5/18/20 7:33 PM, Eric W. Biederman wrote:
> 
> Most of the support for passing the file descriptor of an executable
> to an interpreter already lives in the generic code and in binfmt_elf.
> Rework the fields in binfmt_elf that deal with executable file
> descriptor passing to make executable file descriptor passing a first
> class concept.

I was reading this to try to figure out how to do execve(NULL, argv[], envp) to
re-exec self after a vfork() in a chroot with no /proc, and hit the most trivial
quibble ever:

> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1323,7 +1323,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	 */
>  	set_mm_exe_file(bprm->mm, bprm->file);
>  
> +	/* If the binary is not readable than enforce mm->dumpable=0 */

then

Rob
Eric W. Biederman May 20, 2020, 4:05 p.m. UTC | #5
Rob Landley <rob@landley.net> writes:

> On 5/18/20 7:33 PM, Eric W. Biederman wrote:
>> 
>> Most of the support for passing the file descriptor of an executable
>> to an interpreter already lives in the generic code and in binfmt_elf.
>> Rework the fields in binfmt_elf that deal with executable file
>> descriptor passing to make executable file descriptor passing a first
>> class concept.
>
> I was reading this to try to figure out how to do execve(NULL, argv[], envp) to
> re-exec self after a vfork() in a chroot with no /proc, and hit the most trivial
> quibble ever:

We have /proc/self/exe today.  If I understand you correctly you would
like to do the equivalent of 'execve("/proc/self/exe", argv[], envp[])'
without having proc mounted.

The file descriptor is stored in mm->exe_file.

Probably the most straight forward implementation is to allow
execveat(AT_EXE_FILE, ...).

You can look at binfmt_misc for how to reopen an open file descriptor.

>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1323,7 +1323,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>>  	 */
>>  	set_mm_exe_file(bprm->mm, bprm->file);
>>  
>> +	/* If the binary is not readable than enforce mm->dumpable=0 */
>
> then

It took me a minute yes good catch.

Eric
Rob Landley May 21, 2020, 10:50 p.m. UTC | #6
On 5/20/20 11:05 AM, Eric W. Biederman wrote:
> Rob Landley <rob@landley.net> writes:
> 
>> On 5/18/20 7:33 PM, Eric W. Biederman wrote:
>>>
>>> Most of the support for passing the file descriptor of an executable
>>> to an interpreter already lives in the generic code and in binfmt_elf.
>>> Rework the fields in binfmt_elf that deal with executable file
>>> descriptor passing to make executable file descriptor passing a first
>>> class concept.
>>
>> I was reading this to try to figure out how to do execve(NULL, argv[], envp) to
>> re-exec self after a vfork() in a chroot with no /proc, and hit the most trivial
>> quibble ever:
> 
> We have /proc/self/exe today.

Not when you first enter a container that's just created a new namespace, or
initramfs first launches PID 1 and runs a shell script to set up the environment
and your (subshell) and background& support only has vfork and not fork, or just
plain "somebody did a chroot"...

(Yes a nommu system with range registers can want _security_ without
_address_translation_. Strange but true! I haven't actually sat down to try to
implement nommu containers yet, but I've done worse things on many occasions.
Remember: the S in IoT stands for Security.)

> If I understand you correctly you would
> like to do the equivalent of 'execve("/proc/self/exe", argv[], envp[])'
> without having proc mounted.

Toybox would _like_ proc mounted, but can't assume it. I'm writing a new
bash-compatible shell with nommu support, which means in order to do subshell
and background tasks if (!CONFIG_FORK) I need to create a pipe pair, vfork(),
have the child exec itself to unblock the parent, and then read the context data
that just got discarded through the pipe from the parent. ("Wheee." And you can
quote me on that.)

I've implemented that already
(https://github.com/landley/toybox/blob/0.8.3/toys/pending/sh.c#L674 and reentry
is L2516, yeah it's a work in progress), but "exec self" requires /proc/self/exe
and since I gave up on getting
http://lkml.iu.edu/hypermail/linux/kernel/2005.1/09399.html in (I should
apologize to Randy but I just haven't got the spoons to face
https://landley.net/notes-2017.html#14-09-2017 again; three strikes and the
patch stays out) I need /init to be a shell script to set up an initramfs that's
made by pointing CONFIG_INITRAMFS_SOURCE at a directory that was made without
running the build as root, because there's no /dev/console and you can't mknod
as a non-root user.

Maybe instead of fixing CONFIG_DEVTMPFS_MOUNT to apply to initramfs I could
instead add a CONFIG_INITRAMFS_EXTRA=blah.txt to usr/{Kconfig,Makefile} to
append user-supplied extra lines to the end of the gen_initramfs.sh output and
make a /dev/console that way (kinda like genext2fs and mksquashfs), but getting
that in through the linux-kernel bureaucracy means consulting a 27 step
checklist supplementing the basic 17 step submission procedure (with
bibliographic references) explaining how to fill out the forms, perform the
validation steps, go through the proper channels, and get the appropriate series
of signatures and approvals, and I just haven't got the stomach for it anymore.
I was participating here as a hobbyist. Linux-kernel has aged into a rigid
bureaucracy. It's no fun anymore.

Which means any kernel patch I write I have to forward port regularly, sometimes
for a very long time. Heck, I gave linux-kernel three strikes at miniconfig
fifteen years ago now:

  http://lkml.iu.edu/hypermail/linux/kernel/0511.2/0479.html
  https://lwn.net/Articles/161086/
  https://lkml.org/lkml/2006/7/6/404

And was still maintaining it out of tree a decade later:

  https://landley.net/aboriginal/FAQ.html#dev_miniconfig
  https://github.com/landley/aboriginal/blob/master/more/miniconfig.sh

These days I've moved on to a microconfig format that mostly fits on one line,
ala the KCONF= stuff in toybox's built in:

  https://github.com/landley/toybox/blob/master/scripts/mkroot.sh#L136

For example, the User Mode Linux miniconfig from my ancient
https://landley.net/writing/docs/UML.html would translate to microconfig as:

  BINFMT_ELF,HOSTFS,LBD,BLK_DEV,BLK_DEV_LOOP,STDERR_CONSOLE,UNIX98_PTYS,EXT2_FS

The current kernel also needs "64BIT" because my host toolchain doesn't have the
-m32 headers installed, but then it builds fine ala:

make ARCH=um allnoconfig KCONFIG_ALLCONFIG=<(echo
BINFMT_ELF,HOSTFS,LBD,BLK_DEV,BLK_DEV_LOOP,STDERR_CONSOLE,UNIX98_PTYS,EXT2_FS,64BIT
| sed -E 's/([^,]*)(,|$)/CONFIG_\1=y\n/g')

Of course running the resulting ./linux says:

  Checking PROT_EXEC mmap in /dev/shm...Operation not permitted
  /dev/shm must be not mounted noexec

But *shrug*, Devuan did that not me. I haven't really used UML since QEMU
started working. Shouldn't the old "create file, map file, delete file" trick
stop flushing the data to backing store no matter where the file lives? I mean,
that trick dates back to the VAX, and we argued about it on the UML list a
decade ago (circa
https://sourceforge.net/p/user-mode-linux/mailman/message/14000710/) but...
fixing random things that are wrong with Linux is not my problem anymore. I'm
only in this thread because I'm cc'd.

Spending five years repeatedly posting perl removal patches and ending up with
intentional sabotage at the end from the guy who'd added perl in the first place
when the Gratuitous Build Dependency Removal patches finally got traction
(https://landley.net/notes-2013.html#28-03-2013) kinda put me off doing that again.

> The file descriptor is stored in mm->exe_file.
> Probably the most straight forward implementation is to allow
> execveat(AT_EXE_FILE, ...).

Cool, that works.

> You can look at binfmt_misc for how to reopen an open file descriptor.

Added to the todo heap.

Thanks,

Rob
Eric W. Biederman May 22, 2020, 3:28 a.m. UTC | #7
Rob Landley <rob@landley.net> writes:

> On 5/20/20 11:05 AM, Eric W. Biederman wrote:

> Toybox would _like_ proc mounted, but can't assume it. I'm writing a new
> bash-compatible shell with nommu support, which means in order to do subshell
> and background tasks if (!CONFIG_FORK) I need to create a pipe pair, vfork(),
> have the child exec itself to unblock the parent, and then read the context data
> that just got discarded through the pipe from the parent. ("Wheee." And you can
> quote me on that.)

Do you have clone(CLONE_VM) ?  If my quick skim of the kernel sources is
correct that should be the same as vfork except without causing the
parent to wait for you.  Which I think would remove the need to reexec
yourself.

>> The file descriptor is stored in mm->exe_file.
>> Probably the most straight forward implementation is to allow
>> execveat(AT_EXE_FILE, ...).
>
> Cool, that works.
>
>> You can look at binfmt_misc for how to reopen an open file descriptor.
>
> Added to the todo heap.

Yes I don't think it would be a lot of code.

I think you might be better served with clone(CLONE_VM) as it doesn't
block so you don't need to feed yourself your context over a pipe.

Eric
Rob Landley May 22, 2020, 4:51 a.m. UTC | #8
On 5/21/20 10:28 PM, Eric W. Biederman wrote:
> 
> Rob Landley <rob@landley.net> writes:
> 
>> On 5/20/20 11:05 AM, Eric W. Biederman wrote:
> 
>> Toybox would _like_ proc mounted, but can't assume it. I'm writing a new
>> bash-compatible shell with nommu support, which means in order to do subshell
>> and background tasks if (!CONFIG_FORK) I need to create a pipe pair, vfork(),
>> have the child exec itself to unblock the parent, and then read the context data
>> that just got discarded through the pipe from the parent. ("Wheee." And you can
>> quote me on that.)
> 
> Do you have clone(CLONE_VM) ?  If my quick skim of the kernel sources is
> correct that should be the same as vfork except without causing the
> parent to wait for you.  Which I think would remove the need to reexec
> yourself.

As with perpetual motion, that only seems like it would work if you don't
understand what's going on.

A nommu system uses physical addresses, not virtual ones, so every process sees
the same addresses. So if I allocate a new block of memory and memcpy the
contents of the old one into the new one, any pointers in the copy point back
into the ORIGINAL block of memory. Trying to adjust the pointers in the copy is
the exact same problem as trying to do garbage collection in C: it's an AI
complete problem.

Any attempt to "implement a full fork" on nommu hits this problem: copying an
existing mapping to a new address range means any address values in the new
mapping point into the OLD mapping. Things like fdpic fix this up at exec time
(traversing elf tables and relocating), but not at runtime. If you can solve the
"relocate at runtime all addresses within an existing mapping, and all other
mappings that might point to this mapping, including local variables on the
stack that point to a structure member or halfway into a string rather than the
start of an allocation, without adjusting unrelated values coincidentally within
RANGE of a mapping" problem, THEN you can fork on a nommu system.

What vfork() does is pause the parent and have the child continue AS the parent
for a bit (with the system call returning 0). The child starts with all the same
memory mappings the parent has (usually not even a new stack). The child has a
new PID and new resources like its own file descriptor table so close() and
open() don't affect the parent, but if you change a global that's visible to the
parent when it resumes (ant often local variables too: don't return from the
function that called vfork() because if you DON'T have a new stack it'll stomp
the return address the parent needs when IT does it). If the child calls
malloc() the parent needs to free it because it's same heap (because same
mapping of the same physical memory).

Then when the child is ready to discard all those mappings (due to calling
either execve() or _exit(), those are the only two options), the parent resumes
from where it left off with the PID of the child as the system call return value.

The reason the child pauses the parent is so only one process is ever using
those mappings at a given time. Otherwise they're acting like threads without
locking, and usually both are sharing a stack.

P.S. You can use threads _instead_ of fork for some stuff on nommu, but that's
its own can of worms. You still need to vfork() when you do create a child
process you're going to exec, so it doesn't go away, you're just requiring
multiple techniques simultaneously to handle a special case.

P.P.S. vfork() is useful on mmu systems to solve the "don't fork from a thread"
problem. You can vfork() from a thread cheaply and reliably and it only pauses
the one thread you forked from, not every thread in the whole process. If you
fork() from a heavily threadded process you can cause a multi-milisecond latency
spike because even with an mmu the copy on write "keep track of what's shared by
what" generally can't handle the "threads AND processes sharing mappings" case,
so it just gives up and copies it all at fork time, in one go, holding a big
lock while doing so. This causes a large latency spike which vfork() avoids.
(And can cause a large wasteful allocation and memory dirtying which is
immediately freed.)

>>> The file descriptor is stored in mm->exe_file.
>>> Probably the most straight forward implementation is to allow
>>> execveat(AT_EXE_FILE, ...).
>>
>> Cool, that works.
>>
>>> You can look at binfmt_misc for how to reopen an open file descriptor.
>>
>> Added to the todo heap.
> 
> Yes I don't think it would be a lot of code.
> 
> I think you might be better served with clone(CLONE_VM) as it doesn't
> block so you don't need to feed yourself your context over a pipe.

Except that doesn't fix it.

Yes I could use threads instead, but the cure is worse than the disease and the
result is your shell background processes are threads rather than independent
processes (is $$ reporting PID or TID, I really don't want to go there).

> Eric

Rob
Eric W. Biederman May 22, 2020, 1:35 p.m. UTC | #9
Rob Landley <rob@landley.net> writes:

> On 5/21/20 10:28 PM, Eric W. Biederman wrote:
>> 
>> Rob Landley <rob@landley.net> writes:
>> 
>>> On 5/20/20 11:05 AM, Eric W. Biederman wrote:
>> 
>>>> The file descriptor is stored in mm->exe_file.
>>>> Probably the most straight forward implementation is to allow
>>>> execveat(AT_EXE_FILE, ...).
>>>
>>> Cool, that works.
>>>
>>>> You can look at binfmt_misc for how to reopen an open file descriptor.
>>>
>>> Added to the todo heap.
>> 
>> Yes I don't think it would be a lot of code.
>> 
>> I think you might be better served with clone(CLONE_VM) as it doesn't
>> block so you don't need to feed yourself your context over a pipe.
>
> Except that doesn't fix it.
>
> Yes I could use threads instead, but the cure is worse than the disease and the
> result is your shell background processes are threads rather than independent
> processes (is $$ reporting PID or TID, I really don't want to go
> there).

I was just suggesting clone(CLONE_VM) because it creates a thread in a
separate process.  Which on nommu sounds like it could be almost exactly
what you want.

If you need the separate copies of all of your global variables etc,
re-exec'ing your self could be the easier way to go.

Eric

Patch
diff mbox series

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 396d5c2e6b5e..441c85f04dfd 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -273,8 +273,8 @@  create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 		NEW_AUX_ENT(AT_BASE_PLATFORM,
 			    (elf_addr_t)(unsigned long)u_base_platform);
 	}
-	if (bprm->interp_flags & BINPRM_FLAGS_EXECFD) {
-		NEW_AUX_ENT(AT_EXECFD, bprm->interp_data);
+	if (bprm->have_execfd) {
+		NEW_AUX_ENT(AT_EXECFD, bprm->execfd);
 	}
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 896e3ca9bf85..2d5e9eb12075 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -628,10 +628,10 @@  static int create_elf_fdpic_tables(struct linux_binprm *bprm,
 			    (elf_addr_t) (unsigned long) u_base_platform);
 	}
 
-	if (bprm->interp_flags & BINPRM_FLAGS_EXECFD) {
+	if (bprm->have_execfd) {
 		nr = 0;
 		csp -= 2 * sizeof(unsigned long);
-		NEW_AUX_ENT(AT_EXECFD, bprm->interp_data);
+		NEW_AUX_ENT(AT_EXECFD, bprm->execfd);
 	}
 
 	nr = 0;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 50a73afdf9b7..ad2866f28f0c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -134,7 +134,6 @@  static int load_misc_binary(struct linux_binprm *bprm)
 	Node *fmt;
 	struct file *interp_file = NULL;
 	int retval;
-	int fd_binary = -1;
 
 	retval = -ENOEXEC;
 	if (!enabled)
@@ -161,29 +160,12 @@  static int load_misc_binary(struct linux_binprm *bprm)
 	}
 
 	if (fmt->flags & MISC_FMT_OPEN_BINARY) {
-
-		/* if the binary should be opened on behalf of the
-		 * interpreter than keep it open and assign descriptor
-		 * to it
-		 */
-		fd_binary = get_unused_fd_flags(0);
-		if (fd_binary < 0) {
-			retval = fd_binary;
-			goto ret;
-		}
-		fd_install(fd_binary, bprm->file);
-
-		/* if the binary is not readable than enforce mm->dumpable=0
-		   regardless of the interpreter's permissions */
-		would_dump(bprm, bprm->file);
+		/* Pass the open binary to the interpreter */
+		bprm->have_execfd = 1;
+		bprm->executable = bprm->file;
 
 		allow_write_access(bprm->file);
 		bprm->file = NULL;
-
-		/* mark the bprm that fd should be passed to interp */
-		bprm->interp_flags |= BINPRM_FLAGS_EXECFD;
-		bprm->interp_data = fd_binary;
-
 	} else {
 		allow_write_access(bprm->file);
 		fput(bprm->file);
@@ -192,19 +174,19 @@  static int load_misc_binary(struct linux_binprm *bprm)
 	/* make argv[1] be the path to the binary */
 	retval = copy_strings_kernel(1, &bprm->interp, bprm);
 	if (retval < 0)
-		goto error;
+		goto ret;
 	bprm->argc++;
 
 	/* add the interp as argv[0] */
 	retval = copy_strings_kernel(1, &fmt->interpreter, bprm);
 	if (retval < 0)
-		goto error;
+		goto ret;
 	bprm->argc++;
 
 	/* Update interp in case binfmt_script needs it. */
 	retval = bprm_change_interp(fmt->interpreter, bprm);
 	if (retval < 0)
-		goto error;
+		goto ret;
 
 	if (fmt->flags & MISC_FMT_OPEN_FILE) {
 		interp_file = file_clone_open(fmt->interp_file);
@@ -215,7 +197,7 @@  static int load_misc_binary(struct linux_binprm *bprm)
 	}
 	retval = PTR_ERR(interp_file);
 	if (IS_ERR(interp_file))
-		goto error;
+		goto ret;
 
 	bprm->file = interp_file;
 	if (fmt->flags & MISC_FMT_CREDENTIALS)
@@ -223,17 +205,11 @@  static int load_misc_binary(struct linux_binprm *bprm)
 
 	retval = search_binary_handler(bprm);
 	if (retval < 0)
-		goto error;
+		goto ret;
 
 ret:
 	dput(fmt->dentry);
 	return retval;
-error:
-	if (fd_binary > 0)
-		ksys_close(fd_binary);
-	bprm->interp_flags = 0;
-	bprm->interp_data = 0;
-	goto ret;
 }
 
 /* Command parsers */
diff --git a/fs/exec.c b/fs/exec.c
index 5fc458460e44..ca91393893ea 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1323,7 +1323,10 @@  int begin_new_exec(struct linux_binprm * bprm)
 	 */
 	set_mm_exe_file(bprm->mm, bprm->file);
 
+	/* If the binary is not readable than enforce mm->dumpable=0 */
 	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
 
 	/*
 	 * Release all of the old mmap stuff
@@ -1427,6 +1430,16 @@  int begin_new_exec(struct linux_binprm * bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+
+	/* Pass the opened binary to the interpreter. */
+	if (bprm->have_execfd) {
+		retval = get_unused_fd_flags(0);
+		if (retval < 0)
+			goto out_unlock;
+		fd_install(retval, bprm->executable);
+		bprm->executable = NULL;
+		bprm->execfd = retval;
+	}
 	return 0;
 
 out_unlock:
@@ -1516,6 +1529,8 @@  static void free_bprm(struct linux_binprm *bprm)
 		allow_write_access(bprm->file);
 		fput(bprm->file);
 	}
+	if (bprm->executable)
+		fput(bprm->executable);
 	/* If a binfmt changed the interp, free it. */
 	if (bprm->interp != bprm->filename)
 		kfree(bprm->interp);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 8c7779d6bf19..653508b25815 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -26,6 +26,9 @@  struct linux_binprm {
 	unsigned long p; /* current top of mem */
 	unsigned long argmin; /* rlimit marker for copy_strings() */
 	unsigned int
+		/* Should an execfd be passed to userspace? */
+		have_execfd:1,
+
 		/* It is safe to use the creds of a script (see binfmt_misc) */
 		preserve_creds:1,
 		/*
@@ -48,6 +51,7 @@  struct linux_binprm {
 	unsigned int taso:1;
 #endif
 	unsigned int recursion_depth; /* only for search_binary_handler() */
+	struct file * executable; /* Executable to pass to the interpreter */
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
@@ -58,7 +62,7 @@  struct linux_binprm {
 				   of the time same as filename, but could be
 				   different for binfmt_{misc,script} */
 	unsigned interp_flags;
-	unsigned interp_data;
+	int execfd;		/* File descriptor of the executable */
 	unsigned long loader, exec;
 
 	struct rlimit rlim_stack; /* Saved RLIMIT_STACK used during exec. */
@@ -69,10 +73,6 @@  struct linux_binprm {
 #define BINPRM_FLAGS_ENFORCE_NONDUMP_BIT 0
 #define BINPRM_FLAGS_ENFORCE_NONDUMP (1 << BINPRM_FLAGS_ENFORCE_NONDUMP_BIT)
 
-/* fd of the binary should be passed to the interpreter */
-#define BINPRM_FLAGS_EXECFD_BIT 1
-#define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT)
-
 /* filename of the binary will be inaccessible after exec */
 #define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2
 #define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT)