diff mbox series

[3/5] exec: Remove recursion from search_binary_handler

Message ID 87eerszyim.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series exec: Control flow simplifications | expand

Commit Message

Eric W. Biederman May 9, 2020, 7:41 p.m. UTC
Instead of recursing in search_binary_handler have the methods that
would recurse return a positive value, and simply loop in exec_binprm.

This is a trivial change as all of the methods that would recurse do
so as effectively the last thing they do.   Making this a trivial code
change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/kernel/binfmt_loader.c |  2 +-
 fs/binfmt_em86.c                  |  2 +-
 fs/binfmt_misc.c                  |  5 +----
 fs/binfmt_script.c                |  2 +-
 fs/exec.c                         | 20 +++++++++-----------
 include/linux/binfmts.h           |  2 --
 6 files changed, 13 insertions(+), 20 deletions(-)

Comments

Linus Torvalds May 9, 2020, 8:16 p.m. UTC | #1
On Sat, May 9, 2020 at 12:45 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Instead of recursing in search_binary_handler have the methods that
> would recurse return a positive value, and simply loop in exec_binprm.
>
> This is a trivial change as all of the methods that would recurse do
> so as effectively the last thing they do.   Making this a trivial code
> change.

Looks good.

I'd suggest doing that loop slightly differently:

> -       ret = search_binary_handler(bprm);
> +       do {
> +               depth++;
> +               ret = search_binary_handler(bprm);
> +               /* This allows 4 levels of binfmt rewrites before failing hard. */
> +               if ((ret > 0) && (depth > 5))
> +                       ret = -ELOOP;
> +       } while (ret > 0);
>          if (ret >= 0) {

That's really an odd way to write this.

So honestly, if "ret < 0", then we can just return directly.

So I think would make much more sense to do this loop something like

        for (depth = 0; depth < 5; depth++) {
                int ret;

                ret = search_binary_handler(bprm);
                if (ret < 0)
                        return ret;

                /* Continue searching for the next binary handler? */
                if (ret > 0)
                        continue;

                /* Success! */
                audit_bprm(bprm);
                trace_sched_process_exec(current, old_pid, bprm);
                ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
                proc_exec_connector(current);
                return 0;
        }
        return -ELOOP;

(if I read the logic of exec_binprm() right - I might have missed something).

                Linus
Tetsuo Handa May 10, 2020, 4:22 a.m. UTC | #2
On 2020/05/10 4:41, Eric W. Biederman wrote:
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -234,10 +234,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	if (retval < 0)
>  		goto error;
>  
> -	retval = search_binary_handler(bprm);
> -	if (retval < 0)
> -		goto error;
> -
> +	retval = 1; /* Search for the interpreter */
>  ret:
>  	dput(fmt->dentry);
>  	return retval;

Wouldn't this change cause

	if (fd_binary > 0)
		ksys_close(fd_binary);
	bprm->interp_flags = 0;
	bprm->interp_data = 0;

not to be called when "Search for the interpreter" failed?
Linus Torvalds May 10, 2020, 7:38 p.m. UTC | #3
On Sat, May 9, 2020 at 9:30 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Wouldn't this change cause
>
>         if (fd_binary > 0)
>                 ksys_close(fd_binary);
>         bprm->interp_flags = 0;
>         bprm->interp_data = 0;
>
> not to be called when "Search for the interpreter" failed?

Good catch. We seem to have some subtle magic wrt the fd_binary file
descriptor, which depends on the recursive behavior.

I'm not seeing how to fix it cleanly with the "turn it into a loop".
Basically, that binfmt_misc use-case isn't really a tail-call.

Eric, ideas?

                 Linus
Eric W. Biederman May 11, 2020, 2:33 p.m. UTC | #4
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, May 9, 2020 at 9:30 PM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> Wouldn't this change cause
>>
>>         if (fd_binary > 0)
>>                 ksys_close(fd_binary);
>>         bprm->interp_flags = 0;
>>         bprm->interp_data = 0;
>>
>> not to be called when "Search for the interpreter" failed?
>
> Good catch. We seem to have some subtle magic wrt the fd_binary file
> descriptor, which depends on the recursive behavior.

Yes.  I Tetsuo I really appreciate you noticing this.  This is exactly
the kind of behavior I am trying to flush out and keep from being
hidden.

> I'm not seeing how to fix it cleanly with the "turn it into a loop".
> Basically, that binfmt_misc use-case isn't really a tail-call.

I have reservations about installing a new file descriptor before
we process the close on exec logic and the related security modules
closing file descriptors that your new credentials no longer give
you access to logic.

I haven't yet figured out how opening a file descriptor during exec
should fit into all of that.



What I do see is that interp_data is just a parameter that is smuggled
into the call of search binary handler.  And the next binary handler
needs to be binfmt_elf for it to make much sense, as only binfmt_elf
(and binfmt_elf_fdpic) deals with BINPRM_FLAGS_EXECFD.


So I think what needs to happen is to rename bprm->interp_data to
bprm->execfd, remove BINPRM_FLAGS_EXECFD and make closing that file
descriptor free_bprm's responsiblity.

I hope such a change will make it easier to see all of the pieces that
are intereacting during exec.

I am still asking: is the installation of that file descriptor useful if
it is not exported passed to userspace as an AT_EXECFD note?


I will dig in and see what I can come up with.

Eric
Rob Landley May 11, 2020, 7:10 p.m. UTC | #5
On 5/11/20 9:33 AM, Eric W. Biederman wrote:
> What I do see is that interp_data is just a parameter that is smuggled
> into the call of search binary handler.  And the next binary handler
> needs to be binfmt_elf for it to make much sense, as only binfmt_elf
> (and binfmt_elf_fdpic) deals with BINPRM_FLAGS_EXECFD.

The binfmt_elf_fdpic driver is separate from binfmt_elf for the same reason
ext2/ext3/ext4 used to have 3 drivers: fdpic is really just binfmt_elf with the
4 main sections (text, data, bss, rodata) able to move independently of each
other (each tracked with its own base pointer).

It's kind of -fPIE on steroids, and various security people have sniffed at it
over the years to give ASLR more degrees of freedom on with-MMU systems. Many
moons ago Rich Felker proposed teaching the fdpic loader how to load normal ELF
binaries so there's just the one loader (there's a flag in the ELF header to say
whether the sections are independent or not).

Rob
Kees Cook May 11, 2020, 9:55 p.m. UTC | #6
On Mon, May 11, 2020 at 09:33:21AM -0500, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Sat, May 9, 2020 at 9:30 PM Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>
> >> Wouldn't this change cause
> >>
> >>         if (fd_binary > 0)
> >>                 ksys_close(fd_binary);
> >>         bprm->interp_flags = 0;
> >>         bprm->interp_data = 0;
> >>
> >> not to be called when "Search for the interpreter" failed?
> >
> > Good catch. We seem to have some subtle magic wrt the fd_binary file
> > descriptor, which depends on the recursive behavior.
> 
> Yes.  I Tetsuo I really appreciate you noticing this.  This is exactly
> the kind of behavior I am trying to flush out and keep from being
> hidden.
> 
> > I'm not seeing how to fix it cleanly with the "turn it into a loop".
> > Basically, that binfmt_misc use-case isn't really a tail-call.
> 
> I have reservations about installing a new file descriptor before
> we process the close on exec logic and the related security modules
> closing file descriptors that your new credentials no longer give
> you access to logic.

Hm, this does feel odd. In looking at this, it seems like this file
never gets close-on-exec set, and doesn't have its flags changed from
its original open:
                .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
only the UMH path through exec doesn't explicitly open a file by name
from what I can see, so we'll only have these flags.

> I haven't yet figured out how opening a file descriptor during exec
> should fit into all of that.
> 
> What I do see is that interp_data is just a parameter that is smuggled
> into the call of search binary handler.  And the next binary handler
> needs to be binfmt_elf for it to make much sense, as only binfmt_elf
> (and binfmt_elf_fdpic) deals with BINPRM_FLAGS_EXECFD.
> 
> So I think what needs to happen is to rename bprm->interp_data to
> bprm->execfd, remove BINPRM_FLAGS_EXECFD and make closing that file
> descriptor free_bprm's responsiblity.

Yeah, I would agree. As far as the close handling, I don't think there
is a difference here: it interp_data was closed on the binfmt_misc.c
error path, and in the new world it would be the exec error path -- both
would be under the original credentials.

> I hope such a change will make it easier to see all of the pieces that
> are intereacting during exec.

Right -- I'm not sure which piece should "consume" bprm->execfd though,
which I think is what you're asking next...

> I am still asking: is the installation of that file descriptor useful if
> it is not exported passed to userspace as an AT_EXECFD note?
> 
> I will dig in and see what I can come up with.

Should binfmt_misc do the install, or can the consuming binfmt do it?
i.e. when binfmt_elf sees bprm->execfd, does it perform the install
instead?
Eric W. Biederman May 12, 2020, 6:42 p.m. UTC | #7
Kees Cook <keescook@chromium.org> writes:

> On Mon, May 11, 2020 at 09:33:21AM -0500, Eric W. Biederman wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> 
>> > On Sat, May 9, 2020 at 9:30 PM Tetsuo Handa
>> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> >>
>> >> Wouldn't this change cause
>> >>
>> >>         if (fd_binary > 0)
>> >>                 ksys_close(fd_binary);
>> >>         bprm->interp_flags = 0;
>> >>         bprm->interp_data = 0;
>> >>
>> >> not to be called when "Search for the interpreter" failed?
>> >
>> > Good catch. We seem to have some subtle magic wrt the fd_binary file
>> > descriptor, which depends on the recursive behavior.
>> 
>> Yes.  I Tetsuo I really appreciate you noticing this.  This is exactly
>> the kind of behavior I am trying to flush out and keep from being
>> hidden.
>> 
>> > I'm not seeing how to fix it cleanly with the "turn it into a loop".
>> > Basically, that binfmt_misc use-case isn't really a tail-call.
>> 
>> I have reservations about installing a new file descriptor before
>> we process the close on exec logic and the related security modules
>> closing file descriptors that your new credentials no longer give
>> you access to logic.
>
> Hm, this does feel odd. In looking at this, it seems like this file
> never gets close-on-exec set, and doesn't have its flags changed from
> its original open:
>                 .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> only the UMH path through exec doesn't explicitly open a file by name
> from what I can see, so we'll only have these flags.
>
>> I haven't yet figured out how opening a file descriptor during exec
>> should fit into all of that.
>> 
>> What I do see is that interp_data is just a parameter that is smuggled
>> into the call of search binary handler.  And the next binary handler
>> needs to be binfmt_elf for it to make much sense, as only binfmt_elf
>> (and binfmt_elf_fdpic) deals with BINPRM_FLAGS_EXECFD.
>> 
>> So I think what needs to happen is to rename bprm->interp_data to
>> bprm->execfd, remove BINPRM_FLAGS_EXECFD and make closing that file
>> descriptor free_bprm's responsiblity.
>
> Yeah, I would agree. As far as the close handling, I don't think there
> is a difference here: it interp_data was closed on the binfmt_misc.c
> error path, and in the new world it would be the exec error path -- both
> would be under the original credentials.
>
>> I hope such a change will make it easier to see all of the pieces that
>> are intereacting during exec.
>
> Right -- I'm not sure which piece should "consume" bprm->execfd though,
> which I think is what you're asking next...
>
>> I am still asking: is the installation of that file descriptor useful if
>> it is not exported passed to userspace as an AT_EXECFD note?
>> 
>> I will dig in and see what I can come up with.
>
> Should binfmt_misc do the install, or can the consuming binfmt do it?
> i.e. when binfmt_elf sees bprm->execfd, does it perform the install
> instead?

I am still thinking about this one, but here is where I am at.  At a
practical level passing the file descriptor of the script to interpreter
seems like something we should encourage in the long term.  It removes
races and it is cheaper because then the interpreter does not have to
turn around and open the script itself.



Strictly speaking binfmt_misc should not need to close the file
descriptor in binfmt_misc because we have already unshared the files
struct and reset_files_struct should handle restoring it.

Calling fd_install in binfmt_misc still seems wrong, as that exposes
the new file descriptor to user space with the old creds.

It is possible although unlikely for userspace to find the file
descriptor without consulting AT_EXECFD so just to be conservative I
think we should install the file descriptor in begin_new_exec even if
the next interpreter does not support AT_EXECFD.


I am still working on how to handle recursive binfmts but I suspect it
is just a matter of having an array of struct files in struct
linux_binprm.


Eric
Kees Cook May 12, 2020, 7:25 p.m. UTC | #8
On Tue, May 12, 2020 at 01:42:53PM -0500, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
> > Should binfmt_misc do the install, or can the consuming binfmt do it?
> > i.e. when binfmt_elf sees bprm->execfd, does it perform the install
> > instead?
> 
> I am still thinking about this one, but here is where I am at.  At a
> practical level passing the file descriptor of the script to interpreter
> seems like something we should encourage in the long term.  It removes
> races and it is cheaper because then the interpreter does not have to
> turn around and open the script itself.

Yeah, this does sounds pretty good, though I have concerns about doing
it for a process that isn't expecting it. I've seen a lot of bad code
make assumptions about initial fd numbers. :(

> Strictly speaking binfmt_misc should not need to close the file
> descriptor in binfmt_misc because we have already unshared the files
> struct and reset_files_struct should handle restoring it.

If I get what you mean, I agree. The error case is fine.

> Calling fd_install in binfmt_misc still seems wrong, as that exposes
> the new file descriptor to user space with the old creds.

I haven't dug into the details here -- is there a real risk here? The
old creds are what opened the file originally for the exec. Are you
thinking about executable-but-not-readable files?

> It is possible although unlikely for userspace to find the file
> descriptor without consulting AT_EXECFD so just to be conservative I
> think we should install the file descriptor in begin_new_exec even if
> the next interpreter does not support AT_EXECFD.

I think universally installing the fd needs to be a distinct patch --
it's going to have a lot of consequences, IMO. We can certainly deal
with them, but I don't think it should be part of this clean-up series.

> I am still working on how to handle recursive binfmts but I suspect it
> is just a matter of having an array of struct files in struct
> linux_binprm.

If install is left if binfmt_misc, then the recursive problem goes away,
yes?
Eric W. Biederman May 12, 2020, 8:31 p.m. UTC | #9
Kees Cook <keescook@chromium.org> writes:

> On Tue, May 12, 2020 at 01:42:53PM -0500, Eric W. Biederman wrote:
>> Kees Cook <keescook@chromium.org> writes:
>> > Should binfmt_misc do the install, or can the consuming binfmt do it?
>> > i.e. when binfmt_elf sees bprm->execfd, does it perform the install
>> > instead?
>> 
>> I am still thinking about this one, but here is where I am at.  At a
>> practical level passing the file descriptor of the script to interpreter
>> seems like something we should encourage in the long term.  It removes
>> races and it is cheaper because then the interpreter does not have to
>> turn around and open the script itself.
>
> Yeah, this does sounds pretty good, though I have concerns about doing
> it for a process that isn't expecting it. I've seen a lot of bad code
> make assumptions about initial fd numbers. :(

Yes.  That is definitely a concern.

>> Strictly speaking binfmt_misc should not need to close the file
>> descriptor in binfmt_misc because we have already unshared the files
>> struct and reset_files_struct should handle restoring it.
>
> If I get what you mean, I agree. The error case is fine.
>
>> Calling fd_install in binfmt_misc still seems wrong, as that exposes
>> the new file descriptor to user space with the old creds.
>
> I haven't dug into the details here -- is there a real risk here? The
> old creds are what opened the file originally for the exec. Are you
> thinking about executable-but-not-readable files?

I am thinking about looking in proc/<pid>/fd and maybe opening those
files.  That access is gated by ptrace_may_access which is gated
by the process credentials. So I know strictly speaking it is wrong.

I think you are correct that it would only allow access to a file that
could be accessed another way.  Even execveat at a quick glance appears
to go through the orinary permission checks of open.

The current code is definitely a maintenance pitfall as it install state
into the process early.

>> It is possible although unlikely for userspace to find the file
>> descriptor without consulting AT_EXECFD so just to be conservative I
>> think we should install the file descriptor in begin_new_exec even if
>> the next interpreter does not support AT_EXECFD.
>
> I think universally installing the fd needs to be a distinct patch --
> it's going to have a lot of consequences, IMO. We can certainly deal
> with them, but I don't think it should be part of this clean-up series.

I meant generically installing the fd not universally installing it.

>> I am still working on how to handle recursive binfmts but I suspect it
>> is just a matter of having an array of struct files in struct
>> linux_binprm.
>
> If install is left if binfmt_misc, then the recursive problem goes away,
> yes?

I don't think leaving the install in binfmt_misc is responsible at this
point.

Eric
Kees Cook May 12, 2020, 11:08 p.m. UTC | #10
On Tue, May 12, 2020 at 03:31:57PM -0500, Eric W. Biederman wrote:
> >> It is possible although unlikely for userspace to find the file
> >> descriptor without consulting AT_EXECFD so just to be conservative I
> >> think we should install the file descriptor in begin_new_exec even if
> >> the next interpreter does not support AT_EXECFD.
> >
> > I think universally installing the fd needs to be a distinct patch --
> > it's going to have a lot of consequences, IMO. We can certainly deal
> > with them, but I don't think it should be part of this clean-up series.
> 
> I meant generically installing the fd not universally installing it.
> 
> >> I am still working on how to handle recursive binfmts but I suspect it
> >> is just a matter of having an array of struct files in struct
> >> linux_binprm.
> >
> > If install is left if binfmt_misc, then the recursive problem goes away,
> > yes?
> 
> I don't think leaving the install in binfmt_misc is responsible at this
> point.

I'm nearly certain the answer is "yes", but I wonder if we should stop
for a moment and ask "does anything still use MISC_FMT_OPEN_BINARY ? It
looks like either "O" or "C" binfmt_misc registration flag. My installed
binfmts on Ubuntu don't use them...

I'm currently pulling a list of all the packages in Debian than depend
on the binfmt-support package and checking their flags.
Kees Cook May 12, 2020, 11:47 p.m. UTC | #11
On Tue, May 12, 2020 at 04:08:56PM -0700, Kees Cook wrote:
> I'm nearly certain the answer is "yes", but I wonder if we should stop
> for a moment and ask "does anything still use MISC_FMT_OPEN_BINARY ? It
> looks like either "O" or "C" binfmt_misc registration flag. My installed
> binfmts on Ubuntu don't use them...
> 
> I'm currently pulling a list of all the packages in Debian than depend
> on the binfmt-support package and checking their flags.

So, binfmt-support in Debian doesn't in _support_ MISC_FMT_OPEN_BINARY
("O"):


        credentials =
                (binfmt->credentials && !strcmp (binfmt->credentials, "yes"))
                ? "C" : "";
        preserve = (binfmt->preserve && !strcmp (binfmt->preserve, "yes"))
                ? "P" : "";
        fix_binary =
                (binfmt->fix_binary && !strcmp (binfmt->fix_binary, "yes"))
                ? "F" : "";
...
        regstring = xasprintf (":%s:%c:%s:%s:%s:%s:%s%s%s\n",
                               name, type, binfmt->offset, binfmt->magic,
                               binfmt->mask, interpreter,
                               credentials, preserve, fix_binary);

However, "credentials" ("C") does imply MISC_FMT_OPEN_BINARY.


I looked at every Debian package using binfmt-support, and "only" qemu
uses "credential".

And now I wonder if qemu actually uses the resulting AT_EXECFD ...
Kees Cook May 12, 2020, 11:51 p.m. UTC | #12
On Tue, May 12, 2020 at 04:47:14PM -0700, Kees Cook wrote:
> And now I wonder if qemu actually uses the resulting AT_EXECFD ...

It does, though I'm not sure if this is to support crossing mount points,
dropping privileges, or something else, since it does fall back to just
trying to open the file.

    execfd = qemu_getauxval(AT_EXECFD);
    if (execfd == 0) {
        execfd = open(filename, O_RDONLY);
        if (execfd < 0) {
            printf("Error while loading %s: %s\n", filename, strerror(errno));
            _exit(EXIT_FAILURE);
        }
    }
Linus Torvalds May 13, 2020, 12:20 a.m. UTC | #13
On Tue, May 12, 2020 at 11:46 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I am still thinking about this one, but here is where I am at.  At a
> practical level passing the file descriptor of the script to interpreter
> seems like something we should encourage in the long term.  It removes
> races and it is cheaper because then the interpreter does not have to
> turn around and open the script itself.

Yeah, I think we should continue to support it, because I think it's
the right thing to do (and we might just end up having compatibility
issues if we don't).

How about trying to move the logic to the common code, out of binfmt_misc?

IOW, how about something very similar to your "brpm->preserve_creds"
thing that you did for the credentials (also for binfmt_misc, which
shouldn't surprise anybody: binfmt_misc is simply the "this is the
generic thing for letting user mode do the final details").

> Calling fd_install in binfmt_misc still seems wrong, as that exposes
> the new file descriptor to user space with the old creds.

Right.  And it really would be good to simply not have these kinds of
very special cases inside the low-level binfmt code: I'd much rather
have the special cases in the generic code, so that we see what the
ordering is etc. One of the big problems with all these binfmt
callbacks has been the fact that it makes it so hard to think about
and change the generic code, because the low-level binfmt handlers all
do their own special thing.

So moving it to generic code would likely simplify things from that
angle, even if the actual complexity of the feature itself remains.

Besides, we really have exposed this to other code anyway thanks to
that whole bprm->interp_data thing, and the AT_EXECFD AUX entries that
we have. So it's not really "internal" to binfmt_misc _anyway_.

So how about we just move the fd_binary logic to the generic execve
code, and just binfmt_misc set the flag for "yes, please do this",
exactly like "preserve_creds"?

> It is possible although unlikely for userspace to find the file
> descriptor without consulting AT_EXECFD so just to be conservative I
> think we should install the file descriptor in begin_new_exec even if
> the next interpreter does not support AT_EXECFD.

Ack. I think the AT_EXECFD thing is a sign that this isn't internal to
binfmt_misc, but it also shouldn't be gating this issue. In reality,
ELF is the only real binary format that matters - the script/misc
binfmts are just indirection entries - and it supports AT_EXECFD, so
let's just ignore the theoretical case of "maybe nobody exposes it".

So yes, just make it part of begin_new_exec(), and there's no reason
to support more than a single fd. No stacks or arrays of these things
required, I feel. It's not like AT_EXECFD supports the notion of
multiple fd's being reported anyway, nor does it make any sense to
have some kind of nested misc->misc binfmt nesting.

So making that whole interp_data and fd_binary thing be a generic
layer thing would make the search_binary_handler() code in binfmt_misc
be a pure tailcall too, and then the conversion to a loop ends up
working and being the right thing.

No?

                Linus
Rob Landley May 13, 2020, 2:39 a.m. UTC | #14
On 5/12/20 7:20 PM, Linus Torvalds wrote:
> On Tue, May 12, 2020 at 11:46 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> I am still thinking about this one, but here is where I am at.  At a
>> practical level passing the file descriptor of the script to interpreter
>> seems like something we should encourage in the long term.  It removes
>> races and it is cheaper because then the interpreter does not have to
>> turn around and open the script itself.
> 
> Yeah, I think we should continue to support it, because I think it's
> the right thing to do (and we might just end up having compatibility
> issues if we don't).
...
>> It is possible although unlikely for userspace to find the file
>> descriptor without consulting AT_EXECFD so just to be conservative I
>> think we should install the file descriptor in begin_new_exec even if
>> the next interpreter does not support AT_EXECFD.
> 
> Ack. I think the AT_EXECFD thing is a sign that this isn't internal to
> binfmt_misc, but it also shouldn't be gating this issue. In reality,
> ELF is the only real binary format that matters - the script/misc
> binfmts are just indirection entries - and it supports AT_EXECFD, so
> let's just ignore the theoretical case of "maybe nobody exposes it".

Would this potentially make the re-exec-yourself case easier to do at some
point? (Which nommu needs to do, and /proc/self/exe isn't always available.)

Here's the first time I asked about that:

https://lore.kernel.org/lkml/200612261823.07927.rob@landley.net/

Here's the most recent:

https://lkml.org/lkml/2017/9/5/246

Here's someone else asking and being basically told "chroot isn't a thing":

http://lkml.iu.edu/hypermail/linux/kernel/0906.3/00584.html

(See also "CVE-2019-5736" and the workarounds thereto.)

Rob

P.S. Yes I'm aware it would only work properly with static binaries. Not the
first thing that's true for.
Linus Torvalds May 13, 2020, 7:51 p.m. UTC | #15
On Tue, May 12, 2020 at 7:32 PM Rob Landley <rob@landley.net> wrote:
>
> On 5/12/20 7:20 PM, Linus Torvalds wrote:
> > Ack. I think the AT_EXECFD thing is a sign that this isn't internal to
> > binfmt_misc, but it also shouldn't be gating this issue. In reality,
> > ELF is the only real binary format that matters - the script/misc
> > binfmts are just indirection entries - and it supports AT_EXECFD, so
> > let's just ignore the theoretical case of "maybe nobody exposes it".
>
> Would this potentially make the re-exec-yourself case easier to do at some
> point? (Which nommu needs to do, and /proc/self/exe isn't always available.)

AT_EXECFD may be an ELF thing, but normal ELF binaries don't do that
"we have a fd". So it only triggers for binfmt_misc (and only when the
flag is set for "I want the fd").

So no, this wouldn't help re-exec-yourself in general.

Although I guess we could add an ELF section note that does that whole
"executable fd" thing for other things too.

Everything is possible in theory..

               Linus
Eric W. Biederman May 13, 2020, 9:59 p.m. UTC | #16
Rob Landley <rob@landley.net> writes:

> On 5/11/20 9:33 AM, Eric W. Biederman wrote:
>> What I do see is that interp_data is just a parameter that is smuggled
>> into the call of search binary handler.  And the next binary handler
>> needs to be binfmt_elf for it to make much sense, as only binfmt_elf
>> (and binfmt_elf_fdpic) deals with BINPRM_FLAGS_EXECFD.
>
> The binfmt_elf_fdpic driver is separate from binfmt_elf for the same reason
> ext2/ext3/ext4 used to have 3 drivers: fdpic is really just binfmt_elf with the
> 4 main sections (text, data, bss, rodata) able to move independently of each
> other (each tracked with its own base pointer).
>
> It's kind of -fPIE on steroids, and various security people have sniffed at it
> over the years to give ASLR more degrees of freedom on with-MMU systems. Many
> moons ago Rich Felker proposed teaching the fdpic loader how to load normal ELF
> binaries so there's just the one loader (there's a flag in the ELF header to say
> whether the sections are independent or not).

Careful with your terminology.  ELF sections are for .o's For
executables ELF have segments.  And reading through the code it is the
program segments that are independently relocatable.

There is a flag but it is defined per architecture and I don't think one
of the architectures define it.

I looked at ARM and apparently with an MMU ARM turns fdpic binaries into
PIE executables.  I am not certain why.

The registers passed to the entry point are also different for both
cases.

I think it would have been nice if the fdpic support had used a
different ELF type, instead of a different depending on using a
different architecture.

All that aside the core dumping code looks to be essentially the same
between binfmt_elf.c and binfmt_elf_fdpic.c.  Do you think people would
be interested in refactoring binfmt_elf.c and binfmt_elf_fdpic.c so that
they could share the same core dumping code?

Eric
Eric W. Biederman May 14, 2020, 2:56 p.m. UTC | #17
Kees Cook <keescook@chromium.org> writes:

> On Tue, May 12, 2020 at 04:47:14PM -0700, Kees Cook wrote:
>> And now I wonder if qemu actually uses the resulting AT_EXECFD ...
>
> It does, though I'm not sure if this is to support crossing mount points,
> dropping privileges, or something else, since it does fall back to just
> trying to open the file.
>
>     execfd = qemu_getauxval(AT_EXECFD);
>     if (execfd == 0) {
>         execfd = open(filename, O_RDONLY);
>         if (execfd < 0) {
>             printf("Error while loading %s: %s\n", filename, strerror(errno));
>             _exit(EXIT_FAILURE);
>         }
>     }

My hunch is that the fallback exists from a time when the kernel did not
implement AT_EXECFD, or so that qemu can run on kernels that don't
implement AT_EXECFD.  It doesn't really matter unless the executable is
suid, or otherwise changes privileges.


I looked into this a bit to remind myself why exec works the way it
works, with changing privileges.

The classic attack is pointing a symlink at a #! script that is suid or
otherwise changes privileges.  The kernel will open the script and set
the privileges, read the interpreter from the first line, and proceed to
exec the interpreter.  The interpreter will then open the script using
the pathname supplied by the kernel.  The name of the symlink.
Before the interpreter reopens the script the attack would replace
the symlink with a script that does something else, but gets to run
with the privileges of the script.


Defending against that time of check vs time of use attack is why
bprm_fill_uid, and cap_bprm_set_creds use the credentials derived from
the interpreter instead of the credentials derived from the script.


The other defense is to replace the pathname of the executable that the
intepreter will open with /dev/fd/N.

All of this predates Linux entirely.  I do remember this was fixed at
some point in Linux but I don't remember the details.  I can just read
the solution that was picked in the code.



All of this makes me wonder how are the LSMs protected against this
attack.

Let's see the following LSMS implement brpm_set_creds:
tomoyo   - Abuses bprm_set_creds to call tomoyo_load_policy [ safe ]
smack    - Requires CAP_MAC_ADMIN to smack setxattrs        [ vulnerable? ]
           Uses those xattrs in smack_bprm_set_creds
apparmor - Everything is based on names so the symlink      [ safe? ]
           attack won't work as it has the wrong name.
           As long as the trusted names can't be renamed
           apparmor appears good.
selinux  - Appears to let anyone set selinux xattrs         [ safe? ]
           Requires permission for a sid transfer
           As the attack appears not to allow anything that
           would not be allowed anyway it looks like selinux
           is safe.

LSM folks, especially Casey am I reading this correctly?  Did I
correctly infer how your LSMs deal with the time of check to time of use
attack on the script name?

Eric
Eric W. Biederman May 14, 2020, 4:49 p.m. UTC | #18
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, May 12, 2020 at 11:46 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> I am still thinking about this one, but here is where I am at.  At a
>> practical level passing the file descriptor of the script to interpreter
>> seems like something we should encourage in the long term.  It removes
>> races and it is cheaper because then the interpreter does not have to
>> turn around and open the script itself.
>
> Yeah, I think we should continue to support it, because I think it's
> the right thing to do (and we might just end up having compatibility
> issues if we don't).
>
> How about trying to move the logic to the common code, out of binfmt_misc?
>
> IOW, how about something very similar to your "brpm->preserve_creds"
> thing that you did for the credentials (also for binfmt_misc, which
> shouldn't surprise anybody: binfmt_misc is simply the "this is the
> generic thing for letting user mode do the final details").
>
>> Calling fd_install in binfmt_misc still seems wrong, as that exposes
>> the new file descriptor to user space with the old creds.
>
> Right.  And it really would be good to simply not have these kinds of
> very special cases inside the low-level binfmt code: I'd much rather
> have the special cases in the generic code, so that we see what the
> ordering is etc. One of the big problems with all these binfmt
> callbacks has been the fact that it makes it so hard to think about
> and change the generic code, because the low-level binfmt handlers all
> do their own special thing.
>
> So moving it to generic code would likely simplify things from that
> angle, even if the actual complexity of the feature itself remains.
>
> Besides, we really have exposed this to other code anyway thanks to
> that whole bprm->interp_data thing, and the AT_EXECFD AUX entries that
> we have. So it's not really "internal" to binfmt_misc _anyway_.
>
> So how about we just move the fd_binary logic to the generic execve
> code, and just binfmt_misc set the flag for "yes, please do this",
> exactly like "preserve_creds"?
>
>> It is possible although unlikely for userspace to find the file
>> descriptor without consulting AT_EXECFD so just to be conservative I
>> think we should install the file descriptor in begin_new_exec even if
>> the next interpreter does not support AT_EXECFD.
>
> Ack. I think the AT_EXECFD thing is a sign that this isn't internal to
> binfmt_misc, but it also shouldn't be gating this issue. In reality,
> ELF is the only real binary format that matters - the script/misc
> binfmts are just indirection entries - and it supports AT_EXECFD, so
> let's just ignore the theoretical case of "maybe nobody exposes it".
>
> So yes, just make it part of begin_new_exec(), and there's no reason
> to support more than a single fd. No stacks or arrays of these things
> required, I feel. It's not like AT_EXECFD supports the notion of
> multiple fd's being reported anyway, nor does it make any sense to
> have some kind of nested misc->misc binfmt nesting.
>
> So making that whole interp_data and fd_binary thing be a generic
> layer thing would make the search_binary_handler() code in binfmt_misc
> be a pure tailcall too, and then the conversion to a loop ends up
> working and being the right thing.

That is pretty much what I have been thinking.  I have just been taking
it slow so I find as many funny corner cases as I can.

Nothing ever clears the BINPRM_FLAGS_EXECFD so the current code can
not support nesting.

Now I do think a nested misc->misc binfmt thing can make sense in
principal.  I have an old dos spectrum emulator that I use to play some
of the games that I grew up with.  Running that emulator makes me two
emulators deep.  I can also imagine writting a domain specific language
in python or perl, and setting things up so scripts in the domain
specific language can be run directly.

So I think I need to deliberately test and prevent a nested misc->misc,
just so data structures don't get stomped.  If the cases where it could
useful prove sufficiently interesting we can enable them later.

Eric
Casey Schaufler May 14, 2020, 4:56 p.m. UTC | #19
On 5/14/2020 7:56 AM, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Tue, May 12, 2020 at 04:47:14PM -0700, Kees Cook wrote:
>>> And now I wonder if qemu actually uses the resulting AT_EXECFD ...
>> It does, though I'm not sure if this is to support crossing mount points,
>> dropping privileges, or something else, since it does fall back to just
>> trying to open the file.
>>
>>     execfd = qemu_getauxval(AT_EXECFD);
>>     if (execfd == 0) {
>>         execfd = open(filename, O_RDONLY);
>>         if (execfd < 0) {
>>             printf("Error while loading %s: %s\n", filename, strerror(errno));
>>             _exit(EXIT_FAILURE);
>>         }
>>     }
> My hunch is that the fallback exists from a time when the kernel did not
> implement AT_EXECFD, or so that qemu can run on kernels that don't
> implement AT_EXECFD.  It doesn't really matter unless the executable is
> suid, or otherwise changes privileges.
>
>
> I looked into this a bit to remind myself why exec works the way it
> works, with changing privileges.
>
> The classic attack is pointing a symlink at a #! script that is suid or
> otherwise changes privileges.  The kernel will open the script and set
> the privileges, read the interpreter from the first line, and proceed to
> exec the interpreter.  The interpreter will then open the script using
> the pathname supplied by the kernel.  The name of the symlink.
> Before the interpreter reopens the script the attack would replace
> the symlink with a script that does something else, but gets to run
> with the privileges of the script.
>
>
> Defending against that time of check vs time of use attack is why
> bprm_fill_uid, and cap_bprm_set_creds use the credentials derived from
> the interpreter instead of the credentials derived from the script.
>
>
> The other defense is to replace the pathname of the executable that the
> intepreter will open with /dev/fd/N.
>
> All of this predates Linux entirely.  I do remember this was fixed at
> some point in Linux but I don't remember the details.  I can just read
> the solution that was picked in the code.
>
>
>
> All of this makes me wonder how are the LSMs protected against this
> attack.
>
> Let's see the following LSMS implement brpm_set_creds:
> tomoyo   - Abuses bprm_set_creds to call tomoyo_load_policy [ safe ]
> smack    - Requires CAP_MAC_ADMIN to smack setxattrs        [ vulnerable? ]
>            Uses those xattrs in smack_bprm_set_creds

What is the concern? If the xattrs change after the check,
the behavior should still be consistent. 

> apparmor - Everything is based on names so the symlink      [ safe? ]
>            attack won't work as it has the wrong name.
>            As long as the trusted names can't be renamed
>            apparmor appears good.
> selinux  - Appears to let anyone set selinux xattrs         [ safe? ]
>            Requires permission for a sid transfer
>            As the attack appears not to allow anything that
>            would not be allowed anyway it looks like selinux
>            is safe.
>
> LSM folks, especially Casey am I reading this correctly?  Did I
> correctly infer how your LSMs deal with the time of check to time of use
> attack on the script name?
>
> Eric
>
Eric W. Biederman May 14, 2020, 5:02 p.m. UTC | #20
Casey Schaufler <casey@schaufler-ca.com> writes:

> On 5/14/2020 7:56 AM, Eric W. Biederman wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>
>>> On Tue, May 12, 2020 at 04:47:14PM -0700, Kees Cook wrote:
>>>> And now I wonder if qemu actually uses the resulting AT_EXECFD ...
>>> It does, though I'm not sure if this is to support crossing mount points,
>>> dropping privileges, or something else, since it does fall back to just
>>> trying to open the file.
>>>
>>>     execfd = qemu_getauxval(AT_EXECFD);
>>>     if (execfd == 0) {
>>>         execfd = open(filename, O_RDONLY);
>>>         if (execfd < 0) {
>>>             printf("Error while loading %s: %s\n", filename, strerror(errno));
>>>             _exit(EXIT_FAILURE);
>>>         }
>>>     }
>> My hunch is that the fallback exists from a time when the kernel did not
>> implement AT_EXECFD, or so that qemu can run on kernels that don't
>> implement AT_EXECFD.  It doesn't really matter unless the executable is
>> suid, or otherwise changes privileges.
>>
>>
>> I looked into this a bit to remind myself why exec works the way it
>> works, with changing privileges.
>>
>> The classic attack is pointing a symlink at a #! script that is suid or
>> otherwise changes privileges.  The kernel will open the script and set
>> the privileges, read the interpreter from the first line, and proceed to
>> exec the interpreter.  The interpreter will then open the script using
>> the pathname supplied by the kernel.  The name of the symlink.
>> Before the interpreter reopens the script the attack would replace
>> the symlink with a script that does something else, but gets to run
>> with the privileges of the script.
>>
>>
>> Defending against that time of check vs time of use attack is why
>> bprm_fill_uid, and cap_bprm_set_creds use the credentials derived from
>> the interpreter instead of the credentials derived from the script.
>>
>>
>> The other defense is to replace the pathname of the executable that the
>> intepreter will open with /dev/fd/N.
>>
>> All of this predates Linux entirely.  I do remember this was fixed at
>> some point in Linux but I don't remember the details.  I can just read
>> the solution that was picked in the code.
>>
>>
>>
>> All of this makes me wonder how are the LSMs protected against this
>> attack.
>>
>> Let's see the following LSMS implement brpm_set_creds:
>> tomoyo   - Abuses bprm_set_creds to call tomoyo_load_policy [ safe ]
>> smack    - Requires CAP_MAC_ADMIN to smack setxattrs        [ vulnerable? ]
>>            Uses those xattrs in smack_bprm_set_creds
>
> What is the concern? If the xattrs change after the check,
> the behavior should still be consistent.

The concern is that there are xattrs set on a #! script.  Someone
replaces the script after smack reads the xattr and sets bprm->cred but
before the interpreter reopens the script.

In short if there is one script with xattrs set. I can run any script as
if those xattrs were set on it.

I don't know the smack security model well enough to know if that
is a problem or not.  It looks like it may be a concern because smack
limits who can mess with it's security xattrs.

Eric
Rob Landley May 14, 2020, 6:46 p.m. UTC | #21
On 5/13/20 4:59 PM, Eric W. Biederman wrote:
> Careful with your terminology.  ELF sections are for .o's For
> executables ELF have segments.  And reading through the code it is the
> program segments that are independently relocatable.

Sorry, I have trouble keeping this stuff straight when it's not in front of me.
(I have a paperback copy of the old "linkers and loaders" book and it was the
driest thing I have _ever_ slogged through. Back before the Linux Foundation ate
the FSG I was pushing https://refspecs.linuxbase.org/ to include missing ABI
supplement, I have copies of ones it doesn't collected from now long-dead sites...)

But more recently I've just made puppy eyes at Rich Felker to have him fix this
stuff for me, because I do _not_ retain the terminology here. REL vs RELA vs
PLT, can you have a PLT without a GOT...?

> There is a flag but it is defined per architecture and I don't think one
> of the architectures define it.

They all check for one, but I don't remember there being a #define.

I have a todo item to check more architectures' fdpic binaries, this was from
sh2eb (ala j-core):

  https://github.com/landley/toybox/commit/d61aeaf9e#diff-4442ddbb8949R65

There was the out of tree arm fdpic toolchain from the french guys for cortex-m,
and the original frv paper, and in theory blackfin but nothing they touched ever
got merged upstream anywhere:

In _theory_ you could do fdpic for x86, but as with u-boot for x86 nobody ever
bothers because it's got an x86-only solution. (And then the x86 version of
stuff gets pushed to other platforms because all our device tree files were
GPLed so of course acpi for arm became a thing. Sigh...)

> I looked at ARM and apparently with an MMU ARM turns fdpic binaries into
> PIE executables.  I am not certain why.

Falling back to a more widely tested codepath, I expect. Also maybe it saves 3
registers if all 4 are using the same base register? Map them linearly and it
becomes "single base + offset"? Which of course looses the extra ASLR benefits
the security people wanted, but "undoing what the security people want in the
name of an unmeasurable microbenchmark optimization" is a proud tradition.

Just because the 4 segments are compiled as independently relocatable doesn't
mean they HAVE to be. (You'd think the code would be using different register
numbers to index stuff so you'd STILL be using 4 registers, but I haven't looked
at what arm's doing...)

> The registers passed to the entry point are also different for both
> cases.

From the same machine code chunks? I boggle at what the ld.so fixup is doing then...

> I think it would have been nice if the fdpic support had used a
> different ELF type, instead of a different depending on using a
> different architecture.

This is what you get when a blackfin developer talks to the gnu/binutils developers:

  https://sourceware.org/legacy-ml/binutils/2008-04/msg00350.html

> All that aside the core dumping code looks to be essentially the same
> between binfmt_elf.c and binfmt_elf_fdpic.c.  Do you think people would
> be interested in refactoring binfmt_elf.c and binfmt_elf_fdpic.c so that
> they could share the same core dumping code?

I think merging the two of them together entirely would be a good idea, and
anything that can collapse together I'm happy to regression test on sh2.

I also note that qemu-sh4eb can run these binaries, maybe I can whip up a
qemu-system-sh4eb that runs a nommu fdpic userspace...

[hours later]

Ok, here's me asking Rich Felker a question:

>>> So fdpic binaries run under qemu-sh2eb and there's a qemu-system-sh2eb that
>>> SHOULD also be able to run them under the r2d board emulation, and the kernel
>>> builds fine under the sh2eb compiler but I can't enable fdpic support without
>>> CONFIG_NOMMU, and if I yank that dependency from Kconfig (which only sh2 has,
>>> arm and such do fdpic with or without mmu) the build breaks with:
>>>
>>> /home/landley/toybox/clean/ccc/sh2eb-linux-muslfdpic-cross/bin/sh2eb-linux-muslfdpic-ld:
>>> fs/binfmt_elf_fdpic.o: in function `load_elf_fdpic_binary':
>>> binfmt_elf_fdpic.c:(.text+0x1734): undefined reference to
>>> `elf_fdpic_arch_lay_out_mm'
>>>
>>> The problem is if I switch off CONFIG_MMU in the kernel, buckets of stuff in the
>>> r2d board kernel config changes and suddenly I don't get serial output from the
>>> qemu-system-sh2eb -M r2d boot anymore. Before it was running the kernel but just
>>> failing to run init...

And his response:

>> I don't think qemu-system-sh4eb can boot a nommu kernel. But you don't
>> need to in order to do userspace-only testing. Just build a normal
>> sh4eb kernel. It doesn't need CONFIG_BINFMT_ELF_FDPIC. The normal ELF
>> loader can load FDPIC just fine, because a valid FDPIC ELF file is a
>> valid ELF file, just with more constraints (in same sense a square is
>> a rectangle). The normal ELF loader won't independently float the text
>> and data segments, but that's okay because your emulated system has an
>> MMU and can just map them adjacently like they show up in the ELF file
>> with their untransformed addresses.
>> 
>> Now that I think about it, it's possible that the ARM folks broke this
>> when adding support for enabling CONFIG_BINFMT_ELF_FDPIC with MMU. If
>> so, and you find you really do need the FDPIC loader now because they
>> made the normal ELF loader refuse to do it, I think it will suffice to
>> copy the ARM version of elf_fdpic_arch_lay_out_mm from
>> arch/arm/kernel/elf.c to somewhere it will be compiled on SH.

I.E. testing the kernel fdpic loader under qemu is NOT EASY (because the fdpic
loader refuses to build in a with-mmu context, and the relevant board emulations
refuse to build without), but it can fall back to the conventional ELF loader
which collates the segments and treats fdpic as PIE? (Which... is how qemu-sh2eb
application emulation is loading them...?)

Which was news to me...

> Eric

Rob
diff mbox series

Patch

diff --git a/arch/alpha/kernel/binfmt_loader.c b/arch/alpha/kernel/binfmt_loader.c
index a8d0d6e06526..a90c8b1d5498 100644
--- a/arch/alpha/kernel/binfmt_loader.c
+++ b/arch/alpha/kernel/binfmt_loader.c
@@ -38,7 +38,7 @@  static int load_binary(struct linux_binprm *bprm)
 	retval = prepare_binprm(bprm);
 	if (retval < 0)
 		return retval;
-	return search_binary_handler(bprm);
+	return 1; /* Search for the interpreter */
 }
 
 static struct linux_binfmt loader_format = {
diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index 466497860c62..a9b9ac7f9bb0 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -95,7 +95,7 @@  static int load_em86(struct linux_binprm *bprm)
 	if (retval < 0)
 		return retval;
 
-	return search_binary_handler(bprm);
+	return 1; /* Search for the interpreter */
 }
 
 static struct linux_binfmt em86_format = {
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index cdb45829354d..127fae9c21ab 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -234,10 +234,7 @@  static int load_misc_binary(struct linux_binprm *bprm)
 	if (retval < 0)
 		goto error;
 
-	retval = search_binary_handler(bprm);
-	if (retval < 0)
-		goto error;
-
+	retval = 1; /* Search for the interpreter */
 ret:
 	dput(fmt->dentry);
 	return retval;
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index e9e6a6f4a35f..76a05696d376 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -146,7 +146,7 @@  static int load_script(struct linux_binprm *bprm)
 	retval = prepare_binprm(bprm);
 	if (retval < 0)
 		return retval;
-	return search_binary_handler(bprm);
+	return 1; /* Search for the interpreter */
 }
 
 static struct linux_binfmt script_format = {
diff --git a/fs/exec.c b/fs/exec.c
index 635b5085050c..8bbf5fa785a6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1690,16 +1690,12 @@  EXPORT_SYMBOL(remove_arg_zero);
 /*
  * cycle the list of binary formats handler, until one recognizes the image
  */
-int search_binary_handler(struct linux_binprm *bprm)
+static int search_binary_handler(struct linux_binprm *bprm)
 {
 	bool need_retry = IS_ENABLED(CONFIG_MODULES);
 	struct linux_binfmt *fmt;
 	int retval;
 
-	/* This allows 4 levels of binfmt rewrites before failing hard. */
-	if (bprm->recursion_depth > 5)
-		return -ELOOP;
-
 	retval = security_bprm_check(bprm);
 	if (retval)
 		return retval;
@@ -1712,10 +1708,7 @@  int search_binary_handler(struct linux_binprm *bprm)
 			continue;
 		read_unlock(&binfmt_lock);
 
-		bprm->recursion_depth++;
 		retval = fmt->load_binary(bprm);
-		bprm->recursion_depth--;
-
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
 		if (bprm->point_of_no_return || !bprm->file ||
@@ -1738,12 +1731,11 @@  int search_binary_handler(struct linux_binprm *bprm)
 
 	return retval;
 }
-EXPORT_SYMBOL(search_binary_handler);
 
 static int exec_binprm(struct linux_binprm *bprm)
 {
 	pid_t old_pid, old_vpid;
-	int ret;
+	int ret, depth = 0;
 
 	/* Need to fetch pid before load_binary changes it */
 	old_pid = current->pid;
@@ -1751,7 +1743,13 @@  static int exec_binprm(struct linux_binprm *bprm)
 	old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
 	rcu_read_unlock();
 
-	ret = search_binary_handler(bprm);
+	do {
+		depth++;
+		ret = search_binary_handler(bprm);
+		/* This allows 4 levels of binfmt rewrites before failing hard. */
+		if ((ret > 0) && (depth > 5))
+			ret = -ELOOP;
+	} while (ret > 0);
 	if (ret >= 0) {
 		audit_bprm(bprm);
 		trace_sched_process_exec(current, old_pid, bprm);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 42f760acfc2c..89f1135dcb75 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -47,7 +47,6 @@  struct linux_binprm {
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-	unsigned int recursion_depth; /* only for search_binary_handler() */
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
@@ -118,7 +117,6 @@  extern void unregister_binfmt(struct linux_binfmt *);
 
 extern int prepare_binprm(struct linux_binprm *);
 extern int __must_check remove_arg_zero(struct linux_binprm *);
-extern int search_binary_handler(struct linux_binprm *);
 extern int begin_new_exec(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
 extern void finalize_exec(struct linux_binprm *bprm);