Segmentation Fault on non-commit --branch clone

Message ID	20191101002432.GA49846@carpenter.lan (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=gx+g=YZ=vger.kernel.org=git-owner@kernel.org> Date: Fri, 1 Nov 2019 01:24:32 +0100 From: Davide Berardi <berardi.dav@gmail.com> To: git@vger.kernel.org Cc: Johannes.Schindelin@gmx.de, peff@peff.net, gitster@pobox.com Subject: [PATCH] Segmentation Fault on non-commit --branch clone Message-ID: <20191101002432.GA49846@carpenter.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline User-Agent: Mutt/1.11.4 (2019-03-13) Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Segmentation Fault on non-commit --branch clone \| expand Segmentation Fault on non-commit --branch clone

Davide Berardi Nov. 1, 2019, 12:24 a.m. UTC

Fixed segmentation fault that can be triggered using
$ git clone --branch $object $repository
with object pointing to a non-commit ref (e.g. a blob).

Signed-off-by: Davide Berardi <berardi.dav@gmail.com>

---
 builtin/clone.c         | 26 ++++++++++++++++++++++++++
 refs.h                  |  7 +++++++
 t/t5609-clone-branch.sh | 22 +++++++++++++++++++++-
 3 files changed, 54 insertions(+), 1 deletion(-)

Johannes Schindelin Nov. 1, 2019, 7:08 p.m. UTC | #1

Hi Davide,

I wonder whether you might want to reword the Subject: such that it
reflects what the patch does instead of what the problem is that
motivated you to work on the patch (the patch does not cause the
segmentation fault, after all, but it fixes it).

On Fri, 1 Nov 2019, Davide Berardi wrote:

> Fixed segmentation fault that can be triggered using
> $ git clone --branch $object $repository
> with object pointing to a non-commit ref (e.g. a blob).
>
> Signed-off-by: Davide Berardi <berardi.dav@gmail.com>
>
> ---
> builtin/clone.c         | 26 ++++++++++++++++++++++++++
> refs.h                  |  7 +++++++
> t/t5609-clone-branch.sh | 22 +++++++++++++++++++++-
> 3 files changed, 54 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/clone.c b/builtin/clone.c
> index f665b28ccc..0f4a18302c 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -704,11 +704,32 @@ static void update_remote_refs(const struct ref *refs,
> 	}
> }
>
> +static int fallback_on_noncommit(const struct ref *check,
> +				 const struct ref *remote,
> +				 const char *msg)
> +{
> +	if (check == NULL)
> +		return 1;
> +	struct commit *c = lookup_commit_reference_gently(the_repository,
> +						   &check->old_oid, 1);
> +	if (c == NULL) {
> +		/* Fallback HEAD to fallback refs */
> +		warning(_("%s is not a valid commit object, HEAD will fallback
> to %s"),
> +			check->name, FALLBACK_REF);

Quite honestly, I do not think that it is a good idea to fall back in
this case. The user asked for something that cannot be accomplished, and
the best way to handle this is to exit with an error, i.e. `die()`.

> +		update_ref(msg, FALLBACK_REF, &remote->old_oid, NULL,
> +			   REF_NO_DEREF, UPDATE_REFS_DIE_ON_ERR);
> +	}
> +
> +	return c == NULL;
> +}
> +
> static void update_head(const struct ref *our, const struct ref *remote,
> 			const char *msg)
> {
> 	const char *head;
> 	if (our && skip_prefix(our->name, "refs/heads/", &head)) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)
> +			return;
> 		/* Local default branch link */
> 		if (create_symref("HEAD", our->name, NULL) < 0)
> 			die(_("unable to update HEAD"));
> @@ -718,12 +739,17 @@ static void update_head(const struct ref *our, const
> struct ref *remote,
> 			install_branch_config(0, head, option_origin,
> 		our->name);
> 		}
> 	} else if (our) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)
> +			return;
> +		/* here commit is valid */
> 		struct commit *c = lookup_commit_reference(the_repository,
> 							   &our->old_oid);
> 		/* --branch specifies a non-branch (i.e. tags), detach HEAD */
> 		update_ref(msg, "HEAD", &c->object.oid, NULL, REF_NO_DEREF,
> 			   UPDATE_REFS_DIE_ON_ERR);
> 	} else if (remote) {
> +		if (fallback_on_noncommit(remote, remote, msg) != 0)
> +			return;
> 		/*
> 		 * We know remote HEAD points to a non-branch, or
> 		 * HEAD points to a branch but we don't know which one.
> diff --git a/refs.h b/refs.h
> index 730d05ad91..35ee6eb058 100644
> --- a/refs.h
> +++ b/refs.h
> @@ -497,6 +497,13 @@ enum action_on_err {
> 	UPDATE_REFS_QUIET_ON_ERR
> };
>
> +/*
> + * In case of a remote HEAD pointing to a non-commit update_head
> + * will make HEAD reference fallback to this value, master ref
> + * should be safe.
> + */
> +#define FALLBACK_REF "refs/heads/master"
> +
> /*
>  * Begin a reference transaction.  The reference transaction must
>  * be freed by calling ref_transaction_free().
> diff --git a/t/t5609-clone-branch.sh b/t/t5609-clone-branch.sh
> index 6e7a7be052..0680cf5a89 100755
> --- a/t/t5609-clone-branch.sh
> +++ b/t/t5609-clone-branch.sh
> @@ -20,7 +20,13 @@ test_expect_success 'setup' '
> 	 echo one >file && git add file && git commit -m one &&
> 	 git checkout -b two &&
> 	 echo two >file && git add file && git commit -m two &&
> -	 git checkout master) &&
> +	 git checkout master &&
> +	 # Create dummy objects
> +	 _B=$(git rev-list --objects --all | grep -m 1 "^[^ ]\+ [^ ]\+" | \

Hmm. That naming convention (`_B`) is a new one, and does not really
align with the existing code in the test suite, does it?

Further, it seems that you pick some semi-random object from the object
database. I think you can do better: if you want to get the hash of a
blob, use `git rev-parse HEAD:file`, if you want the hash of a tree, use
the same command with `HEAD:`, if you want a commit, with `HEAD`.

> +	      awk "{print \$1}") &&
> +	 echo "${_B}" >> .git/refs/tags/broken-tag &&

In the Git test suite, we only use curlies to expand shell variables
when necessary, i.e. when followed by a valid variable name character.

> +	 echo "${_B}" >> .git/refs/heads/broken-head
> +	) &&
> 	mkdir empty &&
> 	(cd empty && git init)
> '
> @@ -67,4 +73,18 @@ test_expect_success 'clone -b not allowed with empty repos'
> '
> 	test_must_fail git clone -b branch empty clone-branch-empty
> '
>
> +test_expect_success 'clone -b with broken tag will fallback to master' '
> +	git clone -b broken-tag parent clone-broken-tag &&

As I said earlier, I think this should fail, i.e. something like

	test_must_fail git clone ... 2>err &&
	test_i18ngrep [something-from-that-error-message] err

Thanks,
Johannes

> +	(cd clone-broken-tag &&
> +	 check_HEAD master
> +	)
> +'
> +
> +test_expect_success 'clone -b with broken head will fallback to master' '
> +	git clone -b broken-head parent clone-broken-head &&
> +	(cd clone-broken-head &&
> +	 check_HEAD master
> +	)
> +'
> +
> test_done
> --
> 2.22.0
>
>
>

Jeff King Nov. 1, 2019, 7:35 p.m. UTC | #2

On Fri, Nov 01, 2019 at 08:08:10PM +0100, Johannes Schindelin wrote:

> > +static int fallback_on_noncommit(const struct ref *check,
> > +				 const struct ref *remote,
> > +				 const char *msg)
> > +{
> > +	if (check == NULL)
> > +		return 1;
> > +	struct commit *c = lookup_commit_reference_gently(the_repository,
> > +						   &check->old_oid, 1);
> > +	if (c == NULL) {
> > +		/* Fallback HEAD to fallback refs */
> > +		warning(_("%s is not a valid commit object, HEAD will fallback
> > to %s"),
> > +			check->name, FALLBACK_REF);
> 
> Quite honestly, I do not think that it is a good idea to fall back in
> this case. The user asked for something that cannot be accomplished, and
> the best way to handle this is to exit with an error, i.e. `die()`.

The main reason I proposed falling back here is that the user can
correct the situation without having to redo the clone from scratch
(which might have been very expensive). And we cannot just leave HEAD
empty there; we have to put _something_ in it.

I do think it's important, though, that we don't just fall back; we
should still report an error exit from the program (just as we do for
the similar case when clone's checkout step fails). Otherwise something
as simple as:

  git clone -b $url repo &&
  cd repo &&
  do_something

could have quite unexpected results.

I don't know how often this would actually help users, though. It _is_ a
pretty rare situation to ask for a non-commit. So maybe it's all
over-engineering, and we should start with just die(). If somebody comes
along later and wants to enhance it, it should be pretty
straightforward.

-Peff

Jeff King Nov. 1, 2019, 7:43 p.m. UTC | #3

On Fri, Nov 01, 2019 at 01:24:32AM +0100, Davide Berardi wrote:

> Fixed segmentation fault that can be triggered using
> $ git clone --branch $object $repository
> with object pointing to a non-commit ref (e.g. a blob).

Thanks for working on this. I left some thoughts on the overall fallback
scheme in the other part of the thread, and other than I agree with all
of Dscho's review.

A few other comments:

> +static int fallback_on_noncommit(const struct ref *check,
> +				 const struct ref *remote,
> +				 const char *msg)
> +{
> +	if (check == NULL)
> +		return 1;

I wondered in what circumstances "check" would be NULL. In the first
conditional we pass "our" after checking it's non-NULL:

> 	if (our && skip_prefix(our->name, "refs/heads/", &head)) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)

and then again in the second case-arm:

> 	} else if (our) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)

and then in the third we pass remote after checking that it's not NULL:

> 	} else if (remote) {
> +		if (fallback_on_noncommit(remote, remote, msg) != 0)
> +			return;

So I think this NULL check can go away. In general I don't mind
functions being defensive, but it's hard to decide whether it's doing
the right thing since it's not a case we think can come up (it could be
marked with a BUG() assertion, but IMHO it's not worth it; most
functions require their arguments to be non-NULL, so checking it would
be unusual in our code base).

> +static int fallback_on_noncommit(const struct ref *check,
> +				 const struct ref *remote,
> +				 const char *msg)
> [...]
> +	return c == NULL;

The return value for this function is unusual for our code base. If it's
just an error return, we'd usually use "0" for success and a negative
value for errors (i.e., mimicking system calls).

> diff --git a/refs.h b/refs.h
> index 730d05ad91..35ee6eb058 100644
> --- a/refs.h
> +++ b/refs.h
> @@ -497,6 +497,13 @@ enum action_on_err {
> 	UPDATE_REFS_QUIET_ON_ERR
> };
> 
> +/*
> + * In case of a remote HEAD pointing to a non-commit update_head
> + * will make HEAD reference fallback to this value, master ref
> + * should be safe.
> + */
> +#define FALLBACK_REF "refs/heads/master"
> +
> /*

Since this is only used in one spot, I think it's better to keep it
localized to that function. E.g., with:

  static const char fallback_ref[] = "refs/heads/master";

That way it's clear that no other code depends on it.

-Peff

Junio C Hamano Nov. 2, 2019, 9:18 a.m. UTC | #4

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi Davide,
>
> I wonder whether you might want to reword the Subject: such that it
> reflects what the patch does instead of what the problem is that
> motivated you to work on the patch (the patch does not cause the
> segmentation fault, after all, but it fixes it).

Good point.

    Subject: clone: do not segfault on "clone -b B" where B is a non-commit

    The code in "git clone -b B" to decide what branch to check out
    assumed that B points at a commit object without checking,
    leading to dereferencing a NULL pointer and causing a segfault.

    Just aborting the operation when it happens is not a very
    attractive option because we would be left with a directory
    without .git/HEAD that cannot be used as a valid repository the
    user can attempt to recover from the situation by checking out
    something.

    Fall back to use the 'master' branch, which is what we use when
    the command without the "-b" option cannot figure out what
    branch the remote side points with its HEAD.

or something like that, perhaps?

I am not sure if the existing code is careful enough setting up the
resulting local 'master' branch, or needs more changes associated
with this patch, though.  For example, does it want to be set to
integrate with the 'master' branch on the remote side by setting
"branch.master.remote" and "branch.master.merge" configuration, or
do we want to turn them off?  I _think_ the answer is that we want
to behave as if the user said "-b master" instead of "-b B" (with B
that does not point at a commit), but I am not sure.  Also, don't we
try to be a bit noisier when the fallback fails?  For example, if
the user said "clone -b master" and the 'master' points at an object
that is not a commit, falling back and writing refs/heads/master in
the HEAD would leave us in the same position as we did not have any
fallback.

I skimmed your review and I think I agree with most (if not all) of
them.  Thanks.

Junio C Hamano Nov. 2, 2019, 10:07 a.m. UTC | #5

Davide Berardi <berardi.dav@gmail.com> writes:

> +static int fallback_on_noncommit(const struct ref *check,
> +				 const struct ref *remote,
> +				 const char *msg)
> +{
> +	if (check == NULL)
> +		return 1;
> +	struct commit *c = lookup_commit_reference_gently(the_repository,
> +						   &check->old_oid, 1);

This is decl-after-stmt.  Can check be NULL, though?  IOW, the first
two lines in this function should go.

> +	if (c == NULL) {

We tend to say "if (!c) {" instead.

> +		/* Fallback HEAD to fallback refs */

You are falling back to just a single ref (i.e. s/refs/ref/) but
more importantly, what the code is doing is obvious enough without
this comment.  What we want commenting on is _why_ we do this.

	/*
	 * The ref specified to be checked out does not point at a
	 * commit so pointing HEAD at it will leave us a broken
         * repository.  But we need to have _something_ plausible
	 * in HEAD---otherwise the result would not be a repository.
	 */

would explain why we point HEAD to the default 'master' branch.

> +		warning(_("%s is not a valid commit object, HEAD will fallback to %s"),
> +			check->name, FALLBACK_REF);
> +		update_ref(msg, FALLBACK_REF, &remote->old_oid, NULL,
> +			   REF_NO_DEREF, UPDATE_REFS_DIE_ON_ERR);

Having said that, I was hoping that this would use the help from the
guess_remote_head() like the case without "-b" option does, so that
we do not have to use a hardcoded default.

> +	}
> +
> +	return c == NULL;

Our API convention is 0 for success and negavive for failure.

> +}
> +
> static void update_head(const struct ref *our, const struct ref *remote,
> 			const char *msg)
> {
> 	const char *head;
> 	if (our && skip_prefix(our->name, "refs/heads/", &head)) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)
> +			return;
> 		/* Local default branch link */
> 		if (create_symref("HEAD", our->name, NULL) < 0)
> 			die(_("unable to update HEAD"));
> @@ -718,12 +739,17 @@ static void update_head(const str

Here in the patch context is a "do this in a non-bare repository"
guard, and crucially, we do this:

> 			install_branch_config(0, head, option_origin, our->name);

That is, we add configuration for our->name (which is now
"refs/heads/master"), but I do not think we updated any of the other
field in *our to make the world a consistent place.  Is the object
pointed at by our local refs/heads/master in a reasonable
relationship with the object at the tip of the 'master' branch at
the remote site, or is can totally be unrelated because we made no
attempt to make our->old_oid or whatever field consistent with the
"corrected" reality?

Given the potential complication, and given that we are doing a
corretive action only so that we leave some repository the user can
fix manually, I am inclined to say that we should avoid falling into
this codepath.  I'll wonder about a totally different approach later
in this message that makes the fallback_on_noncommit() helper and
change to these existing parts of the update_head() function
unnecessary.

> 		}
> 	} else if (our) {
> +		if (fallback_on_noncommit(our, remote, msg) != 0)
> +			return;
> +		/* here commit is valid */
> 		struct commit *c = lookup_commit_reference(the_repository,
> 							   &our->old_oid);

What makes us certain that commit is valid?  our->old_oid is not
adjusted by the fallback_on_noncommit() function, but we did check
if it is a commit by doing the exact same lookup_commit_reference()
in there already, and we know it found a commit (otherwise the
function would have returned a non-zero to signal an error).

But it also means that we are making a redundant and unnecessary
call if the code is structured better.

This makes me wonder why we are not adding a single call to a helper
function at the beginning of the function, something like

	const struct oid *tip = NULL;
	struct commit *tip_commit = NULL;

	if (our)
		tip = &our->old_oid;
	else if (remote)
		tip = &remote->old_oid;

	if (!tip)
		return;

	tip_commit = lookup_commit_reference_gently(the_repository, tip);

Then, if !tip_commit, we know we need to fall back to something.  I
actually think it would probably be cleanest if we added

	if (!tip_commit) {
		/*
		 * The given non-commit cannot be checked out,
                 * so have a 'master' branch and leave it unborn.
                 */
		warning(_"non-commit cannot be checked out");
		create_symref("HEAD", "refs/heads/master", msg);
		return;
	}

That is, we always check out an unborn 'master' branch (which would
yield another warning at the beginning of checkout() function, which
is what we want) doing minimum damage, without even configuring any
remote tracking information.

The rest of the update_head() function could be left as-is, but with
the "see what we would be checking out first" approach, we probably
can lose some code (like the call to lookup_commit_reference() in
the "detached HEAD" case), without adding any extra logic.

> 	} else if (remote) {
> +		if (fallback_on_noncommit(remote, remote, msg) != 0)
> +			return;
> 		/*
> 		 * We know remote HEAD points to a non-branch, or
> 		 * HEAD points to a branch but we don't know which one.

Thanks.

Junio C Hamano Nov. 2, 2019, 10:16 a.m. UTC | #6

Jeff King <peff@peff.net> writes:

> I don't know how often this would actually help users, though. It _is_ a
> pretty rare situation to ask for a non-commit. So maybe it's all
> over-engineering, and we should start with just die(). If somebody comes
> along later and wants to enhance it, it should be pretty
> straightforward.

I like that; after update_head() finishes, there are a few clean-up
things that the caller wants to do besides a checkout() call, but if
we make update_head() return a failure, perhaps the caller side
change would be as small as the attached patch.  That would go nicely
with the "make the result just barely usable" approach of leaving an
unborn master branch I suggested in a separate message, I would think.

 builtin/clone.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index c46ee29f0a..fa0558fa3e 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1246,7 +1246,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 			   branch_top.buf, reflog_msg.buf, transport,
 			   !is_local, filter_options.choice);
 
-	update_head(our_head_points_at, remote_head, reflog_msg.buf);
+	err = update_head(our_head_points_at, remote_head, reflog_msg.buf) < 0;
 
 	/*
 	 * We want to show progress for recursive submodule clones iff
@@ -1265,8 +1265,10 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	}
 
 	junk_mode = JUNK_LEAVE_REPO;
-	fetch_if_missing = 1;
-	err = checkout(submodule_progress);
+	if (!err) {
+		fetch_if_missing = 1;
+		err = checkout(submodule_progress);
+	}
 
 	strbuf_release(&reflog_msg);
 	strbuf_release(&branch_top);

Davide Berardi Nov. 3, 2019, 6:16 p.m. UTC | #7

On Sat, Nov 02, 2019 at 07:16:23PM +0900, Junio C Hamano wrote:
>Jeff King <peff@peff.net> writes:
>
>> I don't know how often this would actually help users, though. It _is_ a
>> pretty rare situation to ask for a non-commit. So maybe it's all
>> over-engineering, and we should start with just die(). If somebody comes
>> along later and wants to enhance it, it should be pretty
>> straightforward.
>
>I like that; after update_head() finishes, there are a few clean-up
>things that the caller wants to do besides a checkout() call, but if
>we make update_head() return a failure, perhaps the caller side
>change would be as small as the attached patch.  That would go nicely
>with the "make the result just barely usable" approach of leaving an
>unborn master branch I suggested in a separate message, I would think.
>
Thank you all for your precious comments, I've tried to implement
your suggestions and I've sent the patch here.

The problem with the proposed approach was that the code was
incompatible with some tests (specifically the tests which specifies an
empty .git directory would fail and fallback to the unborn master
branch).

The lookup commit have two error-paths:

1. the commit cannot be found;
2. the commit is found and cannot be casted to a commit (whoops!).

so, I've returned the second condition using an auxiliary variable and
declaring a new lookup_commit function keeping compatibility with the
old one.

I'm sorry for my errors but I'm far for an expert of git internals,
thank you (all) for your time and kindness.

ciao,
D.

Junio C Hamano Nov. 4, 2019, 3:55 a.m. UTC | #8

Davide Berardi <berardi.dav@gmail.com> writes:

> The problem with the proposed approach was that the code was
> incompatible with some tests (specifically the tests which specifies an
> empty .git directory would fail and fallback to the unborn master
> branch).
>
> The lookup commit have two error-paths:
>
> 1. the commit cannot be found;
> 2. the commit is found and cannot be casted to a commit (whoops!).
>
> so, I've returned the second condition using an auxiliary variable and
> declaring a new lookup_commit function keeping compatibility with the
> old one.

It's more like three, I think.

  1a. The given object name is a sentinel, "no such object", value.

  1b. The given object name is meant to name a real object, but
      there is no object with such a name in the repository.

  2.  The given object name names an existing object, but it does
      not peel to a commit.

Traditionally, we had only lookup_commit() and died on any of the
above.  Later, we added the _gently() variant for callers that want
to use a commit and also want to handle an error case where the
object name they are handed by their callers does not peel to a
commit.

In the "unborn repository, empty .git" case, are you getting a
random object name, or null_oid (aka case 1a. above)?  If that is
the case, then your solution to introduce another variant of
lookup_commit() that takes *err parameter is a wrong approach would
not differenciate between 1a. and 1b., which would not help, as I
suspect that we still do want to treat 1b. as an error.

Wouldn't it be cleaner to catch 1a. upfront, e.g.

	if (our)
		tip = &our->old_oid;
	else if (remote)
		tip = &remote->old_oid;
	else
		return NULL;

	if (is_null_oid(tip))
        	return NULL;

	tip_commit = lookup_commit_reference_gently(the_repository, tip, 1);
	if (!tip_commit) {
		warning(...);
		create_symref(...);
		return NULL;
	}

I am not offhand sure if the places we return NULL upfront want to
also create HEAD symref, or that is something automatically happens
for us in the downstream of this function, though.

Thanks.

Segmentation Fault on non-commit --branch clone

Commit Message

Comments

Patch