diff mbox series

checkout, clone: die if tree cannot be parsed

Message ID 20220302003613.15567-1-chooglen@google.com (mailing list archive)
State Accepted
Commit 8d2eaf649abd28baa083a723d1e041b26d2be83e
Headers show
Series checkout, clone: die if tree cannot be parsed | expand

Commit Message

Glen Choo March 2, 2022, 12:36 a.m. UTC
When a tree oid is invalid, parse_tree_indirect() can return NULL. Check
for NULL instead of proceeding as though it were a valid pointer and
segfaulting.

Signed-off-by: Glen Choo <chooglen@google.com>
---
At $DAYJOB, this bug was discovered due to some interactions between
"git clone --filter=tree:0" and a buggy server that failed to transfer
certain commits.

In the 'checkout' step of "git clone --filter=tree:0", the repo tries to
get the HEAD commit from the server (since it's not present locally),
but this fails due to an unrelated bug in the server. Since the commit
tree is invalid, parse_tree_indirect() returns NULL, causing
parse_tree(NULL) to segfault.

I tried to write a test for this segfault, but I couldn't quite figure
out how:

- Invalid trees are typically caught pretty early, so I suspect that any
  reproduction scenario would need to replicate the partial clone +
  buggy server setup.
- I couldn't figure out how to replicate the aforementioned buggy setup

I'd appreciate any suggestions on how to test this though :)

Note that there are many other callsites that don't check for NULLs from
parse_tree_indirect(), and some of which are fairly subtle. I wasn't
confident in changing those, so I stayed on the conservative side and
only changed the ones that I could get to segfault.

 builtin/checkout.c | 13 ++++++++++---
 builtin/clone.c    |  2 ++
 2 files changed, 12 insertions(+), 3 deletions(-)


base-commit: 715d08a9e51251ad8290b181b6ac3b9e1f9719d7

Comments

Junio C Hamano March 2, 2022, 7:26 a.m. UTC | #1
Glen Choo <chooglen@google.com> writes:

> Note that there are many other callsites that don't check for NULLs from
> parse_tree_indirect(), and some of which are fairly subtle. I wasn't
> confident in changing those, so I stayed on the conservative side and
> only changed the ones that I could get to segfault.

Thanks.

> -		tree = parse_tree_indirect(old_branch_info->commit ?
> -					   &old_branch_info->commit->object.oid :
> -					   the_hash_algo->empty_tree);
> +
> +		old_commit_oid = old_branch_info->commit ?
> +			&old_branch_info->commit->object.oid :
> +			the_hash_algo->empty_tree;

I guess this is done only so that you can use the object name in the
error message, which is fine.

> +		tree = parse_tree_indirect(old_commit_oid);
> +		if (!tree)
> +			die(_("unable to parse commit %s"),
> +				oid_to_hex(old_commit_oid));

"unable to parse commit" is a bit of a white lie.  In fact, there is
nothing that makes oid_commit_oid the name of a commit object.

"unable to parse object '%s' as a tree" would be more technically
correct, but one random-looking hexadecimal string is almost
indistinguishable from another, and neither would be a very useful
message from the end user's point of view.  I am wondering if we can
use old_branch_info to formulate something easier to understand for
the user.  update_refs_for_switch() seems to compute old_desc as a
human readable name by using old_branch_info->name if available
before falling back to old_branch_info->commit object name, which
might be a source of inspiration.

>  		init_tree_desc(&trees[0], tree->buffer, tree->size);
>  		parse_tree(new_tree);
>  		tree = new_tree;
> diff --git a/builtin/clone.c b/builtin/clone.c
> index a572cda503..0aea177660 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -700,6 +700,8 @@
>  	init_checkout_metadata(&opts.meta, head, &oid, NULL);
>  
>  	tree = parse_tree_indirect(&oid);
> +	if (!tree)
> +		die(_("unable to parse commit %s"), oid_to_hex(&oid));

If we follow the code path, we can positively tell that oid here has
always came from reading HEAD, so "unable to parse HEAD as a tree"
would be an easier version of the message, I think.

Thanks.
Glen Choo March 2, 2022, 7:35 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Glen Choo <chooglen@google.com> writes:
>
>> -		tree = parse_tree_indirect(old_branch_info->commit ?
>> -					   &old_branch_info->commit->object.oid :
>> -					   the_hash_algo->empty_tree);
>> +
>> +		old_commit_oid = old_branch_info->commit ?
>> +			&old_branch_info->commit->object.oid :
>> +			the_hash_algo->empty_tree;
>
> I guess this is done only so that you can use the object name in the
> error message, which is fine.

That's correct.

>> +		tree = parse_tree_indirect(old_commit_oid);
>> +		if (!tree)
>> +			die(_("unable to parse commit %s"),
>> +				oid_to_hex(old_commit_oid));
>
> "unable to parse commit" is a bit of a white lie.  In fact, there is
> nothing that makes oid_commit_oid the name of a commit object.
>
> "unable to parse object '%s' as a tree" would be more technically
> correct,

Hm, yes. With regards to parse_tree_indirect(), "unable to parse object
'%s' as a tree" is a more accurate description of the failure. But since
we know that the oid is a commit in this context, I'm not sure if we
need to offload that much information to the user - if we failed to
parse the given object id in the appropriate manner, the user would
still be directed to figure out what's wrong with the object.

>          but one random-looking hexadecimal string is almost
> indistinguishable from another, and neither would be a very useful
> message from the end user's point of view.  I am wondering if we can
> use old_branch_info to formulate something easier to understand for
> the user.  update_refs_for_switch() seems to compute old_desc as a
> human readable name by using old_branch_info->name if available
> before falling back to old_branch_info->commit object name, which
> might be a source of inspiration.

I think it's actually more helpful to have the oid instead of a
human-readable description like old_branch_info->name.

For an advanced user/repo admin, seeing the oid makes it very obvious
that there's a particular problem with the given object, and this would
direct them to hunt down the object locally (without partial clone) or
on the remote (with partial clone, as in the original motivation). From
there, it's easy to figure out which branch points to the offending
object. The branch name might be misleading - the user would presumably
start with hunting down the ref, then explore several possibilities
before realizing the problem is actually with the _object_.

For a novice user, neither the branch name nor the object id is
actionable because they probably wouldn't be able to fix the issue
anyway. The advantage of the opaque hex string is that by being
intimidating and unrecognizable, it indicates to the user that they
shouldn't try to debug the issue and so they might give up sooner and
ask for help from someone who might be able to fix it.
Johannes Schindelin March 9, 2022, 10:20 p.m. UTC | #3
Hi Glen,

On Wed, 2 Mar 2022, Glen Choo wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> >          but one random-looking hexadecimal string is almost
> > indistinguishable from another, and neither would be a very useful
> > message from the end user's point of view.  I am wondering if we can
> > use old_branch_info to formulate something easier to understand for
> > the user.  update_refs_for_switch() seems to compute old_desc as a
> > human readable name by using old_branch_info->name if available
> > before falling back to old_branch_info->commit object name, which
> > might be a source of inspiration.
>
> I think it's actually more helpful to have the oid instead of a
> human-readable description like old_branch_info->name.

The most helpful would be to have both. That way, it would at least be
potentially possible to figure out from a ref how to fetch a non-corrupt
version from elsewhere.

Ciao,
Dscho
diff mbox series

Patch

diff --git a/builtin/checkout.c b/builtin/checkout.c
index d9b31bbb6d..c1035304a5 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -738,6 +738,7 @@ 
 		struct tree_desc trees[2];
 		struct tree *tree;
 		struct unpack_trees_options topts;
+		const struct object_id *old_commit_oid;
 
 		memset(&topts, 0, sizeof(topts));
 		topts.head_idx = -1;
@@ -765,9 +766,15 @@ 
 				       &new_branch_info->commit->object.oid :
 				       &new_branch_info->oid, NULL);
 		topts.preserve_ignored = !opts->overwrite_ignore;
-		tree = parse_tree_indirect(old_branch_info->commit ?
-					   &old_branch_info->commit->object.oid :
-					   the_hash_algo->empty_tree);
+
+		old_commit_oid = old_branch_info->commit ?
+			&old_branch_info->commit->object.oid :
+			the_hash_algo->empty_tree;
+		tree = parse_tree_indirect(old_commit_oid);
+		if (!tree)
+			die(_("unable to parse commit %s"),
+				oid_to_hex(old_commit_oid));
+
 		init_tree_desc(&trees[0], tree->buffer, tree->size);
 		parse_tree(new_tree);
 		tree = new_tree;
diff --git a/builtin/clone.c b/builtin/clone.c
index a572cda503..0aea177660 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -700,6 +700,8 @@ 
 	init_checkout_metadata(&opts.meta, head, &oid, NULL);
 
 	tree = parse_tree_indirect(&oid);
+	if (!tree)
+		die(_("unable to parse commit %s"), oid_to_hex(&oid));
 	parse_tree(tree);
 	init_tree_desc(&t, tree->buffer, tree->size);
 	if (unpack_trees(1, &t, &opts) < 0)