diff mbox series

[v2,10/10] tag: don't misreport type of tagged objects in errors

Message ID patch-10.11-a84f670ac24-20210328T021238Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series improve reporting of unexpected objects | expand

Commit Message

Ævar Arnfjörð Bjarmason March 28, 2021, 2:13 a.m. UTC
Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent
objects, 2005-06-21) (yes, that ancient!) and correctly report an
error on a tag like:

    object <a tree hash>
    type commit

As:

    error: object <a tree hash> is tree, not a commit

Instead of our long-standing misbehavior of inverting the two, and
reporting:

    error: object <a tree hash> is commit, not a tree

Which, as can be trivially seen with 'git cat-file -t <a tree hash>'
is incorrect.

The reason for this misreporting is that in parse_tag_buffer() we end
up doing a lookup_{blob,commit,tag,tree}() depending on what we read
out of the "type" line.

If we haven't parsed that object before we end up dispatching to the
type-specific lookup functions, e.g. this for commit.c in
lookup_commit_type():

	struct object *obj = lookup_object(r, oid);
	if (!obj)
		return create_object(r, oid, alloc_commit_node(r));

Its allocation will then set the obj->type according to what the tag
told us the type was, but which we've never validated. At this point
we've got an object in memory that hasn't been parsed, and whose type
is incorrect, since we mistrusted a tag to tell us the type.

Then when we actually load the object with parse_object() we read it
and find that it's a "tree". See 8ff226a9d5e (add object_as_type
helper for casting objects, 2014-07-13) for that behavior (that's just
a refactoring commit, but shows all the code involved).

Which explains why we inverted the error report. Normally when
object_as_type() is called it's by the lookup_{blob,commit,tag,tree}()
functions via parse_object(). At that point we can trust the
obj->type.

In the case of parsing objects we've learned about via a tag with an
incorrect type it's the opposite, the obj->type isn't correct and
holds the mislabeled type, but we're parsing the object and know for
sure what object type we're dealing with.

Hence the non-intuitive solution of adding a
lookup_{blob,commit,tag,tree}_type() function. It's to distinguish
calls from parse_object_buffer() where we actually know the type, from
a parse_tag_buffer() where we're just guessing about the type.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 blob.c                                 | 16 +++++++++++++++-
 blob.h                                 |  3 +++
 commit.c                               | 14 +++++++++++++-
 commit.h                               |  2 ++
 object.c                               |  8 ++++----
 t/t6102-rev-list-unexpected-objects.sh | 16 ++++++++--------
 tag.c                                  | 14 +++++++++++++-
 tag.h                                  |  2 ++
 tree.c                                 | 14 +++++++++++++-
 tree.h                                 |  2 ++
 10 files changed, 75 insertions(+), 16 deletions(-)

Comments

Junio C Hamano March 30, 2021, 5:50 a.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent
> objects, 2005-06-21) (yes, that ancient!) and correctly report an
> error on a tag like:
>
>     object <a tree hash>
>     type commit
>
> As:
>
>     error: object <a tree hash> is tree, not a commit
>
> Instead of our long-standing misbehavior of inverting the two, and
> reporting:
>
>     error: object <a tree hash> is commit, not a tree
>
> Which, as can be trivially seen with 'git cat-file -t <a tree hash>'
> is incorrect.

Hmph, I've always thought it is just "supposed to be a" missing in
the sentence ;-)

> Hence the non-intuitive solution of adding a
> lookup_{blob,commit,tag,tree}_type() function. It's to distinguish
> calls from parse_object_buffer() where we actually know the type, from
> a parse_tag_buffer() where we're just guessing about the type.

I think it makes sense to allow the caller to express distinction
between "I know that this object is a blob, because I just read its
object header" and "Another object tells me that this object must be
a blob, because it is in a tree entry whose mode bits are 100644".

I wish we found a set of names better than lookup_<type>_type() for
that, though.  It's just between

      lookup_tag_type(r, oid, OBJ_NONE);
      lookup_tag_type(r, oid, OBJ_TAG);

I cannot quite tell which one is which.  I also wonder if the last
arg should just be a boolean ("I know it is a tag" vs "I heard it
must be a tag").
Jeff King March 31, 2021, 11:02 a.m. UTC | #2
On Mon, Mar 29, 2021 at 10:50:18PM -0700, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> 
> > Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent
> > objects, 2005-06-21) (yes, that ancient!) and correctly report an
> > error on a tag like:
> >
> >     object <a tree hash>
> >     type commit
> >
> > As:
> >
> >     error: object <a tree hash> is tree, not a commit
> >
> > Instead of our long-standing misbehavior of inverting the two, and
> > reporting:
> >
> >     error: object <a tree hash> is commit, not a tree
> >
> > Which, as can be trivially seen with 'git cat-file -t <a tree hash>'
> > is incorrect.
> 
> Hmph, I've always thought it is just "supposed to be a" missing in
> the sentence ;-)

So going with the discussion elsewhere in the thread, I'd probably say
something like:

  error: object <oid> seen as both a commit and a tree

which precisely says what we do know, without implying which is correct.

Ævar's patch tries to improve the case where we might _know_ which is
correct (because we're actually parsing the object contents), but of
course it covers only a fraction of cases. I'm not really opposed to
that per se, but I probably wouldn't bother myself.

  Side note: this is all making the assumption that what is in the
  object itself is "correct", but of course that is not necessarily
  true, even. All of these cases are the result of bugs, so it is
  possible that the bug was in the writing of the original object
  contents, and not the object that is referring to it. Likewise, I'd
  imagine an easy way to get into this situation is with a bogus
  refs/replace object that switches type.

> > Hence the non-intuitive solution of adding a
> > lookup_{blob,commit,tag,tree}_type() function. It's to distinguish
> > calls from parse_object_buffer() where we actually know the type, from
> > a parse_tag_buffer() where we're just guessing about the type.
> 
> I think it makes sense to allow the caller to express distinction
> between "I know that this object is a blob, because I just read its
> object header" and "Another object tells me that this object must be
> a blob, because it is in a tree entry whose mode bits are 100644".
> 
> I wish we found a set of names better than lookup_<type>_type() for
> that, though.  It's just between
> 
>       lookup_tag_type(r, oid, OBJ_NONE);
>       lookup_tag_type(r, oid, OBJ_TAG);
> 
> I cannot quite tell which one is which.  I also wonder if the last
> arg should just be a boolean ("I know it is a tag" vs "I heard it
> must be a tag").

Yeah, I also found that very confusing. AFAICT lookup_tag_type() would
only ever see OBJ_NONE or OBJ_TAG. Making it more than a boolean makes
both the interface and implementation more complicated.

I also think the manual handling of OBJ_NONE in each lookup_* function
is confusing. They all call object_as_type() because the point of that
function is both to type-check the struct and to convert it away from
OBJ_NONE.

If we handled this error there, then I think it would be much more
natural, because we'd have already covered the OBJ_NONE case, and
because it's already the place we're emitting the existing error. E.g.:

diff --git a/object.c b/object.c
index 2c32691dc4..e6345541f7 100644
--- a/object.c
+++ b/object.c
@@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o)
 	return obj;
 }
 
-void *object_as_type(struct object *obj, enum object_type type, int quiet)
+void *object_as_type(struct object *obj, enum object_type type, unsigned flags)
 {
 	if (obj->type == type)
 		return obj;
@@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
 		return obj;
 	}
 	else {
-		if (!quiet)
-			error(_("object %s is a %s, not a %s"),
-			      oid_to_hex(&obj->oid),
-			      type_name(obj->type), type_name(type));
+		if (!(flags & OBJECT_AS_TYPE_QUIET)) {
+			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
+				error(_("object %s is a %s, but was referred to as a %s"),
+				      oid_to_hex(&obj->oid), type_name(obj->type),
+				      type_name(type));
+			else
+				error(_("object %s referred to as both a %s and a %s"),
+				      oid_to_hex(&obj->oid),
+				      type_name(obj->type), type_name(type));
+		}
 		return NULL;
 	}
 }

-Peff
Junio C Hamano March 31, 2021, 6:05 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> I also think the manual handling of OBJ_NONE in each lookup_* function
> is confusing. They all call object_as_type() because the point of that
> function is both to type-check the struct and to convert it away from
> OBJ_NONE.
>
> If we handled this error there, then I think it would be much more
> natural, because we'd have already covered the OBJ_NONE case, and
> because it's already the place we're emitting the existing error. E.g.:

This makes quite a lot of sense.  If presented with this simple
change and 10-patch series at the same time and are told that the
goal of the changes were more or less the same, I'd pick this one
100% of the time.

> diff --git a/object.c b/object.c
> index 2c32691dc4..e6345541f7 100644
> --- a/object.c
> +++ b/object.c
> @@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o)
>  	return obj;
>  }
>  
> -void *object_as_type(struct object *obj, enum object_type type, int quiet)
> +void *object_as_type(struct object *obj, enum object_type type, unsigned flags)
>  {
>  	if (obj->type == type)
>  		return obj;
> @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
>  		return obj;
>  	}
>  	else {
> -		if (!quiet)
> -			error(_("object %s is a %s, not a %s"),
> -			      oid_to_hex(&obj->oid),
> -			      type_name(obj->type), type_name(type));
> +		if (!(flags & OBJECT_AS_TYPE_QUIET)) {
> +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
> +				error(_("object %s is a %s, but was referred to as a %s"),
> +				      oid_to_hex(&obj->oid), type_name(obj->type),
> +				      type_name(type));
> +			else
> +				error(_("object %s referred to as both a %s and a %s"),
> +				      oid_to_hex(&obj->oid),
> +				      type_name(obj->type), type_name(type));
> +		}
>  		return NULL;
>  	}
>  }
>
> -Peff
Ævar Arnfjörð Bjarmason March 31, 2021, 6:31 p.m. UTC | #4
On Wed, Mar 31 2021, Jeff King wrote:

> On Mon, Mar 29, 2021 at 10:50:18PM -0700, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>> 
>> > Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent
>> > objects, 2005-06-21) (yes, that ancient!) and correctly report an
>> > error on a tag like:
>> >
>> >     object <a tree hash>
>> >     type commit
>> >
>> > As:
>> >
>> >     error: object <a tree hash> is tree, not a commit
>> >
>> > Instead of our long-standing misbehavior of inverting the two, and
>> > reporting:
>> >
>> >     error: object <a tree hash> is commit, not a tree
>> >
>> > Which, as can be trivially seen with 'git cat-file -t <a tree hash>'
>> > is incorrect.
>> 
>> Hmph, I've always thought it is just "supposed to be a" missing in
>> the sentence ;-)
>
> So going with the discussion elsewhere in the thread, I'd probably say
> something like:
>
>   error: object <oid> seen as both a commit and a tree
>
> which precisely says what we do know, without implying which is correct.
>
> Ævar's patch tries to improve the case where we might _know_ which is
> correct (because we're actually parsing the object contents), but of
> course it covers only a fraction of cases. I'm not really opposed to
> that per se, but I probably wouldn't bother myself.

What fraction of cases? As far as I can tell it covers all cases where
we get this error.

If there is a case like what you're describing I haven't found it.

I.e. it happens when we have an un-parsed "struct object" whose type is
inferred, and parse it to find out it's not what we expected.

It's not ambigious at all what the object actually is. It's just that
the previous code was leaking the *assumption* about the type at the
time of emitting the error, due to an apparent oversight with parsed
v.s. non-parsed.

Or in other words, we're leaking the implementation detail that we
pre-allocated an object struct of a given type in anticipation of
holding a parsed version of that object soon.

>   Side note: this is all making the assumption that what is in the
>   object itself is "correct", but of course that is not necessarily
>   true, even. All of these cases are the result of bugs, so it is
>   possible that the bug was in the writing of the original object
>   contents, and not the object that is referring to it. Likewise, I'd
>   imagine an easy way to get into this situation is with a bogus
>   refs/replace object that switches type.

Perhaps, I haven't tested that in any detail.

>> > Hence the non-intuitive solution of adding a
>> > lookup_{blob,commit,tag,tree}_type() function. It's to distinguish
>> > calls from parse_object_buffer() where we actually know the type, from
>> > a parse_tag_buffer() where we're just guessing about the type.
>> 
>> I think it makes sense to allow the caller to express distinction
>> between "I know that this object is a blob, because I just read its
>> object header" and "Another object tells me that this object must be
>> a blob, because it is in a tree entry whose mode bits are 100644".
>> 
>> I wish we found a set of names better than lookup_<type>_type() for
>> that, though.  It's just between
>> 
>>       lookup_tag_type(r, oid, OBJ_NONE);
>>       lookup_tag_type(r, oid, OBJ_TAG);
>> 
>> I cannot quite tell which one is which.  I also wonder if the last
>> arg should just be a boolean ("I know it is a tag" vs "I heard it
>> must be a tag").
>
> Yeah, I also found that very confusing. AFAICT lookup_tag_type() would
> only ever see OBJ_NONE or OBJ_TAG. Making it more than a boolean makes
> both the interface and implementation more complicated.

I don't feel strongly either way, but one concern here is that these are
very hot functions, and maybe it's better to give the compiler a better
chance to work with them without considering an extra argument, but I
haven't tested that...

> I also think the manual handling of OBJ_NONE in each lookup_* function
> is confusing. They all call object_as_type() because the point of that
> function is both to type-check the struct and to convert it away from
> OBJ_NONE.
>
> If we handled this error there, then I think it would be much more
> natural, because we'd have already covered the OBJ_NONE case, and
> because it's already the place we're emitting the existing error. E.g.:
>
> diff --git a/object.c b/object.c
> index 2c32691dc4..e6345541f7 100644
> --- a/object.c
> +++ b/object.c
> @@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o)
>  	return obj;
>  }
>  
> -void *object_as_type(struct object *obj, enum object_type type, int quiet)
> +void *object_as_type(struct object *obj, enum object_type type, unsigned flags)
>  {
>  	if (obj->type == type)
>  		return obj;
> @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
>  		return obj;
>  	}
>  	else {
> -		if (!quiet)
> -			error(_("object %s is a %s, not a %s"),
> -			      oid_to_hex(&obj->oid),
> -			      type_name(obj->type), type_name(type));
> +		if (!(flags & OBJECT_AS_TYPE_QUIET)) {
> +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
> +				error(_("object %s is a %s, but was referred to as a %s"),
> +				      oid_to_hex(&obj->oid), type_name(obj->type),
> +				      type_name(type));
> +			else
> +				error(_("object %s referred to as both a %s and a %s"),
> +				      oid_to_hex(&obj->oid),
> +				      type_name(obj->type), type_name(type));
> +		}
>  		return NULL;
>  	}
>  }

Per the above I don't understand how you think there's any uncertainty
here.

If I'm right and there isn't then first of all I don't see how we could
emit 1/2 of those errors. The whole problem here is that we don't know
the type of the un-parsed object (and presumably don't want to eagerly
know, it would mean hitting the object store).

But when we do know why would we beat around the bush and say "was
referred to as X and Y" once we know what it is. 

AFAICT there's no more reason to think that parse_object_buffer() will
be wrong about the type than "git cat-file -t" will be. They both use
the same underlying functions to get that information.
Junio C Hamano March 31, 2021, 6:41 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

> +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
> +				error(_("object %s is a %s, but was referred to as a %s"),
> +				      oid_to_hex(&obj->oid), type_name(obj->type),
> +				      type_name(type));
> +			else
> +				error(_("object %s referred to as both a %s and a %s"),
> +				      oid_to_hex(&obj->oid),
> +				      type_name(obj->type), type_name(type));
> +		}

Am I correct to understand that the latter is after we read a tree
that refers to an object with 100644 (blob) and then another tree
that refers to the same object with 40000 (tree), before we have a
need/chance to actually find out what that object is?  The error
would trigger while reading the second tree and find the second
mention of the object that conflicts with the earlier one?
Jeff King March 31, 2021, 6:59 p.m. UTC | #6
On Wed, Mar 31, 2021 at 08:31:16PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Ævar's patch tries to improve the case where we might _know_ which is
> > correct (because we're actually parsing the object contents), but of
> > course it covers only a fraction of cases. I'm not really opposed to
> > that per se, but I probably wouldn't bother myself.
> 
> What fraction of cases? As far as I can tell it covers all cases where
> we get this error.
> 
> If there is a case like what you're describing I haven't found it.

It would happen any time somebody calls lookup_foo() because they saw an
object referenced, but _doesn't_ parse it. And then somebody later calls
lookup_bar() in the same way. Neither of them consulted the actual
object database.

Try this with your patches:

-- >8 --
git init repo
cd repo

# just for making things deterministic
export GIT_COMMITTER_NAME='A U Thor'
export GIT_COMMITTER_EMAIL='author@example.com'
export GIT_COMMITTER_DATE='@1234567890 +0000'

blob=$(echo foo | git hash-object -w --stdin)
git tag -m 'tag of blob' tag-of-blob $blob
git update-ref refs/tags/tag-of-commit $(
  git cat-file tag tag-of-blob |
  sed s/blob/commit/g |
  git hash-object -w --stdin -t tag
)
git update-ref refs/tags/tag-of-tree $(
  git cat-file tag tag-of-blob |
  sed s/blob/tree/g |
  git hash-object -w --stdin -t tag
)

git fsck
-- >8 --

That fsck produces (257cc5642 is the blob):

  error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a blob, not a commit
  error: 257cc5642cb1a054f08cc83f2d943e56fd3ebe99: object could not be parsed: .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
  error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree
  error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a
  error: aaff0d42df150e1a734f6a8516878b2ea315ee0a: object could not be parsed: .git/objects/aa/ff0d42df150e1a734f6a8516878b2ea315ee0a
  error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a blob
  error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7
  error: bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7: object could not be parsed: .git/objects/bb/d2b7077cd91ee6175cdc0e4c477c25c230cdc7

So we claim "is X, not Y" in multiple directions for the same object.

It might just be that there are spots in the fsck code that need to be
adjusted to use your new function (if they are indeed parsing the
referred-to object). But there are lots of places that don't actually
parse the object at the moment they're parsing the tag. E.g.:

  $ git for-each-ref --format='%(*objectname)'
  error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree
  error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a
  Segmentation fault

Neither of those types is the correct one. And the segfault is just a
bonus! :)

I'd expect similar cases with parsing commit parents and tree pointers.
And probably tree entries whose modes are wrong.

> I.e. it happens when we have an un-parsed "struct object" whose type is
> inferred, and parse it to find out it's not what we expected.
> 
> It's not ambigious at all what the object actually is. It's just that
> the previous code was leaking the *assumption* about the type at the
> time of emitting the error, due to an apparent oversight with parsed
> v.s. non-parsed.
> 
> Or in other words, we're leaking the implementation detail that we
> pre-allocated an object struct of a given type in anticipation of
> holding a parsed version of that object soon.

Right. In the case that you are indeed parsing the object later, you can
say definitively "it is X in the odb, but seen as Y previously". But we
do not always hit the "is X, not Y" error when parsing the object. It
might be caused by two of these "pre-allocations" (though really I think
it is not just an implementation detail; the pre-allocation happened
because some other object referred to us as a given type, so it really
is a corruption in the repository. Just not in the object we mention).

> > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
> >  		return obj;
> >  	}
> >  	else {
> > -		if (!quiet)
> > -			error(_("object %s is a %s, not a %s"),
> > -			      oid_to_hex(&obj->oid),
> > -			      type_name(obj->type), type_name(type));
> > +		if (!(flags & OBJECT_AS_TYPE_QUIET)) {
> > +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
> > +				error(_("object %s is a %s, but was referred to as a %s"),
> > +				      oid_to_hex(&obj->oid), type_name(obj->type),
> > +				      type_name(type));
> > +			else
> > +				error(_("object %s referred to as both a %s and a %s"),
> > +				      oid_to_hex(&obj->oid),
> > +				      type_name(obj->type), type_name(type));
> > +		}
> >  		return NULL;
> >  	}
> >  }
> 
> Per the above I don't understand how you think there's any uncertainty
> here.
> 
> If I'm right and there isn't then first of all I don't see how we could
> emit 1/2 of those errors. The whole problem here is that we don't know
> the type of the un-parsed object (and presumably don't want to eagerly
> know, it would mean hitting the object store).

Forgetting for a moment how to trigger it with actual Git commands, the
root of the problem is that:

  lookup_tree(&oid);
  lookup_blob(&oid);

is going to produce an error message. But we cannot know which object
type is wrong and which is right (if any). So we'd want to produce the
"referred to as both" message.

_If_ the caller happens to know that it has just parsed the object
contents and got a tree, then it would call lookup_parsed_tree(&oid),
which would pass along OBJECT_AS_TYPE_EXPECT_PARSED, and produce the
other message.

In practice, of course those two lookup_foo() calls are not right next
to each other. But they may be triggered on an identical oid by two
references from different objects.

> But when we do know why would we beat around the bush and say "was
> referred to as X and Y" once we know what it is. 
> 
> AFAICT there's no more reason to think that parse_object_buffer() will
> be wrong about the type than "git cat-file -t" will be. They both use
> the same underlying functions to get that information.

My point is that we are not always coming from parse_object_buffer()
when we see these error messages.

-Peff
Jeff King March 31, 2021, 7 p.m. UTC | #7
On Wed, Mar 31, 2021 at 11:41:18AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
> > +				error(_("object %s is a %s, but was referred to as a %s"),
> > +				      oid_to_hex(&obj->oid), type_name(obj->type),
> > +				      type_name(type));
> > +			else
> > +				error(_("object %s referred to as both a %s and a %s"),
> > +				      oid_to_hex(&obj->oid),
> > +				      type_name(obj->type), type_name(type));
> > +		}
> 
> Am I correct to understand that the latter is after we read a tree
> that refers to an object with 100644 (blob) and then another tree
> that refers to the same object with 40000 (tree), before we have a
> need/chance to actually find out what that object is?  The error
> would trigger while reading the second tree and find the second
> mention of the object that conflicts with the earlier one?

Yes, exactly (or two tags, or a tag and a tree, or a commit and a tree,
etc).

-Peff
Ævar Arnfjörð Bjarmason March 31, 2021, 8:46 p.m. UTC | #8
On Wed, Mar 31 2021, Jeff King wrote:

> On Wed, Mar 31, 2021 at 08:31:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Ævar's patch tries to improve the case where we might _know_ which is
>> > correct (because we're actually parsing the object contents), but of
>> > course it covers only a fraction of cases. I'm not really opposed to
>> > that per se, but I probably wouldn't bother myself.
>> 
>> What fraction of cases? As far as I can tell it covers all cases where
>> we get this error.
>> 
>> If there is a case like what you're describing I haven't found it.
>
> It would happen any time somebody calls lookup_foo() because they saw an
> object referenced, but _doesn't_ parse it. And then somebody later calls
> lookup_bar() in the same way. Neither of them consulted the actual
> object database.
>
> Try this with your patches:
>
> -- >8 --
> git init repo
> cd repo
>
> # just for making things deterministic
> export GIT_COMMITTER_NAME='A U Thor'
> export GIT_COMMITTER_EMAIL='author@example.com'
> export GIT_COMMITTER_DATE='@1234567890 +0000'
>
> blob=$(echo foo | git hash-object -w --stdin)
> git tag -m 'tag of blob' tag-of-blob $blob
> git update-ref refs/tags/tag-of-commit $(
>   git cat-file tag tag-of-blob |
>   sed s/blob/commit/g |
>   git hash-object -w --stdin -t tag
> )
> git update-ref refs/tags/tag-of-tree $(
>   git cat-file tag tag-of-blob |
>   sed s/blob/tree/g |
>   git hash-object -w --stdin -t tag
> )
>
> git fsck
> -- >8 --
>
> That fsck produces (257cc5642 is the blob):
>
>   error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a blob, not a commit
>   error: 257cc5642cb1a054f08cc83f2d943e56fd3ebe99: object could not be parsed: .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
>   error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree
>   error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a
>   error: aaff0d42df150e1a734f6a8516878b2ea315ee0a: object could not be parsed: .git/objects/aa/ff0d42df150e1a734f6a8516878b2ea315ee0a
>   error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a blob
>   error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7
>   error: bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7: object could not be parsed: .git/objects/bb/d2b7077cd91ee6175cdc0e4c477c25c230cdc7
>
> So we claim "is X, not Y" in multiple directions for the same object.
>
> It might just be that there are spots in the fsck code that need to be
> adjusted to use your new function (if they are indeed parsing the
> referred-to object). But there are lots of places that don't actually
> parse the object at the moment they're parsing the tag. E.g.:
>
>   $ git for-each-ref --format='%(*objectname)'
>   error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree
>   error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a
>   Segmentation fault
>
> Neither of those types is the correct one. And the segfault is just a
> bonus! :)
>
> I'd expect similar cases with parsing commit parents and tree pointers.
> And probably tree entries whose modes are wrong.

So the segfault happens without my patches, but the change is that
before we'd always get it wrong and say "commit, not a tree", but now
we'll get it right some of the time. Patching the relevant object.c code
to emit different messages from the various functions shows that it's
the oid_is_type*() functions that get it right, but object_as_type() is
wrong as before.

So that's certainly something I missed.

But are there any cases where it makes things worse? Or is it just that
it's not a full fix in all cases, but only a partial one?

>> I.e. it happens when we have an un-parsed "struct object" whose type is
>> inferred, and parse it to find out it's not what we expected.
>> 
>> It's not ambigious at all what the object actually is. It's just that
>> the previous code was leaking the *assumption* about the type at the
>> time of emitting the error, due to an apparent oversight with parsed
>> v.s. non-parsed.
>> 
>> Or in other words, we're leaking the implementation detail that we
>> pre-allocated an object struct of a given type in anticipation of
>> holding a parsed version of that object soon.
>
> Right. In the case that you are indeed parsing the object later, you can
> say definitively "it is X in the odb, but seen as Y previously". But we
> do not always hit the "is X, not Y" error when parsing the object. It
> might be caused by two of these "pre-allocations" (though really I think
> it is not just an implementation detail; the pre-allocation happened
> because some other object referred to us as a given type, so it really
> is a corruption in the repository. Just not in the object we mention).

Indeed, the goal is to emit a sensible message on-the-fly when we see
that corruption.

>> > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
>> >  		return obj;
>> >  	}
>> >  	else {
>> > -		if (!quiet)
>> > -			error(_("object %s is a %s, not a %s"),
>> > -			      oid_to_hex(&obj->oid),
>> > -			      type_name(obj->type), type_name(type));
>> > +		if (!(flags & OBJECT_AS_TYPE_QUIET)) {
>> > +			if (flags & OBJECT_AS_TYPE_EXPECT_PARSED)
>> > +				error(_("object %s is a %s, but was referred to as a %s"),
>> > +				      oid_to_hex(&obj->oid), type_name(obj->type),
>> > +				      type_name(type));
>> > +			else
>> > +				error(_("object %s referred to as both a %s and a %s"),
>> > +				      oid_to_hex(&obj->oid),
>> > +				      type_name(obj->type), type_name(type));
>> > +		}
>> >  		return NULL;
>> >  	}
>> >  }
>> 
>> Per the above I don't understand how you think there's any uncertainty
>> here.
>> 
>> If I'm right and there isn't then first of all I don't see how we could
>> emit 1/2 of those errors. The whole problem here is that we don't know
>> the type of the un-parsed object (and presumably don't want to eagerly
>> know, it would mean hitting the object store).
>
> Forgetting for a moment how to trigger it with actual Git commands, the
> root of the problem is that:
>
>   lookup_tree(&oid);
>   lookup_blob(&oid);
>
> is going to produce an error message. But we cannot know which object
> type is wrong and which is right (if any). So we'd want to produce the
> "referred to as both" message.
>
> _If_ the caller happens to know that it has just parsed the object
> contents and got a tree, then it would call lookup_parsed_tree(&oid),
> which would pass along OBJECT_AS_TYPE_EXPECT_PARSED, and produce the
> other message.
>
> In practice, of course those two lookup_foo() calls are not right next
> to each other. But they may be triggered on an identical oid by two
> references from different objects.

[...]

>> But when we do know why would we beat around the bush and say "was
>> referred to as X and Y" once we know what it is. 
>> 
>> AFAICT there's no more reason to think that parse_object_buffer() will
>> be wrong about the type than "git cat-file -t" will be. They both use
>> the same underlying functions to get that information.
>
> My point is that we are not always coming from parse_object_buffer()
> when we see these error messages.

If my solution of relying on the parsed v.s. non-parsed shouldn't we
just devolve to a full object info lookup when emitting the error? It's
more expensive, but we're emitting an error anyway...
Jeff King April 1, 2021, 7:54 a.m. UTC | #9
On Wed, Mar 31, 2021 at 10:46:22PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Neither of those types is the correct one. And the segfault is just a
> > bonus! :)
> >
> > I'd expect similar cases with parsing commit parents and tree pointers.
> > And probably tree entries whose modes are wrong.
> 
> So the segfault happens without my patches,

Yeah, sorry if that was unclear. It is definitely a pre-existing bug.

> but the change is that
> before we'd always get it wrong and say "commit, not a tree", but now
> we'll get it right some of the time. Patching the relevant object.c code
> to emit different messages from the various functions shows that it's
> the oid_is_type*() functions that get it right, but object_as_type() is
> wrong as before.
> 
> So that's certainly something I missed.
> 
> But are there any cases where it makes things worse? Or is it just that
> it's not a full fix in all cases, but only a partial one?

Right, I don't think your patch is making anything worse. It's just that
it does not cover all cases where we see an object as two different
types. Nor can it, since it is relying on code paths that actually parse
the object, and not all of them do.

> > My point is that we are not always coming from parse_object_buffer()
> > when we see these error messages.
> 
> If my solution of relying on the parsed v.s. non-parsed shouldn't we
> just devolve to a full object info lookup when emitting the error? It's
> more expensive, but we're emitting an error anyway...

That's certainly one option (that I suggested earlier in [0]). If we go
that route, then we do not need any of this "the caller passes in an
extra bit to say that it is parsing the object, and it found a tree",
because the error routine in object_as_type() would consult the odb
itself.

But I still think it does not make the error messages fully useful. We
might say "object X is really a tree in the odb, but we previously saw
it as a commit". But we will still have to return NULL from
lookup_tree(), so whatever containing object referenced X, _even though
it has the correct type_, will be the one to propagate the failure up
the stack. It was whoever was responsible for that "previously saw" that
is actually corrupt, and we no longer know who that was.

Which is why I wonder if it is worth even bothering to put a lot of
effort in here. If the issue is just that "X is a foo, not a bar" is
sometimes misleading, then we could solve that by simply making the
message more precise ("we saw X as a foo and a bar; one of them is
wrong"). Even if we could know _which_ is wrong with respect to what's
in the object contents, it isn't all that helpful without being able to
tell the user which object reference was the one that led us to the
wrong conclusion.

-Peff

[0] https://lore.kernel.org/git/YGBHH7sAVsPpVKWd@coredump.intra.peff.net/
diff mbox series

Patch

diff --git a/blob.c b/blob.c
index 182718aba9f..b233e0daa2f 100644
--- a/blob.c
+++ b/blob.c
@@ -2,17 +2,31 @@ 
 #include "blob.h"
 #include "repository.h"
 #include "alloc.h"
+#include "object-store.h"
 
 const char *blob_type = "blob";
 
-struct blob *lookup_blob(struct repository *r, const struct object_id *oid)
+struct blob *lookup_blob_type(struct repository *r,
+			      const struct object_id *oid,
+			      enum object_type type)
 {
 	struct object *obj = lookup_object(r, oid);
 	if (!obj)
 		return create_object(r, oid, alloc_blob_node(r));
+	if (type != OBJ_NONE &&
+	    obj->type != OBJ_NONE) {
+		enum object_type want = OBJ_BLOB;
+		if (oid_is_type_or_error(oid, obj->type, &want))
+			return NULL;
+	}
 	return object_as_type(obj, OBJ_BLOB, 0);
 }
 
+struct blob *lookup_blob(struct repository *r, const struct object_id *oid)
+{
+	return lookup_blob_type(r, oid, OBJ_NONE);
+}
+
 int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size)
 {
 	item->object.parsed = 1;
diff --git a/blob.h b/blob.h
index 16648720557..066a2effcbf 100644
--- a/blob.h
+++ b/blob.h
@@ -10,6 +10,9 @@  struct blob {
 };
 
 struct blob *lookup_blob(struct repository *r, const struct object_id *oid);
+struct blob *lookup_blob_type(struct repository *r,
+			      const struct object_id *oid,
+			      enum object_type type);
 
 int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
 
diff --git a/commit.c b/commit.c
index b3701003678..ab6cee1e8c3 100644
--- a/commit.c
+++ b/commit.c
@@ -57,14 +57,26 @@  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref
 	return c;
 }
 
-struct commit *lookup_commit(struct repository *r, const struct object_id *oid)
+struct commit *lookup_commit_type(struct repository *r, const struct object_id *oid,
+				  enum object_type type)
 {
 	struct object *obj = lookup_object(r, oid);
 	if (!obj)
 		return create_object(r, oid, alloc_commit_node(r));
+	if (type != OBJ_NONE &&
+	    obj->type != OBJ_NONE) {
+		enum object_type want = OBJ_COMMIT;
+		if (oid_is_type_or_error(oid, obj->type, &want))
+			return NULL;
+	}
 	return object_as_type(obj, OBJ_COMMIT, 0);
 }
 
+struct commit *lookup_commit(struct repository *r, const struct object_id *oid)
+{
+	return lookup_commit_type(r, oid, OBJ_NONE);
+}
+
 struct commit *lookup_commit_reference_by_name(const char *name)
 {
 	struct object_id oid;
diff --git a/commit.h b/commit.h
index df42eb434f3..9def4f3f19d 100644
--- a/commit.h
+++ b/commit.h
@@ -64,6 +64,8 @@  void add_name_decoration(enum decoration_type type, const char *name, struct obj
 const struct name_decoration *get_name_decoration(const struct object *obj);
 
 struct commit *lookup_commit(struct repository *r, const struct object_id *oid);
+struct commit *lookup_commit_type(struct repository *r, const struct object_id *oid,
+				  enum object_type type);
 struct commit *lookup_commit_reference(struct repository *r,
 				       const struct object_id *oid);
 struct commit *lookup_commit_reference_gently(struct repository *r,
diff --git a/object.c b/object.c
index 0f60743e61f..60037422488 100644
--- a/object.c
+++ b/object.c
@@ -230,14 +230,14 @@  struct object *parse_object_buffer(struct repository *r, const struct object_id
 
 	obj = NULL;
 	if (type == OBJ_BLOB) {
-		struct blob *blob = lookup_blob(r, oid);
+		struct blob *blob = lookup_blob_type(r, oid, type);
 		if (blob) {
 			if (parse_blob_buffer(blob, buffer, size))
 				return NULL;
 			obj = &blob->object;
 		}
 	} else if (type == OBJ_TREE) {
-		struct tree *tree = lookup_tree(r, oid);
+		struct tree *tree = lookup_tree_type(r, oid, type);
 		if (tree) {
 			obj = &tree->object;
 			if (!tree->buffer)
@@ -249,7 +249,7 @@  struct object *parse_object_buffer(struct repository *r, const struct object_id
 			}
 		}
 	} else if (type == OBJ_COMMIT) {
-		struct commit *commit = lookup_commit(r, oid);
+		struct commit *commit = lookup_commit_type(r, oid, type);
 		if (commit) {
 			if (parse_commit_buffer(r, commit, buffer, size, 1))
 				return NULL;
@@ -260,7 +260,7 @@  struct object *parse_object_buffer(struct repository *r, const struct object_id
 			obj = &commit->object;
 		}
 	} else if (type == OBJ_TAG) {
-		struct tag *tag = lookup_tag(r, oid);
+		struct tag *tag = lookup_tag_type(r, oid, type);
 		if (tag) {
 			if (parse_tag_buffer(r, tag, buffer, size))
 			       return NULL;
diff --git a/t/t6102-rev-list-unexpected-objects.sh b/t/t6102-rev-list-unexpected-objects.sh
index 2ea1982b9ed..4a6b3cc3b01 100755
--- a/t/t6102-rev-list-unexpected-objects.sh
+++ b/t/t6102-rev-list-unexpected-objects.sh
@@ -156,14 +156,14 @@  test_expect_success 'setup unexpected non-tag tag' '
 test_expect_success 'traverse unexpected incorrectly typed tag (to commit & tag)' '
 	test_must_fail git rev-list --objects $tag_tag_commit 2>err &&
 	cat >expected <<-EOF &&
-	error: object $commit is a tag, not a commit
+	error: object $commit is a commit, not a tag
 	fatal: bad object $commit
 	EOF
 	test_cmp expected err &&
 
 	test_must_fail git rev-list --objects $commit_tag_tag 2>err &&
 	cat >expected <<-EOF &&
-	error: object $tag_commit is a commit, not a tag
+	error: object $tag_commit is a tag, not a commit
 	fatal: bad object $tag_commit
 	EOF
 	test_cmp expected err
@@ -172,14 +172,14 @@  test_expect_success 'traverse unexpected incorrectly typed tag (to commit & tag)
 test_expect_success 'traverse unexpected incorrectly typed tag (to tree)' '
 	test_must_fail git rev-list --objects $tag_tag_tree 2>err &&
 	cat >expected <<-EOF &&
-	error: object $tree is a tag, not a tree
+	error: object $tree is a tree, not a tag
 	fatal: bad object $tree
 	EOF
 	test_cmp expected err &&
 
 	test_must_fail git rev-list --objects $commit_tag_tree 2>err &&
 	cat >expected <<-EOF &&
-	error: object $tree is a commit, not a tree
+	error: object $tree is a tree, not a commit
 	fatal: bad object $tree
 	EOF
 	test_cmp expected err
@@ -188,14 +188,14 @@  test_expect_success 'traverse unexpected incorrectly typed tag (to tree)' '
 test_expect_success 'traverse unexpected incorrectly typed tag (to blob)' '
 	test_must_fail git rev-list --objects $tag_tag_blob 2>err &&
 	cat >expected <<-EOF &&
-	error: object $blob is a tag, not a blob
+	error: object $blob is a blob, not a tag
 	fatal: bad object $blob
 	EOF
 	test_cmp expected err &&
 
 	test_must_fail git rev-list --objects $commit_tag_blob 2>err &&
 	cat >expected <<-EOF &&
-	error: object $blob is a commit, not a blob
+	error: object $blob is a blob, not a commit
 	fatal: bad object $blob
 	EOF
 	test_cmp expected err
@@ -204,14 +204,14 @@  test_expect_success 'traverse unexpected incorrectly typed tag (to blob)' '
 test_expect_success 'traverse unexpected non-tag tag (tree seen to blob)' '
 	test_must_fail git rev-list --objects $tree $commit_tag_blob 2>err &&
 	cat >expected <<-EOF &&
-	error: object $blob is a commit, not a blob
+	error: object $blob is a blob, not a commit
 	fatal: bad object $blob
 	EOF
 	test_cmp expected err &&
 
 	test_must_fail git rev-list --objects $tree $tag_tag_blob 2>err &&
 	cat >expected <<-EOF &&
-	error: object $blob is a tag, not a blob
+	error: object $blob is a blob, not a tag
 	fatal: bad object $blob
 	EOF
 	test_cmp expected err
diff --git a/tag.c b/tag.c
index 3e18a418414..0ef87897b29 100644
--- a/tag.c
+++ b/tag.c
@@ -99,14 +99,26 @@  struct object *deref_tag_noverify(struct object *o)
 	return o;
 }
 
-struct tag *lookup_tag(struct repository *r, const struct object_id *oid)
+struct tag *lookup_tag_type(struct repository *r, const struct object_id *oid,
+			    enum object_type type)
 {
 	struct object *obj = lookup_object(r, oid);
 	if (!obj)
 		return create_object(r, oid, alloc_tag_node(r));
+	if (type != OBJ_NONE &&
+	    obj->type != OBJ_NONE) {
+		enum object_type want = OBJ_TAG;
+		if (oid_is_type_or_error(oid, obj->type, &want))
+			return NULL;
+	}
 	return object_as_type(obj, OBJ_TAG, 0);
 }
 
+struct tag *lookup_tag(struct repository *r, const struct object_id *oid)
+{
+	return lookup_tag_type(r, oid, OBJ_NONE);
+}
+
 static timestamp_t parse_tag_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
diff --git a/tag.h b/tag.h
index 3ce8e721924..42bd3e64011 100644
--- a/tag.h
+++ b/tag.h
@@ -12,6 +12,8 @@  struct tag {
 	timestamp_t date;
 };
 struct tag *lookup_tag(struct repository *r, const struct object_id *oid);
+struct tag *lookup_tag_type(struct repository *r, const struct object_id *oid,
+			    enum object_type type);
 int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, unsigned long size);
 int parse_tag(struct tag *item);
 void release_tag_memory(struct tag *t);
diff --git a/tree.c b/tree.c
index d9b1c70b28a..895c66420e8 100644
--- a/tree.c
+++ b/tree.c
@@ -195,14 +195,26 @@  int read_tree(struct repository *r, struct tree *tree, int stage,
 	return 0;
 }
 
-struct tree *lookup_tree(struct repository *r, const struct object_id *oid)
+struct tree *lookup_tree_type(struct repository *r, const struct object_id *oid,
+			      enum object_type type)
 {
 	struct object *obj = lookup_object(r, oid);
 	if (!obj)
 		return create_object(r, oid, alloc_tree_node(r));
+	if (type != OBJ_NONE &&
+	    obj->type != OBJ_NONE) {
+		enum object_type want = OBJ_TREE;
+		if (oid_is_type_or_error(oid, obj->type, &want))
+			return NULL;
+	}
 	return object_as_type(obj, OBJ_TREE, 0);
 }
 
+struct tree *lookup_tree(struct repository *r, const struct object_id *oid)
+{
+	return lookup_tree_type(r, oid, OBJ_NONE);
+}
+
 int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size)
 {
 	if (item->object.parsed)
diff --git a/tree.h b/tree.h
index 3eb0484cbf2..49bd44f79b3 100644
--- a/tree.h
+++ b/tree.h
@@ -15,6 +15,8 @@  struct tree {
 extern const char *tree_type;
 
 struct tree *lookup_tree(struct repository *r, const struct object_id *oid);
+struct tree *lookup_tree_type(struct repository *r, const struct object_id *oid,
+			      enum object_type type);
 
 int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size);