Message ID | patch-10.11-a84f670ac24-20210328T021238Z-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | improve reporting of unexpected objects | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent > objects, 2005-06-21) (yes, that ancient!) and correctly report an > error on a tag like: > > object <a tree hash> > type commit > > As: > > error: object <a tree hash> is tree, not a commit > > Instead of our long-standing misbehavior of inverting the two, and > reporting: > > error: object <a tree hash> is commit, not a tree > > Which, as can be trivially seen with 'git cat-file -t <a tree hash>' > is incorrect. Hmph, I've always thought it is just "supposed to be a" missing in the sentence ;-) > Hence the non-intuitive solution of adding a > lookup_{blob,commit,tag,tree}_type() function. It's to distinguish > calls from parse_object_buffer() where we actually know the type, from > a parse_tag_buffer() where we're just guessing about the type. I think it makes sense to allow the caller to express distinction between "I know that this object is a blob, because I just read its object header" and "Another object tells me that this object must be a blob, because it is in a tree entry whose mode bits are 100644". I wish we found a set of names better than lookup_<type>_type() for that, though. It's just between lookup_tag_type(r, oid, OBJ_NONE); lookup_tag_type(r, oid, OBJ_TAG); I cannot quite tell which one is which. I also wonder if the last arg should just be a boolean ("I know it is a tag" vs "I heard it must be a tag").
On Mon, Mar 29, 2021 at 10:50:18PM -0700, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > > > Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent > > objects, 2005-06-21) (yes, that ancient!) and correctly report an > > error on a tag like: > > > > object <a tree hash> > > type commit > > > > As: > > > > error: object <a tree hash> is tree, not a commit > > > > Instead of our long-standing misbehavior of inverting the two, and > > reporting: > > > > error: object <a tree hash> is commit, not a tree > > > > Which, as can be trivially seen with 'git cat-file -t <a tree hash>' > > is incorrect. > > Hmph, I've always thought it is just "supposed to be a" missing in > the sentence ;-) So going with the discussion elsewhere in the thread, I'd probably say something like: error: object <oid> seen as both a commit and a tree which precisely says what we do know, without implying which is correct. Ævar's patch tries to improve the case where we might _know_ which is correct (because we're actually parsing the object contents), but of course it covers only a fraction of cases. I'm not really opposed to that per se, but I probably wouldn't bother myself. Side note: this is all making the assumption that what is in the object itself is "correct", but of course that is not necessarily true, even. All of these cases are the result of bugs, so it is possible that the bug was in the writing of the original object contents, and not the object that is referring to it. Likewise, I'd imagine an easy way to get into this situation is with a bogus refs/replace object that switches type. > > Hence the non-intuitive solution of adding a > > lookup_{blob,commit,tag,tree}_type() function. It's to distinguish > > calls from parse_object_buffer() where we actually know the type, from > > a parse_tag_buffer() where we're just guessing about the type. > > I think it makes sense to allow the caller to express distinction > between "I know that this object is a blob, because I just read its > object header" and "Another object tells me that this object must be > a blob, because it is in a tree entry whose mode bits are 100644". > > I wish we found a set of names better than lookup_<type>_type() for > that, though. It's just between > > lookup_tag_type(r, oid, OBJ_NONE); > lookup_tag_type(r, oid, OBJ_TAG); > > I cannot quite tell which one is which. I also wonder if the last > arg should just be a boolean ("I know it is a tag" vs "I heard it > must be a tag"). Yeah, I also found that very confusing. AFAICT lookup_tag_type() would only ever see OBJ_NONE or OBJ_TAG. Making it more than a boolean makes both the interface and implementation more complicated. I also think the manual handling of OBJ_NONE in each lookup_* function is confusing. They all call object_as_type() because the point of that function is both to type-check the struct and to convert it away from OBJ_NONE. If we handled this error there, then I think it would be much more natural, because we'd have already covered the OBJ_NONE case, and because it's already the place we're emitting the existing error. E.g.: diff --git a/object.c b/object.c index 2c32691dc4..e6345541f7 100644 --- a/object.c +++ b/object.c @@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o) return obj; } -void *object_as_type(struct object *obj, enum object_type type, int quiet) +void *object_as_type(struct object *obj, enum object_type type, unsigned flags) { if (obj->type == type) return obj; @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) return obj; } else { - if (!quiet) - error(_("object %s is a %s, not a %s"), - oid_to_hex(&obj->oid), - type_name(obj->type), type_name(type)); + if (!(flags & OBJECT_AS_TYPE_QUIET)) { + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) + error(_("object %s is a %s, but was referred to as a %s"), + oid_to_hex(&obj->oid), type_name(obj->type), + type_name(type)); + else + error(_("object %s referred to as both a %s and a %s"), + oid_to_hex(&obj->oid), + type_name(obj->type), type_name(type)); + } return NULL; } } -Peff
Jeff King <peff@peff.net> writes: > I also think the manual handling of OBJ_NONE in each lookup_* function > is confusing. They all call object_as_type() because the point of that > function is both to type-check the struct and to convert it away from > OBJ_NONE. > > If we handled this error there, then I think it would be much more > natural, because we'd have already covered the OBJ_NONE case, and > because it's already the place we're emitting the existing error. E.g.: This makes quite a lot of sense. If presented with this simple change and 10-patch series at the same time and are told that the goal of the changes were more or less the same, I'd pick this one 100% of the time. > diff --git a/object.c b/object.c > index 2c32691dc4..e6345541f7 100644 > --- a/object.c > +++ b/object.c > @@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o) > return obj; > } > > -void *object_as_type(struct object *obj, enum object_type type, int quiet) > +void *object_as_type(struct object *obj, enum object_type type, unsigned flags) > { > if (obj->type == type) > return obj; > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) > return obj; > } > else { > - if (!quiet) > - error(_("object %s is a %s, not a %s"), > - oid_to_hex(&obj->oid), > - type_name(obj->type), type_name(type)); > + if (!(flags & OBJECT_AS_TYPE_QUIET)) { > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) > + error(_("object %s is a %s, but was referred to as a %s"), > + oid_to_hex(&obj->oid), type_name(obj->type), > + type_name(type)); > + else > + error(_("object %s referred to as both a %s and a %s"), > + oid_to_hex(&obj->oid), > + type_name(obj->type), type_name(type)); > + } > return NULL; > } > } > > -Peff
On Wed, Mar 31 2021, Jeff King wrote: > On Mon, Mar 29, 2021 at 10:50:18PM -0700, Junio C Hamano wrote: > >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> >> > Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent >> > objects, 2005-06-21) (yes, that ancient!) and correctly report an >> > error on a tag like: >> > >> > object <a tree hash> >> > type commit >> > >> > As: >> > >> > error: object <a tree hash> is tree, not a commit >> > >> > Instead of our long-standing misbehavior of inverting the two, and >> > reporting: >> > >> > error: object <a tree hash> is commit, not a tree >> > >> > Which, as can be trivially seen with 'git cat-file -t <a tree hash>' >> > is incorrect. >> >> Hmph, I've always thought it is just "supposed to be a" missing in >> the sentence ;-) > > So going with the discussion elsewhere in the thread, I'd probably say > something like: > > error: object <oid> seen as both a commit and a tree > > which precisely says what we do know, without implying which is correct. > > Ævar's patch tries to improve the case where we might _know_ which is > correct (because we're actually parsing the object contents), but of > course it covers only a fraction of cases. I'm not really opposed to > that per se, but I probably wouldn't bother myself. What fraction of cases? As far as I can tell it covers all cases where we get this error. If there is a case like what you're describing I haven't found it. I.e. it happens when we have an un-parsed "struct object" whose type is inferred, and parse it to find out it's not what we expected. It's not ambigious at all what the object actually is. It's just that the previous code was leaking the *assumption* about the type at the time of emitting the error, due to an apparent oversight with parsed v.s. non-parsed. Or in other words, we're leaking the implementation detail that we pre-allocated an object struct of a given type in anticipation of holding a parsed version of that object soon. > Side note: this is all making the assumption that what is in the > object itself is "correct", but of course that is not necessarily > true, even. All of these cases are the result of bugs, so it is > possible that the bug was in the writing of the original object > contents, and not the object that is referring to it. Likewise, I'd > imagine an easy way to get into this situation is with a bogus > refs/replace object that switches type. Perhaps, I haven't tested that in any detail. >> > Hence the non-intuitive solution of adding a >> > lookup_{blob,commit,tag,tree}_type() function. It's to distinguish >> > calls from parse_object_buffer() where we actually know the type, from >> > a parse_tag_buffer() where we're just guessing about the type. >> >> I think it makes sense to allow the caller to express distinction >> between "I know that this object is a blob, because I just read its >> object header" and "Another object tells me that this object must be >> a blob, because it is in a tree entry whose mode bits are 100644". >> >> I wish we found a set of names better than lookup_<type>_type() for >> that, though. It's just between >> >> lookup_tag_type(r, oid, OBJ_NONE); >> lookup_tag_type(r, oid, OBJ_TAG); >> >> I cannot quite tell which one is which. I also wonder if the last >> arg should just be a boolean ("I know it is a tag" vs "I heard it >> must be a tag"). > > Yeah, I also found that very confusing. AFAICT lookup_tag_type() would > only ever see OBJ_NONE or OBJ_TAG. Making it more than a boolean makes > both the interface and implementation more complicated. I don't feel strongly either way, but one concern here is that these are very hot functions, and maybe it's better to give the compiler a better chance to work with them without considering an extra argument, but I haven't tested that... > I also think the manual handling of OBJ_NONE in each lookup_* function > is confusing. They all call object_as_type() because the point of that > function is both to type-check the struct and to convert it away from > OBJ_NONE. > > If we handled this error there, then I think it would be much more > natural, because we'd have already covered the OBJ_NONE case, and > because it's already the place we're emitting the existing error. E.g.: > > diff --git a/object.c b/object.c > index 2c32691dc4..e6345541f7 100644 > --- a/object.c > +++ b/object.c > @@ -157,7 +157,7 @@ void *create_object(struct repository *r, const struct object_id *oid, void *o) > return obj; > } > > -void *object_as_type(struct object *obj, enum object_type type, int quiet) > +void *object_as_type(struct object *obj, enum object_type type, unsigned flags) > { > if (obj->type == type) > return obj; > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) > return obj; > } > else { > - if (!quiet) > - error(_("object %s is a %s, not a %s"), > - oid_to_hex(&obj->oid), > - type_name(obj->type), type_name(type)); > + if (!(flags & OBJECT_AS_TYPE_QUIET)) { > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) > + error(_("object %s is a %s, but was referred to as a %s"), > + oid_to_hex(&obj->oid), type_name(obj->type), > + type_name(type)); > + else > + error(_("object %s referred to as both a %s and a %s"), > + oid_to_hex(&obj->oid), > + type_name(obj->type), type_name(type)); > + } > return NULL; > } > } Per the above I don't understand how you think there's any uncertainty here. If I'm right and there isn't then first of all I don't see how we could emit 1/2 of those errors. The whole problem here is that we don't know the type of the un-parsed object (and presumably don't want to eagerly know, it would mean hitting the object store). But when we do know why would we beat around the bush and say "was referred to as X and Y" once we know what it is. AFAICT there's no more reason to think that parse_object_buffer() will be wrong about the type than "git cat-file -t" will be. They both use the same underlying functions to get that information.
Jeff King <peff@peff.net> writes: > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) > + error(_("object %s is a %s, but was referred to as a %s"), > + oid_to_hex(&obj->oid), type_name(obj->type), > + type_name(type)); > + else > + error(_("object %s referred to as both a %s and a %s"), > + oid_to_hex(&obj->oid), > + type_name(obj->type), type_name(type)); > + } Am I correct to understand that the latter is after we read a tree that refers to an object with 100644 (blob) and then another tree that refers to the same object with 40000 (tree), before we have a need/chance to actually find out what that object is? The error would trigger while reading the second tree and find the second mention of the object that conflicts with the earlier one?
On Wed, Mar 31, 2021 at 08:31:16PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Ævar's patch tries to improve the case where we might _know_ which is > > correct (because we're actually parsing the object contents), but of > > course it covers only a fraction of cases. I'm not really opposed to > > that per se, but I probably wouldn't bother myself. > > What fraction of cases? As far as I can tell it covers all cases where > we get this error. > > If there is a case like what you're describing I haven't found it. It would happen any time somebody calls lookup_foo() because they saw an object referenced, but _doesn't_ parse it. And then somebody later calls lookup_bar() in the same way. Neither of them consulted the actual object database. Try this with your patches: -- >8 -- git init repo cd repo # just for making things deterministic export GIT_COMMITTER_NAME='A U Thor' export GIT_COMMITTER_EMAIL='author@example.com' export GIT_COMMITTER_DATE='@1234567890 +0000' blob=$(echo foo | git hash-object -w --stdin) git tag -m 'tag of blob' tag-of-blob $blob git update-ref refs/tags/tag-of-commit $( git cat-file tag tag-of-blob | sed s/blob/commit/g | git hash-object -w --stdin -t tag ) git update-ref refs/tags/tag-of-tree $( git cat-file tag tag-of-blob | sed s/blob/tree/g | git hash-object -w --stdin -t tag ) git fsck -- >8 -- That fsck produces (257cc5642 is the blob): error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a blob, not a commit error: 257cc5642cb1a054f08cc83f2d943e56fd3ebe99: object could not be parsed: .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a error: aaff0d42df150e1a734f6a8516878b2ea315ee0a: object could not be parsed: .git/objects/aa/ff0d42df150e1a734f6a8516878b2ea315ee0a error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a blob error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7 error: bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7: object could not be parsed: .git/objects/bb/d2b7077cd91ee6175cdc0e4c477c25c230cdc7 So we claim "is X, not Y" in multiple directions for the same object. It might just be that there are spots in the fsck code that need to be adjusted to use your new function (if they are indeed parsing the referred-to object). But there are lots of places that don't actually parse the object at the moment they're parsing the tag. E.g.: $ git for-each-ref --format='%(*objectname)' error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a Segmentation fault Neither of those types is the correct one. And the segfault is just a bonus! :) I'd expect similar cases with parsing commit parents and tree pointers. And probably tree entries whose modes are wrong. > I.e. it happens when we have an un-parsed "struct object" whose type is > inferred, and parse it to find out it's not what we expected. > > It's not ambigious at all what the object actually is. It's just that > the previous code was leaking the *assumption* about the type at the > time of emitting the error, due to an apparent oversight with parsed > v.s. non-parsed. > > Or in other words, we're leaking the implementation detail that we > pre-allocated an object struct of a given type in anticipation of > holding a parsed version of that object soon. Right. In the case that you are indeed parsing the object later, you can say definitively "it is X in the odb, but seen as Y previously". But we do not always hit the "is X, not Y" error when parsing the object. It might be caused by two of these "pre-allocations" (though really I think it is not just an implementation detail; the pre-allocation happened because some other object referred to us as a given type, so it really is a corruption in the repository. Just not in the object we mention). > > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) > > return obj; > > } > > else { > > - if (!quiet) > > - error(_("object %s is a %s, not a %s"), > > - oid_to_hex(&obj->oid), > > - type_name(obj->type), type_name(type)); > > + if (!(flags & OBJECT_AS_TYPE_QUIET)) { > > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) > > + error(_("object %s is a %s, but was referred to as a %s"), > > + oid_to_hex(&obj->oid), type_name(obj->type), > > + type_name(type)); > > + else > > + error(_("object %s referred to as both a %s and a %s"), > > + oid_to_hex(&obj->oid), > > + type_name(obj->type), type_name(type)); > > + } > > return NULL; > > } > > } > > Per the above I don't understand how you think there's any uncertainty > here. > > If I'm right and there isn't then first of all I don't see how we could > emit 1/2 of those errors. The whole problem here is that we don't know > the type of the un-parsed object (and presumably don't want to eagerly > know, it would mean hitting the object store). Forgetting for a moment how to trigger it with actual Git commands, the root of the problem is that: lookup_tree(&oid); lookup_blob(&oid); is going to produce an error message. But we cannot know which object type is wrong and which is right (if any). So we'd want to produce the "referred to as both" message. _If_ the caller happens to know that it has just parsed the object contents and got a tree, then it would call lookup_parsed_tree(&oid), which would pass along OBJECT_AS_TYPE_EXPECT_PARSED, and produce the other message. In practice, of course those two lookup_foo() calls are not right next to each other. But they may be triggered on an identical oid by two references from different objects. > But when we do know why would we beat around the bush and say "was > referred to as X and Y" once we know what it is. > > AFAICT there's no more reason to think that parse_object_buffer() will > be wrong about the type than "git cat-file -t" will be. They both use > the same underlying functions to get that information. My point is that we are not always coming from parse_object_buffer() when we see these error messages. -Peff
On Wed, Mar 31, 2021 at 11:41:18AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) > > + error(_("object %s is a %s, but was referred to as a %s"), > > + oid_to_hex(&obj->oid), type_name(obj->type), > > + type_name(type)); > > + else > > + error(_("object %s referred to as both a %s and a %s"), > > + oid_to_hex(&obj->oid), > > + type_name(obj->type), type_name(type)); > > + } > > Am I correct to understand that the latter is after we read a tree > that refers to an object with 100644 (blob) and then another tree > that refers to the same object with 40000 (tree), before we have a > need/chance to actually find out what that object is? The error > would trigger while reading the second tree and find the second > mention of the object that conflicts with the earlier one? Yes, exactly (or two tags, or a tag and a tree, or a commit and a tree, etc). -Peff
On Wed, Mar 31 2021, Jeff King wrote: > On Wed, Mar 31, 2021 at 08:31:16PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > Ævar's patch tries to improve the case where we might _know_ which is >> > correct (because we're actually parsing the object contents), but of >> > course it covers only a fraction of cases. I'm not really opposed to >> > that per se, but I probably wouldn't bother myself. >> >> What fraction of cases? As far as I can tell it covers all cases where >> we get this error. >> >> If there is a case like what you're describing I haven't found it. > > It would happen any time somebody calls lookup_foo() because they saw an > object referenced, but _doesn't_ parse it. And then somebody later calls > lookup_bar() in the same way. Neither of them consulted the actual > object database. > > Try this with your patches: > > -- >8 -- > git init repo > cd repo > > # just for making things deterministic > export GIT_COMMITTER_NAME='A U Thor' > export GIT_COMMITTER_EMAIL='author@example.com' > export GIT_COMMITTER_DATE='@1234567890 +0000' > > blob=$(echo foo | git hash-object -w --stdin) > git tag -m 'tag of blob' tag-of-blob $blob > git update-ref refs/tags/tag-of-commit $( > git cat-file tag tag-of-blob | > sed s/blob/commit/g | > git hash-object -w --stdin -t tag > ) > git update-ref refs/tags/tag-of-tree $( > git cat-file tag tag-of-blob | > sed s/blob/tree/g | > git hash-object -w --stdin -t tag > ) > > git fsck > -- >8 -- > > That fsck produces (257cc5642 is the blob): > > error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a blob, not a commit > error: 257cc5642cb1a054f08cc83f2d943e56fd3ebe99: object could not be parsed: .git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 > error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree > error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a > error: aaff0d42df150e1a734f6a8516878b2ea315ee0a: object could not be parsed: .git/objects/aa/ff0d42df150e1a734f6a8516878b2ea315ee0a > error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a blob > error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7 > error: bbd2b7077cd91ee6175cdc0e4c477c25c230cdc7: object could not be parsed: .git/objects/bb/d2b7077cd91ee6175cdc0e4c477c25c230cdc7 > > So we claim "is X, not Y" in multiple directions for the same object. > > It might just be that there are spots in the fsck code that need to be > adjusted to use your new function (if they are indeed parsing the > referred-to object). But there are lots of places that don't actually > parse the object at the moment they're parsing the tag. E.g.: > > $ git for-each-ref --format='%(*objectname)' > error: object 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 is a commit, not a tree > error: bad tag pointer to 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 in aaff0d42df150e1a734f6a8516878b2ea315ee0a > Segmentation fault > > Neither of those types is the correct one. And the segfault is just a > bonus! :) > > I'd expect similar cases with parsing commit parents and tree pointers. > And probably tree entries whose modes are wrong. So the segfault happens without my patches, but the change is that before we'd always get it wrong and say "commit, not a tree", but now we'll get it right some of the time. Patching the relevant object.c code to emit different messages from the various functions shows that it's the oid_is_type*() functions that get it right, but object_as_type() is wrong as before. So that's certainly something I missed. But are there any cases where it makes things worse? Or is it just that it's not a full fix in all cases, but only a partial one? >> I.e. it happens when we have an un-parsed "struct object" whose type is >> inferred, and parse it to find out it's not what we expected. >> >> It's not ambigious at all what the object actually is. It's just that >> the previous code was leaking the *assumption* about the type at the >> time of emitting the error, due to an apparent oversight with parsed >> v.s. non-parsed. >> >> Or in other words, we're leaking the implementation detail that we >> pre-allocated an object struct of a given type in anticipation of >> holding a parsed version of that object soon. > > Right. In the case that you are indeed parsing the object later, you can > say definitively "it is X in the odb, but seen as Y previously". But we > do not always hit the "is X, not Y" error when parsing the object. It > might be caused by two of these "pre-allocations" (though really I think > it is not just an implementation detail; the pre-allocation happened > because some other object referred to us as a given type, so it really > is a corruption in the repository. Just not in the object we mention). Indeed, the goal is to emit a sensible message on-the-fly when we see that corruption. >> > @@ -169,10 +169,16 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) >> > return obj; >> > } >> > else { >> > - if (!quiet) >> > - error(_("object %s is a %s, not a %s"), >> > - oid_to_hex(&obj->oid), >> > - type_name(obj->type), type_name(type)); >> > + if (!(flags & OBJECT_AS_TYPE_QUIET)) { >> > + if (flags & OBJECT_AS_TYPE_EXPECT_PARSED) >> > + error(_("object %s is a %s, but was referred to as a %s"), >> > + oid_to_hex(&obj->oid), type_name(obj->type), >> > + type_name(type)); >> > + else >> > + error(_("object %s referred to as both a %s and a %s"), >> > + oid_to_hex(&obj->oid), >> > + type_name(obj->type), type_name(type)); >> > + } >> > return NULL; >> > } >> > } >> >> Per the above I don't understand how you think there's any uncertainty >> here. >> >> If I'm right and there isn't then first of all I don't see how we could >> emit 1/2 of those errors. The whole problem here is that we don't know >> the type of the un-parsed object (and presumably don't want to eagerly >> know, it would mean hitting the object store). > > Forgetting for a moment how to trigger it with actual Git commands, the > root of the problem is that: > > lookup_tree(&oid); > lookup_blob(&oid); > > is going to produce an error message. But we cannot know which object > type is wrong and which is right (if any). So we'd want to produce the > "referred to as both" message. > > _If_ the caller happens to know that it has just parsed the object > contents and got a tree, then it would call lookup_parsed_tree(&oid), > which would pass along OBJECT_AS_TYPE_EXPECT_PARSED, and produce the > other message. > > In practice, of course those two lookup_foo() calls are not right next > to each other. But they may be triggered on an identical oid by two > references from different objects. [...] >> But when we do know why would we beat around the bush and say "was >> referred to as X and Y" once we know what it is. >> >> AFAICT there's no more reason to think that parse_object_buffer() will >> be wrong about the type than "git cat-file -t" will be. They both use >> the same underlying functions to get that information. > > My point is that we are not always coming from parse_object_buffer() > when we see these error messages. If my solution of relying on the parsed v.s. non-parsed shouldn't we just devolve to a full object info lookup when emitting the error? It's more expensive, but we're emitting an error anyway...
On Wed, Mar 31, 2021 at 10:46:22PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Neither of those types is the correct one. And the segfault is just a > > bonus! :) > > > > I'd expect similar cases with parsing commit parents and tree pointers. > > And probably tree entries whose modes are wrong. > > So the segfault happens without my patches, Yeah, sorry if that was unclear. It is definitely a pre-existing bug. > but the change is that > before we'd always get it wrong and say "commit, not a tree", but now > we'll get it right some of the time. Patching the relevant object.c code > to emit different messages from the various functions shows that it's > the oid_is_type*() functions that get it right, but object_as_type() is > wrong as before. > > So that's certainly something I missed. > > But are there any cases where it makes things worse? Or is it just that > it's not a full fix in all cases, but only a partial one? Right, I don't think your patch is making anything worse. It's just that it does not cover all cases where we see an object as two different types. Nor can it, since it is relying on code paths that actually parse the object, and not all of them do. > > My point is that we are not always coming from parse_object_buffer() > > when we see these error messages. > > If my solution of relying on the parsed v.s. non-parsed shouldn't we > just devolve to a full object info lookup when emitting the error? It's > more expensive, but we're emitting an error anyway... That's certainly one option (that I suggested earlier in [0]). If we go that route, then we do not need any of this "the caller passes in an extra bit to say that it is parsing the object, and it found a tree", because the error routine in object_as_type() would consult the odb itself. But I still think it does not make the error messages fully useful. We might say "object X is really a tree in the odb, but we previously saw it as a commit". But we will still have to return NULL from lookup_tree(), so whatever containing object referenced X, _even though it has the correct type_, will be the one to propagate the failure up the stack. It was whoever was responsible for that "previously saw" that is actually corrupt, and we no longer know who that was. Which is why I wonder if it is worth even bothering to put a lot of effort in here. If the issue is just that "X is a foo, not a bar" is sometimes misleading, then we could solve that by simply making the message more precise ("we saw X as a foo and a bar; one of them is wrong"). Even if we could know _which_ is wrong with respect to what's in the object contents, it isn't all that helpful without being able to tell the user which object reference was the one that led us to the wrong conclusion. -Peff [0] https://lore.kernel.org/git/YGBHH7sAVsPpVKWd@coredump.intra.peff.net/
diff --git a/blob.c b/blob.c index 182718aba9f..b233e0daa2f 100644 --- a/blob.c +++ b/blob.c @@ -2,17 +2,31 @@ #include "blob.h" #include "repository.h" #include "alloc.h" +#include "object-store.h" const char *blob_type = "blob"; -struct blob *lookup_blob(struct repository *r, const struct object_id *oid) +struct blob *lookup_blob_type(struct repository *r, + const struct object_id *oid, + enum object_type type) { struct object *obj = lookup_object(r, oid); if (!obj) return create_object(r, oid, alloc_blob_node(r)); + if (type != OBJ_NONE && + obj->type != OBJ_NONE) { + enum object_type want = OBJ_BLOB; + if (oid_is_type_or_error(oid, obj->type, &want)) + return NULL; + } return object_as_type(obj, OBJ_BLOB, 0); } +struct blob *lookup_blob(struct repository *r, const struct object_id *oid) +{ + return lookup_blob_type(r, oid, OBJ_NONE); +} + int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size) { item->object.parsed = 1; diff --git a/blob.h b/blob.h index 16648720557..066a2effcbf 100644 --- a/blob.h +++ b/blob.h @@ -10,6 +10,9 @@ struct blob { }; struct blob *lookup_blob(struct repository *r, const struct object_id *oid); +struct blob *lookup_blob_type(struct repository *r, + const struct object_id *oid, + enum object_type type); int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size); diff --git a/commit.c b/commit.c index b3701003678..ab6cee1e8c3 100644 --- a/commit.c +++ b/commit.c @@ -57,14 +57,26 @@ struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref return c; } -struct commit *lookup_commit(struct repository *r, const struct object_id *oid) +struct commit *lookup_commit_type(struct repository *r, const struct object_id *oid, + enum object_type type) { struct object *obj = lookup_object(r, oid); if (!obj) return create_object(r, oid, alloc_commit_node(r)); + if (type != OBJ_NONE && + obj->type != OBJ_NONE) { + enum object_type want = OBJ_COMMIT; + if (oid_is_type_or_error(oid, obj->type, &want)) + return NULL; + } return object_as_type(obj, OBJ_COMMIT, 0); } +struct commit *lookup_commit(struct repository *r, const struct object_id *oid) +{ + return lookup_commit_type(r, oid, OBJ_NONE); +} + struct commit *lookup_commit_reference_by_name(const char *name) { struct object_id oid; diff --git a/commit.h b/commit.h index df42eb434f3..9def4f3f19d 100644 --- a/commit.h +++ b/commit.h @@ -64,6 +64,8 @@ void add_name_decoration(enum decoration_type type, const char *name, struct obj const struct name_decoration *get_name_decoration(const struct object *obj); struct commit *lookup_commit(struct repository *r, const struct object_id *oid); +struct commit *lookup_commit_type(struct repository *r, const struct object_id *oid, + enum object_type type); struct commit *lookup_commit_reference(struct repository *r, const struct object_id *oid); struct commit *lookup_commit_reference_gently(struct repository *r, diff --git a/object.c b/object.c index 0f60743e61f..60037422488 100644 --- a/object.c +++ b/object.c @@ -230,14 +230,14 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id obj = NULL; if (type == OBJ_BLOB) { - struct blob *blob = lookup_blob(r, oid); + struct blob *blob = lookup_blob_type(r, oid, type); if (blob) { if (parse_blob_buffer(blob, buffer, size)) return NULL; obj = &blob->object; } } else if (type == OBJ_TREE) { - struct tree *tree = lookup_tree(r, oid); + struct tree *tree = lookup_tree_type(r, oid, type); if (tree) { obj = &tree->object; if (!tree->buffer) @@ -249,7 +249,7 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id } } } else if (type == OBJ_COMMIT) { - struct commit *commit = lookup_commit(r, oid); + struct commit *commit = lookup_commit_type(r, oid, type); if (commit) { if (parse_commit_buffer(r, commit, buffer, size, 1)) return NULL; @@ -260,7 +260,7 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id obj = &commit->object; } } else if (type == OBJ_TAG) { - struct tag *tag = lookup_tag(r, oid); + struct tag *tag = lookup_tag_type(r, oid, type); if (tag) { if (parse_tag_buffer(r, tag, buffer, size)) return NULL; diff --git a/t/t6102-rev-list-unexpected-objects.sh b/t/t6102-rev-list-unexpected-objects.sh index 2ea1982b9ed..4a6b3cc3b01 100755 --- a/t/t6102-rev-list-unexpected-objects.sh +++ b/t/t6102-rev-list-unexpected-objects.sh @@ -156,14 +156,14 @@ test_expect_success 'setup unexpected non-tag tag' ' test_expect_success 'traverse unexpected incorrectly typed tag (to commit & tag)' ' test_must_fail git rev-list --objects $tag_tag_commit 2>err && cat >expected <<-EOF && - error: object $commit is a tag, not a commit + error: object $commit is a commit, not a tag fatal: bad object $commit EOF test_cmp expected err && test_must_fail git rev-list --objects $commit_tag_tag 2>err && cat >expected <<-EOF && - error: object $tag_commit is a commit, not a tag + error: object $tag_commit is a tag, not a commit fatal: bad object $tag_commit EOF test_cmp expected err @@ -172,14 +172,14 @@ test_expect_success 'traverse unexpected incorrectly typed tag (to commit & tag) test_expect_success 'traverse unexpected incorrectly typed tag (to tree)' ' test_must_fail git rev-list --objects $tag_tag_tree 2>err && cat >expected <<-EOF && - error: object $tree is a tag, not a tree + error: object $tree is a tree, not a tag fatal: bad object $tree EOF test_cmp expected err && test_must_fail git rev-list --objects $commit_tag_tree 2>err && cat >expected <<-EOF && - error: object $tree is a commit, not a tree + error: object $tree is a tree, not a commit fatal: bad object $tree EOF test_cmp expected err @@ -188,14 +188,14 @@ test_expect_success 'traverse unexpected incorrectly typed tag (to tree)' ' test_expect_success 'traverse unexpected incorrectly typed tag (to blob)' ' test_must_fail git rev-list --objects $tag_tag_blob 2>err && cat >expected <<-EOF && - error: object $blob is a tag, not a blob + error: object $blob is a blob, not a tag fatal: bad object $blob EOF test_cmp expected err && test_must_fail git rev-list --objects $commit_tag_blob 2>err && cat >expected <<-EOF && - error: object $blob is a commit, not a blob + error: object $blob is a blob, not a commit fatal: bad object $blob EOF test_cmp expected err @@ -204,14 +204,14 @@ test_expect_success 'traverse unexpected incorrectly typed tag (to blob)' ' test_expect_success 'traverse unexpected non-tag tag (tree seen to blob)' ' test_must_fail git rev-list --objects $tree $commit_tag_blob 2>err && cat >expected <<-EOF && - error: object $blob is a commit, not a blob + error: object $blob is a blob, not a commit fatal: bad object $blob EOF test_cmp expected err && test_must_fail git rev-list --objects $tree $tag_tag_blob 2>err && cat >expected <<-EOF && - error: object $blob is a tag, not a blob + error: object $blob is a blob, not a tag fatal: bad object $blob EOF test_cmp expected err diff --git a/tag.c b/tag.c index 3e18a418414..0ef87897b29 100644 --- a/tag.c +++ b/tag.c @@ -99,14 +99,26 @@ struct object *deref_tag_noverify(struct object *o) return o; } -struct tag *lookup_tag(struct repository *r, const struct object_id *oid) +struct tag *lookup_tag_type(struct repository *r, const struct object_id *oid, + enum object_type type) { struct object *obj = lookup_object(r, oid); if (!obj) return create_object(r, oid, alloc_tag_node(r)); + if (type != OBJ_NONE && + obj->type != OBJ_NONE) { + enum object_type want = OBJ_TAG; + if (oid_is_type_or_error(oid, obj->type, &want)) + return NULL; + } return object_as_type(obj, OBJ_TAG, 0); } +struct tag *lookup_tag(struct repository *r, const struct object_id *oid) +{ + return lookup_tag_type(r, oid, OBJ_NONE); +} + static timestamp_t parse_tag_date(const char *buf, const char *tail) { const char *dateptr; diff --git a/tag.h b/tag.h index 3ce8e721924..42bd3e64011 100644 --- a/tag.h +++ b/tag.h @@ -12,6 +12,8 @@ struct tag { timestamp_t date; }; struct tag *lookup_tag(struct repository *r, const struct object_id *oid); +struct tag *lookup_tag_type(struct repository *r, const struct object_id *oid, + enum object_type type); int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, unsigned long size); int parse_tag(struct tag *item); void release_tag_memory(struct tag *t); diff --git a/tree.c b/tree.c index d9b1c70b28a..895c66420e8 100644 --- a/tree.c +++ b/tree.c @@ -195,14 +195,26 @@ int read_tree(struct repository *r, struct tree *tree, int stage, return 0; } -struct tree *lookup_tree(struct repository *r, const struct object_id *oid) +struct tree *lookup_tree_type(struct repository *r, const struct object_id *oid, + enum object_type type) { struct object *obj = lookup_object(r, oid); if (!obj) return create_object(r, oid, alloc_tree_node(r)); + if (type != OBJ_NONE && + obj->type != OBJ_NONE) { + enum object_type want = OBJ_TREE; + if (oid_is_type_or_error(oid, obj->type, &want)) + return NULL; + } return object_as_type(obj, OBJ_TREE, 0); } +struct tree *lookup_tree(struct repository *r, const struct object_id *oid) +{ + return lookup_tree_type(r, oid, OBJ_NONE); +} + int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size) { if (item->object.parsed) diff --git a/tree.h b/tree.h index 3eb0484cbf2..49bd44f79b3 100644 --- a/tree.h +++ b/tree.h @@ -15,6 +15,8 @@ struct tree { extern const char *tree_type; struct tree *lookup_tree(struct repository *r, const struct object_id *oid); +struct tree *lookup_tree_type(struct repository *r, const struct object_id *oid, + enum object_type type); int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size);
Fix a regression in 89e4202f982 ([PATCH] Parse tags for absent objects, 2005-06-21) (yes, that ancient!) and correctly report an error on a tag like: object <a tree hash> type commit As: error: object <a tree hash> is tree, not a commit Instead of our long-standing misbehavior of inverting the two, and reporting: error: object <a tree hash> is commit, not a tree Which, as can be trivially seen with 'git cat-file -t <a tree hash>' is incorrect. The reason for this misreporting is that in parse_tag_buffer() we end up doing a lookup_{blob,commit,tag,tree}() depending on what we read out of the "type" line. If we haven't parsed that object before we end up dispatching to the type-specific lookup functions, e.g. this for commit.c in lookup_commit_type(): struct object *obj = lookup_object(r, oid); if (!obj) return create_object(r, oid, alloc_commit_node(r)); Its allocation will then set the obj->type according to what the tag told us the type was, but which we've never validated. At this point we've got an object in memory that hasn't been parsed, and whose type is incorrect, since we mistrusted a tag to tell us the type. Then when we actually load the object with parse_object() we read it and find that it's a "tree". See 8ff226a9d5e (add object_as_type helper for casting objects, 2014-07-13) for that behavior (that's just a refactoring commit, but shows all the code involved). Which explains why we inverted the error report. Normally when object_as_type() is called it's by the lookup_{blob,commit,tag,tree}() functions via parse_object(). At that point we can trust the obj->type. In the case of parsing objects we've learned about via a tag with an incorrect type it's the opposite, the obj->type isn't correct and holds the mislabeled type, but we're parsing the object and know for sure what object type we're dealing with. Hence the non-intuitive solution of adding a lookup_{blob,commit,tag,tree}_type() function. It's to distinguish calls from parse_object_buffer() where we actually know the type, from a parse_tag_buffer() where we're just guessing about the type. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- blob.c | 16 +++++++++++++++- blob.h | 3 +++ commit.c | 14 +++++++++++++- commit.h | 2 ++ object.c | 8 ++++---- t/t6102-rev-list-unexpected-objects.sh | 16 ++++++++-------- tag.c | 14 +++++++++++++- tag.h | 2 ++ tree.c | 14 +++++++++++++- tree.h | 2 ++ 10 files changed, 75 insertions(+), 16 deletions(-)