Message ID | 20201123120111.13567-1-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mktag: don't check SHA-1 object length under SHA-256 | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > diff --git a/Documentation/git-mktag.txt b/Documentation/git-mktag.txt > index fa6a7561236..bbcc0a086bf 100644 > --- a/Documentation/git-mktag.txt > +++ b/Documentation/git-mktag.txt > @@ -23,7 +23,7 @@ Tag Format > A tag signature file, to be fed to this command's standard input, > has a very simple fixed format: four lines of > > - object <sha1> > + object <sha> Perhaps <hash>? or <objectname>. > diff --git a/builtin/mktag.c b/builtin/mktag.c > index 4982d3a93ef..3fa17243e34 100644 > --- a/builtin/mktag.c > +++ b/builtin/mktag.c > @@ -5,13 +5,15 @@ > > /* > * A signature file has a very simple fixed format: four lines > - * of "object <sha1>" + "type <typename>" + "tag <tagname>" + > + * of "object <sha>" + "type <typename>" + "tag <tagname>" + Ditto. > * "tagger <committer>", followed by a blank line, a free-form tag > * message and a signature block that git itself doesn't care about, > * but that can be verified with gpg or similar. > * > + * The first four lines are guaranteed to be either 83 or 107 bytes; > + * depending on whether we're referencing a SHA-1 or SHA-256 tag. > + * > + * "object <sha1>\n" is 48 or a 72 bytes, "type tag\n" at 9 bytes is the At least <sha> to be consistent with the above, or <hash>. > * shortest possible type-line, "tag .\n" at 6 bytes is the shortest > * single-character-tag line, and "tagger . <> 0 +0000\n" at 20 bytes is > * the shortest possible tagger-line. > @@ -46,9 +48,17 @@ static int verify_tag(char *buffer, unsigned long size) > struct object_id oid; > const char *object, *type_line, *tag_line, *tagger_line, *lb, *rb, *p; > size_t len; > - > - if (size < 84) > - return error("wanna fool me ? you obviously got the size wrong !"); > + int minimum_size = > + /* Minimum possible input, see t/t3800-mktag.sh */ > + strlen("object ") + the_hash_algo->hexsz + strlen("\n") + > + strlen("type tag\n") + > + strlen("tag x\n") + > + strlen("tagger . <> 0 +0000\n") + > + strlen("\n"); > + > + if (size < minimum_size) > + return error("got %"PRIuMAX" bytes of input, need at least %d bytes", > + size, minimum_size); I agree with the patch that this message is not _("marked for translation"), as it is output from a plumbing. You need to cast "size" as "(uintmax_t)size" here, probably. > @@ -58,7 +68,7 @@ static int verify_tag(char *buffer, unsigned long size) > return error("char%d: does not start with \"object \"", 0); > > if (parse_oid_hex(object + 7, &oid, &p)) > - return error("char%d: could not get SHA1 hash", 7); > + return error("char%d: expected object ID, got garbage", 7); Here you say object ID, which is better than <hash> or <sha>. Let's be consistent (I'd say "object name" if I were choosing which to use). > /* Verify it for some basic sanity: it needs to start with > - "object <sha1>\ntype\ntagger " */ > + "object <sha>\ntype\ntagger " */ Here it is <sha>. > diff --git a/t/t3800-mktag.sh b/t/t3800-mktag.sh > index d696aa4e52e..93a19bb8df9 100755 > --- a/t/t3800-mktag.sh > +++ b/t/t3800-mktag.sh > @@ -26,24 +26,42 @@ test_expect_success 'setup' ' > echo Hello >A && > git update-index --add A && > git commit -m "Initial commit" && > - head=$(git rev-parse --verify HEAD) > + head=$(git rev-parse --verify HEAD) && > + > + git tag -m"some tag" annotated && A SP between -m and its argument, like four lines above? > + annotated=$(git rev-parse --verify annotated) > ' > > ############################################################ > # 1. length check > +for objType in t ta tag cute ;-) > +do > + cat >tag.sig <<-EOF > + object $annotated > + type $objType > + tag x > + tagger . <> 0 +0000 > + > + EOF > + len=$(wc -c tag.sig) > + > + if test $objType = "tag" > + then > + test_expect_success "Tag object length check $len passed" ' Here $len may see excess leading whitespace depending on what "wc -c" on the platform does, but the only effect of that is a bit uglier test title, so it would be oK. > + git mktag <tag.sig >.git/refs/tags/x 2>message && Do we use "message" in any way? Let's not write directly into filesystem; instead do tagobj=$(git mktag <tag.sig) && git update-ref refs/tags/x $tagobj && or something like that. > + git rev-parse refs/tags/x > + ' > + else > + check_verify_failure "Tag object length check on length $len" \ > + '^error: got .* bytes of input, need at least' OK. This is not an issue with this patch, but I do not know why we want a subshell in check_verify_failure. May want to clean it up but not in this patch. > + fi > +done > > ############################################################ > # 2. object line label check > > cat >tag.sig <<EOF > -xxxxxx 139e9b33986b1c2670fff52c5067603117b3e895 > +xxxxxx $head > type tag > tag mytag > tagger . <> 0 +0000 > @@ -53,17 +71,18 @@ EOF > check_verify_failure '"object" line label check' '^error: char0: .*"object "$' > > ############################################################ > -# 3. object line SHA1 check > +# 3. object line SHA check You say "object" or "object ID" to the tester below; let's use that consistently instead of SHA. > +invalid_sha=$(echo $head | tr A-Za-z N-ZA-Mn-za-m) > cat >tag.sig <<EOF > -object zz9e9b33986b1c2670fff52c5067603117b3e895 > +object $invalid_sha > type tag > tag mytag > tagger . <> 0 +0000 > > EOF > > -check_verify_failure '"object" line SHA1 check' '^error: char7: .*SHA1 hash$' > +check_verify_failure '"object" line object check' '^error: char7: .*expected object ID, got garbage' > > ############################################################ > # 4. type line label check > @@ -125,7 +144,7 @@ check_verify_failure '"type" line type-name length check' \ > '^error: char.*: type too long$' > > ############################################################ > -# 9. verify object (SHA1/type) check > +# 9. verify object (SHA/type) check > > cat >tag.sig <<EOF > object $(test_oid deadbeef) > @@ -135,7 +154,7 @@ tagger . <> 0 +0000 > > EOF > > -check_verify_failure 'verify object (SHA1/type) check' \ > +check_verify_failure 'verify object (SHA/type) check' \ > '^error: char7: could not verify object.*$' > > ############################################################ Ditto. Thanks.
On Mon, Nov 23, 2020 at 01:01:11PM +0100, Ævar Arnfjörð Bjarmason wrote: > @@ -46,9 +48,17 @@ static int verify_tag(char *buffer, unsigned long size) > struct object_id oid; > const char *object, *type_line, *tag_line, *tagger_line, *lb, *rb, *p; > size_t len; > - > - if (size < 84) > - return error("wanna fool me ? you obviously got the size wrong !"); > + int minimum_size = > + /* Minimum possible input, see t/t3800-mktag.sh */ > + strlen("object ") + the_hash_algo->hexsz + strlen("\n") + > + strlen("type tag\n") + > + strlen("tag x\n") + > + strlen("tagger . <> 0 +0000\n") + > + strlen("\n"); This is an obvious improvement, but I have to wonder whether this kind of "minimum size" up-front check is really buying us much. We do a lot of fixed-size memcmps in the rest of the function. In the early bits, like: if (memcmp(type_line - 1, "\ntype ", 6)) return error("char%d: could not find \"\\ntype \"", 47); that is saving us from going off the edge of a too-small input (though wow, that bare "47" is really horrific). But later, we'll do this: if (memcmp(tagger_line, "tagger ", 7)) return error("char%"PRIuMAX": could not find \"tagger \"", (uintmax_t) (tagger_line - buffer)); We've already parsed "tag x" at this point, and "x" can be arbitrarily long. Are we sure we have 7 bytes of the buffer left? The buffer always has a trailing NUL in it, so it would never match but memcmp() is allowed to look at the whole thing (e.g., it can start at the end, or do quadword comparisons). And that usually wouldn't matter, but may depending on how the strbuf grows, and whether our heap buffer is near the end of the last allocate page. Try this: git init git commit --allow-empty -m foo { echo "object $(git rev-parse HEAD)" echo "type commit" perl -e 'print "tag "; print "a" x 4026; print "\n"' } | git mktag which ASan will complain about, since the "tagger" memcmp reads past the end of the buffer (I found 4026 by running that in a loop of 1..4096). I think this whole thing would be better written to just parse left-to-right within the buffer, using bits like skip_prefix() that rely on us having a trailing NUL (which we always will, and there should not be one inside the header section). This is how fsck_tag() does it. In fact, it seems like we could just reuse fsck_tag() here. It wants to pass an oid to report(), but it would be OK to use null_oid here; ultimately it just ends up back in our callback error_func(). -Peff
On Mon, Nov 23, 2020 at 11:01:57AM -0800, Junio C Hamano wrote: > > @@ -58,7 +68,7 @@ static int verify_tag(char *buffer, unsigned long size) > > return error("char%d: does not start with \"object \"", 0); > > > > if (parse_oid_hex(object + 7, &oid, &p)) > > - return error("char%d: could not get SHA1 hash", 7); > > + return error("char%d: expected object ID, got garbage", 7); > > Here you say object ID, which is better than <hash> or <sha>. Let's > be consistent (I'd say "object name" if I were choosing which to > use). It might just be me, but "object name" makes me think we'd take any name (e.g., a refname that resolves to an object), whereas "object id" would mean the object's hash specifically. And in this instance we only allow the latter. I agree very much with your other comments that if we are changing these, we should get away from <sha> completely. -Peff
Jeff King <peff@peff.net> writes: > It might just be me, but "object name" makes me think we'd take any name > (e.g., a refname that resolves to an object), whereas "object id" would > mean the object's hash specifically. And in this instance we only allow > the latter. Yeah, but glossary-content is very much explicit about this. "name" is the full hexadecimal hash, "identifier" is a synonym for it. And "id" does not even appear to be defined. We used to call "any name that refers to an object" an "extended SHA-1", but I haven't seen the phrase used for a long time on the list. > I agree very much with your other comments that if we are changing > these, we should get away from <sha> completely. Yup. I've always found it cumbersome that, when I want to mean a full hex representation, I have to say "40-byte object name". It is not even technically correct these days with SHA-256. We'd probably need to update the glossary to make sure we have ways to refer to "a way, any way, to name an object" and to "the hexadecimal representation of the full hash value that refers to an object". Here is my attempt to redefine "object ID" to be the narrower one, while losing "usually a 40-character hex" from "object name". I am not sure if I like the result as a whole, but it certainly is nice to have a word or phrase shorter than "full hex object name" to refer to it. Thanks. Documentation/glossary-content.txt | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git c/Documentation/glossary-content.txt w/Documentation/glossary-content.txt index 090c888335..e2ab920911 100644 --- c/Documentation/glossary-content.txt +++ w/Documentation/glossary-content.txt @@ -262,13 +262,20 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. +[[def_object_id]]object ID:: + Synonym for <<def_object_identifier,object identifier>>. + [[def_object_identifier]]object identifier:: - Synonym for <<def_object_name,object name>>. + An <<def_object_name, object name>> written as an + unabbreviated hexadecimal representation of the hash value + that uniquely identifies an <<def_object,object>>. + Also colloquially called <<def_SHA1,SHA-1>>. [[def_object_name]]object name:: - The unique identifier of an <<def_object,object>>. The - object name is usually represented by a 40 character - hexadecimal string. Also colloquially called <<def_SHA1,SHA-1>>. + A name that identifies an <<def_object,object>> uniquely, + which can be given in various ways, including but not + limited to, the object's full <<def_object_identifier,object + identifier>>, a <<def_ref,ref>> that refers to the object. [[def_object_type]]object type:: One of the identifiers "<<def_commit_object,commit>>",
On Mon, Nov 23, 2020 at 02:17:32PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > It might just be me, but "object name" makes me think we'd take any name > > (e.g., a refname that resolves to an object), whereas "object id" would > > mean the object's hash specifically. And in this instance we only allow > > the latter. > > Yeah, but glossary-content is very much explicit about this. "name" > is the full hexadecimal hash, "identifier" is a synonym for it. And > "id" does not even appear to be defined. We used to call "any name > that refers to an object" an "extended SHA-1", but I haven't seen > the phrase used for a long time on the list. OK, then my "it might just be me" is clearly just me. :) I think the distinction I laid out earlier is nicer, but it may not be worth the trouble of trying to _change_ nomenclature at this point. Let's see what your glossary changes say... > I've always found it cumbersome that, when I want to mean a full hex > representation, I have to say "40-byte object name". It is not even > technically correct these days with SHA-256. Yeah, that is both long and wrong. I'd maybe say "hex object id" in some cases, which is slightly less cumbersome and extends to sha256. And distinguishes it from a binary object id (though see below). > Documentation/glossary-content.txt | 15 +++++++++++---- > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git c/Documentation/glossary-content.txt w/Documentation/glossary-content.txt > index 090c888335..e2ab920911 100644 > --- c/Documentation/glossary-content.txt > +++ w/Documentation/glossary-content.txt > @@ -262,13 +262,20 @@ This commit is referred to as a "merge commit", or sometimes just a > identified by its <<def_object_name,object name>>. The objects usually > live in `$GIT_DIR/objects/`. > > +[[def_object_id]]object ID:: > + Synonym for <<def_object_identifier,object identifier>>. > + > [[def_object_identifier]]object identifier:: > - Synonym for <<def_object_name,object name>>. > + An <<def_object_name, object name>> written as an > + unabbreviated hexadecimal representation of the hash value > + that uniquely identifies an <<def_object,object>>. > + Also colloquially called <<def_SHA1,SHA-1>>. You might want to touch on "binary" here, too, with something like: This may also be used to refer to the binary representation of the hash value (e.g., as found within Git trees). Unless specified, this term typically implies the hexadecimal representation. But maybe that is overkill. When we are talking about the command line interface, stdin, etc, I can't think of a place where we'd take the binary ("hash-object -t tree", but I don't really count that). > [[def_object_name]]object name:: > - The unique identifier of an <<def_object,object>>. The > - object name is usually represented by a 40 character > - hexadecimal string. Also colloquially called <<def_SHA1,SHA-1>>. > + A name that identifies an <<def_object,object>> uniquely, > + which can be given in various ways, including but not > + limited to, the object's full <<def_object_identifier,object > + identifier>>, a <<def_ref,ref>> that refers to the object. This all seems like an improvement to me, though the real question is how often the term "object name" appears in the _other_ manpages to refer to the more limited case. -Peff
On 2020-11-23 at 21:34:21, Jeff King wrote: > This is how fsck_tag() does it. In fact, it seems like we could just > reuse fsck_tag() here. It wants to pass an oid to report(), but it would > be OK to use null_oid here; ultimately it just ends up back in our > callback error_func(). I would very much appreciate getting rid of this custom code. It was a pain during the SHA-256 transition, and even with significant effort, it appears I missed a few parts. Also, while I'm for casual language, its use here is going to be a little hard to understand for non-native speakers. Moreover, when we start mapping objects to write the SHA-1 values into the loose object index (which is a series I'm slowly working on), most of this custom code is going to have to disappear or be coalesced anyway. If we can use an existing, more robust function now, so much the better.
diff --git a/Documentation/git-mktag.txt b/Documentation/git-mktag.txt index fa6a7561236..bbcc0a086bf 100644 --- a/Documentation/git-mktag.txt +++ b/Documentation/git-mktag.txt @@ -23,7 +23,7 @@ Tag Format A tag signature file, to be fed to this command's standard input, has a very simple fixed format: four lines of - object <sha1> + object <sha> type <typename> tag <tagname> tagger <tagger> diff --git a/builtin/mktag.c b/builtin/mktag.c index 4982d3a93ef..3fa17243e34 100644 --- a/builtin/mktag.c +++ b/builtin/mktag.c @@ -5,13 +5,15 @@ /* * A signature file has a very simple fixed format: four lines - * of "object <sha1>" + "type <typename>" + "tag <tagname>" + + * of "object <sha>" + "type <typename>" + "tag <tagname>" + * "tagger <committer>", followed by a blank line, a free-form tag * message and a signature block that git itself doesn't care about, * but that can be verified with gpg or similar. * - * The first four lines are guaranteed to be at least 83 bytes: - * "object <sha1>\n" is 48 bytes, "type tag\n" at 9 bytes is the + * The first four lines are guaranteed to be either 83 or 107 bytes; + * depending on whether we're referencing a SHA-1 or SHA-256 tag. + * + * "object <sha1>\n" is 48 or a 72 bytes, "type tag\n" at 9 bytes is the * shortest possible type-line, "tag .\n" at 6 bytes is the shortest * single-character-tag line, and "tagger . <> 0 +0000\n" at 20 bytes is * the shortest possible tagger-line. @@ -46,9 +48,17 @@ static int verify_tag(char *buffer, unsigned long size) struct object_id oid; const char *object, *type_line, *tag_line, *tagger_line, *lb, *rb, *p; size_t len; - - if (size < 84) - return error("wanna fool me ? you obviously got the size wrong !"); + int minimum_size = + /* Minimum possible input, see t/t3800-mktag.sh */ + strlen("object ") + the_hash_algo->hexsz + strlen("\n") + + strlen("type tag\n") + + strlen("tag x\n") + + strlen("tagger . <> 0 +0000\n") + + strlen("\n"); + + if (size < minimum_size) + return error("got %"PRIuMAX" bytes of input, need at least %d bytes", + size, minimum_size); buffer[size] = 0; @@ -58,7 +68,7 @@ static int verify_tag(char *buffer, unsigned long size) return error("char%d: does not start with \"object \"", 0); if (parse_oid_hex(object + 7, &oid, &p)) - return error("char%d: could not get SHA1 hash", 7); + return error("char%d: expected object ID, got garbage", 7); /* Verify type line */ type_line = p + 1; @@ -166,7 +176,7 @@ int cmd_mktag(int argc, const char **argv, const char *prefix) } /* Verify it for some basic sanity: it needs to start with - "object <sha1>\ntype\ntagger " */ + "object <sha>\ntype\ntagger " */ if (verify_tag(buf.buf, buf.len) < 0) die("invalid tag signature file"); diff --git a/t/t3800-mktag.sh b/t/t3800-mktag.sh index d696aa4e52e..93a19bb8df9 100755 --- a/t/t3800-mktag.sh +++ b/t/t3800-mktag.sh @@ -26,24 +26,42 @@ test_expect_success 'setup' ' echo Hello >A && git update-index --add A && git commit -m "Initial commit" && - head=$(git rev-parse --verify HEAD) + head=$(git rev-parse --verify HEAD) && + + git tag -m"some tag" annotated && + annotated=$(git rev-parse --verify annotated) ' ############################################################ # 1. length check - -cat >tag.sig <<EOF -too short for a tag -EOF - -check_verify_failure 'Tag object length check' \ - '^error: .*size wrong.*$' +for objType in t ta tag +do + cat >tag.sig <<-EOF + object $annotated + type $objType + tag x + tagger . <> 0 +0000 + + EOF + len=$(wc -c tag.sig) + + if test $objType = "tag" + then + test_expect_success "Tag object length check $len passed" ' + git mktag <tag.sig >.git/refs/tags/x 2>message && + git rev-parse refs/tags/x + ' + else + check_verify_failure "Tag object length check on length $len" \ + '^error: got .* bytes of input, need at least' + fi +done ############################################################ # 2. object line label check cat >tag.sig <<EOF -xxxxxx 139e9b33986b1c2670fff52c5067603117b3e895 +xxxxxx $head type tag tag mytag tagger . <> 0 +0000 @@ -53,17 +71,18 @@ EOF check_verify_failure '"object" line label check' '^error: char0: .*"object "$' ############################################################ -# 3. object line SHA1 check +# 3. object line SHA check +invalid_sha=$(echo $head | tr A-Za-z N-ZA-Mn-za-m) cat >tag.sig <<EOF -object zz9e9b33986b1c2670fff52c5067603117b3e895 +object $invalid_sha type tag tag mytag tagger . <> 0 +0000 EOF -check_verify_failure '"object" line SHA1 check' '^error: char7: .*SHA1 hash$' +check_verify_failure '"object" line object check' '^error: char7: .*expected object ID, got garbage' ############################################################ # 4. type line label check @@ -125,7 +144,7 @@ check_verify_failure '"type" line type-name length check' \ '^error: char.*: type too long$' ############################################################ -# 9. verify object (SHA1/type) check +# 9. verify object (SHA/type) check cat >tag.sig <<EOF object $(test_oid deadbeef) @@ -135,7 +154,7 @@ tagger . <> 0 +0000 EOF -check_verify_failure 'verify object (SHA1/type) check' \ +check_verify_failure 'verify object (SHA/type) check' \ '^error: char7: could not verify object.*$' ############################################################
Change the hardcoded minimum tag length size to vary based on whether we're operating in the SHA-1 or SHA-256 mode, and update the "mktag" documentation, tests and code comments to reflect that this command can take either SHA-1 or SHA-256 input. Let's make the code more self-documenting by verbosely inlining what a minimum tag looks like. The fixed-string strlen() will be optimized away by any modern compiler. This also future-proofs us for any future hash function transition. Then change the tests amended in acb49d1cc8b (t3800: make hash-size independent, 2019-08-18) even more for SHA-256. Some of the tests were failing for the wrong reasons. E.g. yes <some sha-1 length garbage> is an invalid SHA-1, but we should test <some sha256 length garbage> when under SHA-256. For testing an invalid non-hexadecimal object ID let's use ROT13. For the "object line label check" test the "139e9b339..." SHA-1 didn't exist, but what's actually being tested there is getting "xxxxxx" instead of "object", so let's provide a valid SHA there instead. There's no need to make or hardcode a nonexisting object ID. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/git-mktag.txt | 2 +- builtin/mktag.c | 26 +++++++++++++------- t/t3800-mktag.sh | 47 ++++++++++++++++++++++++++----------- 3 files changed, 52 insertions(+), 23 deletions(-)