mktag: don't check SHA-1 object length under SHA-256

Message ID	20201123120111.13567-1-avarab@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>, "brian m . carlson" <sandals@crustytoothpaste.net>, Eric Sunshine <sunshine@sunshineco.com>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> Subject: [PATCH] mktag: don't check SHA-1 object length under SHA-256 Date: Mon, 23 Nov 2020 13:01:11 +0100 Message-Id: <20201123120111.13567-1-avarab@gmail.com> In-Reply-To: <20190826014344.16008-11-sandals@crustytoothpaste.net> References: <20190826014344.16008-11-sandals@crustytoothpaste.net> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	mktag: don't check SHA-1 object length under SHA-256 \| expand mktag: don't check SHA-1 object length under SHA-256

Ævar Arnfjörð Bjarmason Nov. 23, 2020, 12:01 p.m. UTC

Change the hardcoded minimum tag length size to vary based on whether
we're operating in the SHA-1 or SHA-256 mode, and update the "mktag"
documentation, tests and code comments to reflect that this command
can take either SHA-1 or SHA-256 input.

Let's make the code more self-documenting by verbosely inlining what a
minimum tag looks like. The fixed-string strlen() will be optimized
away by any modern compiler. This also future-proofs us for any future
hash function transition.

Then change the tests amended in acb49d1cc8b (t3800: make hash-size
independent, 2019-08-18) even more for SHA-256. Some of the tests were
failing for the wrong reasons. E.g. yes <some sha-1 length garbage> is
an invalid SHA-1, but we should test <some sha256 length garbage> when
under SHA-256. For testing an invalid non-hexadecimal object ID let's
use ROT13.

For the "object line label check" test the "139e9b339..." SHA-1 didn't
exist, but what's actually being tested there is getting "xxxxxx"
instead of "object", so let's provide a valid SHA there
instead. There's no need to make or hardcode a nonexisting object ID.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-mktag.txt |  2 +-
 builtin/mktag.c             | 26 +++++++++++++-------
 t/t3800-mktag.sh            | 47 ++++++++++++++++++++++++++-----------
 3 files changed, 52 insertions(+), 23 deletions(-)

Junio C Hamano Nov. 23, 2020, 7:01 p.m. UTC | #1

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> diff --git a/Documentation/git-mktag.txt b/Documentation/git-mktag.txt
> index fa6a7561236..bbcc0a086bf 100644
> --- a/Documentation/git-mktag.txt
> +++ b/Documentation/git-mktag.txt
> @@ -23,7 +23,7 @@ Tag Format
>  A tag signature file, to be fed to this command's standard input,
>  has a very simple fixed format: four lines of
>  
> -  object <sha1>
> +  object <sha>

Perhaps <hash>? or <objectname>.

> diff --git a/builtin/mktag.c b/builtin/mktag.c
> index 4982d3a93ef..3fa17243e34 100644
> --- a/builtin/mktag.c
> +++ b/builtin/mktag.c
> @@ -5,13 +5,15 @@
>  
>  /*
>   * A signature file has a very simple fixed format: four lines
> - * of "object <sha1>" + "type <typename>" + "tag <tagname>" +
> + * of "object <sha>" + "type <typename>" + "tag <tagname>" +

Ditto.

>   * "tagger <committer>", followed by a blank line, a free-form tag
>   * message and a signature block that git itself doesn't care about,
>   * but that can be verified with gpg or similar.
>   *
> + * The first four lines are guaranteed to be either 83 or 107 bytes;
> + * depending on whether we're referencing a SHA-1 or SHA-256 tag.
> + *
> + * "object <sha1>\n" is 48 or a 72 bytes, "type tag\n" at 9 bytes is the

At least <sha> to be consistent with the above, or <hash>.

>   * shortest possible type-line, "tag .\n" at 6 bytes is the shortest
>   * single-character-tag line, and "tagger . <> 0 +0000\n" at 20 bytes is
>   * the shortest possible tagger-line.
> @@ -46,9 +48,17 @@ static int verify_tag(char *buffer, unsigned long size)
>  	struct object_id oid;
>  	const char *object, *type_line, *tag_line, *tagger_line, *lb, *rb, *p;
>  	size_t len;
> -
> -	if (size < 84)
> -		return error("wanna fool me ? you obviously got the size wrong !");
> +	int minimum_size =
> +		/* Minimum possible input, see t/t3800-mktag.sh */
> +		strlen("object ") + the_hash_algo->hexsz + strlen("\n") +
> +		strlen("type tag\n") +
> +		strlen("tag x\n") +
> +		strlen("tagger . <> 0 +0000\n") +
> +		strlen("\n");
> +
> +	if (size < minimum_size)
> +		return error("got %"PRIuMAX" bytes of input, need at least %d bytes",
> +			     size, minimum_size);

I agree with the patch that this message is not _("marked for
translation"), as it is output from a plumbing.

You need to cast "size" as "(uintmax_t)size" here, probably.  


> @@ -58,7 +68,7 @@ static int verify_tag(char *buffer, unsigned long size)
>  		return error("char%d: does not start with \"object \"", 0);
>  
>  	if (parse_oid_hex(object + 7, &oid, &p))
> -		return error("char%d: could not get SHA1 hash", 7);
> +		return error("char%d: expected object ID, got garbage", 7);

Here you say object ID, which is better than <hash> or <sha>.  Let's
be consistent (I'd say "object name" if I were choosing which to
use).

>  	/* Verify it for some basic sanity: it needs to start with
> -	   "object <sha1>\ntype\ntagger " */
> +	   "object <sha>\ntype\ntagger " */

Here it is <sha>.

> diff --git a/t/t3800-mktag.sh b/t/t3800-mktag.sh
> index d696aa4e52e..93a19bb8df9 100755
> --- a/t/t3800-mktag.sh
> +++ b/t/t3800-mktag.sh
> @@ -26,24 +26,42 @@ test_expect_success 'setup' '
>  	echo Hello >A &&
>  	git update-index --add A &&
>  	git commit -m "Initial commit" &&
> -	head=$(git rev-parse --verify HEAD)
> +	head=$(git rev-parse --verify HEAD) &&
> +
> +	git tag -m"some tag" annotated &&

A SP between -m and its argument, like four lines above?

> +	annotated=$(git rev-parse --verify annotated)
>  '
>  
>  ############################################################
>  #  1. length check
> +for objType in t ta tag

cute ;-)

> +do
> +	cat >tag.sig <<-EOF
> +	object $annotated
> +	type $objType
> +	tag x
> +	tagger . <> 0 +0000
> +
> +	EOF
> +	len=$(wc -c tag.sig)
> +
> +	if test $objType = "tag"
> +	then
> +		test_expect_success "Tag object length check $len passed" '

Here $len may see excess leading whitespace depending on what "wc
-c" on the platform does, but the only effect of that is a bit
uglier test title, so it would be oK.

> +			git mktag <tag.sig >.git/refs/tags/x 2>message &&

Do we use "message" in any way?  

Let's not write directly into filesystem; instead do

			tagobj=$(git mktag <tag.sig) &&
			git update-ref refs/tags/x $tagobj &&

or something like that.

> +			git rev-parse refs/tags/x
> +		'
> +	else
> +		check_verify_failure "Tag object length check on length $len" \
> +			'^error: got .* bytes of input, need at least'

OK.  This is not an issue with this patch, but I do not know why we
want a subshell in check_verify_failure.  May want to clean it up
but not in this patch.

> +	fi
> +done
>  
>  ############################################################
>  #  2. object line label check
>  
>  cat >tag.sig <<EOF
> -xxxxxx 139e9b33986b1c2670fff52c5067603117b3e895
> +xxxxxx $head
>  type tag
>  tag mytag
>  tagger . <> 0 +0000
> @@ -53,17 +71,18 @@ EOF
>  check_verify_failure '"object" line label check' '^error: char0: .*"object "$'
>  
>  ############################################################
> -#  3. object line SHA1 check
> +#  3. object line SHA check

You say "object" or "object ID" to the tester below; let's use that
consistently instead of SHA.

> +invalid_sha=$(echo $head | tr A-Za-z N-ZA-Mn-za-m)
>  cat >tag.sig <<EOF
> -object zz9e9b33986b1c2670fff52c5067603117b3e895
> +object $invalid_sha
>  type tag
>  tag mytag
>  tagger . <> 0 +0000
>  
>  EOF
>  
> -check_verify_failure '"object" line SHA1 check' '^error: char7: .*SHA1 hash$'
> +check_verify_failure '"object" line object check' '^error: char7: .*expected object ID, got garbage'
>  
>  ############################################################
>  #  4. type line label check
> @@ -125,7 +144,7 @@ check_verify_failure '"type" line type-name length check' \
>  	'^error: char.*: type too long$'
>  
>  ############################################################
> -#  9. verify object (SHA1/type) check
> +#  9. verify object (SHA/type) check
>  
>  cat >tag.sig <<EOF
>  object $(test_oid deadbeef)
> @@ -135,7 +154,7 @@ tagger . <> 0 +0000
>  
>  EOF
>  
> -check_verify_failure 'verify object (SHA1/type) check' \
> +check_verify_failure 'verify object (SHA/type) check' \
>  	'^error: char7: could not verify object.*$'
>  
>  ############################################################

Ditto.

Thanks.

Jeff King Nov. 23, 2020, 9:34 p.m. UTC | #2

On Mon, Nov 23, 2020 at 01:01:11PM +0100, Ævar Arnfjörð Bjarmason wrote:

> @@ -46,9 +48,17 @@ static int verify_tag(char *buffer, unsigned long size)
>  	struct object_id oid;
>  	const char *object, *type_line, *tag_line, *tagger_line, *lb, *rb, *p;
>  	size_t len;
> -
> -	if (size < 84)
> -		return error("wanna fool me ? you obviously got the size wrong !");
> +	int minimum_size =
> +		/* Minimum possible input, see t/t3800-mktag.sh */
> +		strlen("object ") + the_hash_algo->hexsz + strlen("\n") +
> +		strlen("type tag\n") +
> +		strlen("tag x\n") +
> +		strlen("tagger . <> 0 +0000\n") +
> +		strlen("\n");

This is an obvious improvement, but I have to wonder whether this kind
of "minimum size" up-front check is really buying us much. We do a lot
of fixed-size memcmps in the rest of the function.

In the early bits, like:

  if (memcmp(type_line - 1, "\ntype ", 6))
          return error("char%d: could not find \"\\ntype \"", 47);

that is saving us from going off the edge of a too-small input (though
wow, that bare "47" is really horrific).

But later, we'll do this:

  if (memcmp(tagger_line, "tagger ", 7))
           return error("char%"PRIuMAX": could not find \"tagger \"",
                   (uintmax_t) (tagger_line - buffer));

We've already parsed "tag x" at this point, and "x" can be arbitrarily
long. Are we sure we have 7 bytes of the buffer left?

The buffer always has a trailing NUL in it, so it would never match but
memcmp() is allowed to look at the whole thing (e.g., it can start at
the end, or do quadword comparisons). And that usually wouldn't matter,
but may depending on how the strbuf grows, and whether our heap buffer
is near the end of the last allocate page. Try this:

  git init
  git commit --allow-empty -m foo
  {
    echo "object $(git rev-parse HEAD)"
    echo "type commit"
    perl -e 'print "tag "; print "a" x 4026; print "\n"'
  } | git mktag

which ASan will complain about, since the "tagger" memcmp reads past the
end of the buffer (I found 4026 by running that in a loop of 1..4096).

I think this whole thing would be better written to just parse
left-to-right within the buffer, using bits like skip_prefix() that rely
on us having a trailing NUL (which we always will, and there should not
be one inside the header section).

This is how fsck_tag() does it. In fact, it seems like we could just
reuse fsck_tag() here. It wants to pass an oid to report(), but it would
be OK to use null_oid here; ultimately it just ends up back in our
callback error_func().

-Peff

Jeff King Nov. 23, 2020, 9:36 p.m. UTC | #3

On Mon, Nov 23, 2020 at 11:01:57AM -0800, Junio C Hamano wrote:

> > @@ -58,7 +68,7 @@ static int verify_tag(char *buffer, unsigned long size)
> >  		return error("char%d: does not start with \"object \"", 0);
> >  
> >  	if (parse_oid_hex(object + 7, &oid, &p))
> > -		return error("char%d: could not get SHA1 hash", 7);
> > +		return error("char%d: expected object ID, got garbage", 7);
> 
> Here you say object ID, which is better than <hash> or <sha>.  Let's
> be consistent (I'd say "object name" if I were choosing which to
> use).

It might just be me, but "object name" makes me think we'd take any name
(e.g., a refname that resolves to an object), whereas "object id" would
mean the object's hash specifically. And in this instance we only allow
the latter.

I agree very much with your other comments that if we are changing
these, we should get away from <sha> completely.

-Peff

Junio C Hamano Nov. 23, 2020, 10:17 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> It might just be me, but "object name" makes me think we'd take any name
> (e.g., a refname that resolves to an object), whereas "object id" would
> mean the object's hash specifically. And in this instance we only allow
> the latter.

Yeah, but glossary-content is very much explicit about this.  "name"
is the full hexadecimal hash, "identifier" is a synonym for it.  And
"id" does not even appear to be defined.  We used to call "any name
that refers to an object" an "extended SHA-1", but I haven't seen
the phrase used for a long time on the list.

> I agree very much with your other comments that if we are changing
> these, we should get away from <sha> completely.

Yup.

I've always found it cumbersome that, when I want to mean a full hex
representation, I have to say "40-byte object name".  It is not even
technically correct these days with SHA-256.

We'd probably need to update the glossary to make sure we have ways
to refer to "a way, any way, to name an object" and to "the
hexadecimal representation of the full hash value that refers to an
object".  Here is my attempt to redefine "object ID" to be the
narrower one, while losing "usually a 40-character hex" from "object
name".  I am not sure if I like the result as a whole, but it
certainly is nice to have a word or phrase shorter than "full hex
object name" to refer to it.

Thanks.

 Documentation/glossary-content.txt | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git c/Documentation/glossary-content.txt w/Documentation/glossary-content.txt
index 090c888335..e2ab920911 100644
--- c/Documentation/glossary-content.txt
+++ w/Documentation/glossary-content.txt
@@ -262,13 +262,20 @@ This commit is referred to as a "merge commit", or sometimes just a
 	identified by its <<def_object_name,object name>>. The objects usually
 	live in `$GIT_DIR/objects/`.

+[[def_object_id]]object ID::
+	Synonym for <<def_object_identifier,object identifier>>.
+
 [[def_object_identifier]]object identifier::
-	Synonym for <<def_object_name,object name>>.
+	An <<def_object_name, object name>> written as an
+	unabbreviated hexadecimal representation of the hash value
+	that uniquely identifies an <<def_object,object>>.
+	Also colloquially called <<def_SHA1,SHA-1>>.

 [[def_object_name]]object name::
-	The unique identifier of an <<def_object,object>>.  The
-	object name is usually represented by a 40 character
-	hexadecimal string.  Also colloquially called <<def_SHA1,SHA-1>>.
+	A name that identifies an <<def_object,object>> uniquely,
+	which can be given in various ways, including but not
+	limited to, the object's full <<def_object_identifier,object
+	identifier>>, a <<def_ref,ref>> that refers to the object.

 [[def_object_type]]object type::
 	One of the identifiers "<<def_commit_object,commit>>",

Jeff King Nov. 24, 2020, 12:47 a.m. UTC | #5

On Mon, Nov 23, 2020 at 02:17:32PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It might just be me, but "object name" makes me think we'd take any name
> > (e.g., a refname that resolves to an object), whereas "object id" would
> > mean the object's hash specifically. And in this instance we only allow
> > the latter.
> 
> Yeah, but glossary-content is very much explicit about this.  "name"
> is the full hexadecimal hash, "identifier" is a synonym for it.  And
> "id" does not even appear to be defined.  We used to call "any name
> that refers to an object" an "extended SHA-1", but I haven't seen
> the phrase used for a long time on the list.

OK, then my "it might just be me" is clearly just me. :)

I think the distinction I laid out earlier is nicer, but it may not be
worth the trouble of trying to _change_ nomenclature at this point.
Let's see what your glossary changes say...

> I've always found it cumbersome that, when I want to mean a full hex
> representation, I have to say "40-byte object name".  It is not even
> technically correct these days with SHA-256.

Yeah, that is both long and wrong. I'd maybe say "hex object id" in some
cases, which is slightly less cumbersome and extends to sha256. And
distinguishes it from a binary object id (though see below).

>  Documentation/glossary-content.txt | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git c/Documentation/glossary-content.txt w/Documentation/glossary-content.txt
> index 090c888335..e2ab920911 100644
> --- c/Documentation/glossary-content.txt
> +++ w/Documentation/glossary-content.txt
> @@ -262,13 +262,20 @@ This commit is referred to as a "merge commit", or sometimes just a
>  	identified by its <<def_object_name,object name>>. The objects usually
>  	live in `$GIT_DIR/objects/`.
>  
> +[[def_object_id]]object ID::
> +	Synonym for <<def_object_identifier,object identifier>>.
> +
>  [[def_object_identifier]]object identifier::
> -	Synonym for <<def_object_name,object name>>.
> +	An <<def_object_name, object name>> written as an
> +	unabbreviated hexadecimal representation of the hash value
> +	that uniquely identifies an <<def_object,object>>.
> +	Also colloquially called <<def_SHA1,SHA-1>>.

You might want to touch on "binary" here, too, with something like:

  This may also be used to refer to the binary representation of the
  hash value (e.g., as found within Git trees). Unless specified, this
  term typically implies the hexadecimal representation.

But maybe that is overkill. When we are talking about the command line
interface, stdin, etc, I can't think of a place where we'd take the
binary ("hash-object -t tree", but I don't really count that).

>  [[def_object_name]]object name::
> -	The unique identifier of an <<def_object,object>>.  The
> -	object name is usually represented by a 40 character
> -	hexadecimal string.  Also colloquially called <<def_SHA1,SHA-1>>.
> +	A name that identifies an <<def_object,object>> uniquely,
> +	which can be given in various ways, including but not
> +	limited to, the object's full <<def_object_identifier,object
> +	identifier>>, a <<def_ref,ref>> that refers to the object.

This all seems like an improvement to me, though the real question is
how often the term "object name" appears in the _other_ manpages to
refer to the more limited case.

-Peff

brian m. carlson Nov. 24, 2020, 1:07 a.m. UTC | #6

On 2020-11-23 at 21:34:21, Jeff King wrote:
> This is how fsck_tag() does it. In fact, it seems like we could just
> reuse fsck_tag() here. It wants to pass an oid to report(), but it would
> be OK to use null_oid here; ultimately it just ends up back in our
> callback error_func().

I would very much appreciate getting rid of this custom code.  It was a
pain during the SHA-256 transition, and even with significant effort, it
appears I missed a few parts.  Also, while I'm for casual language, its
use here is going to be a little hard to understand for non-native
speakers.

Moreover, when we start mapping objects to write the SHA-1 values into
the loose object index (which is a series I'm slowly working on), most
of this custom code is going to have to disappear or be coalesced
anyway.  If we can use an existing, more robust function now, so much
the better.

mktag: don't check SHA-1 object length under SHA-256

Commit Message

Comments

Patch