diff mbox series

[2/2] pretty: add merge and exclude options to %(describe)

Message ID b7bd37c4-ab13-0297-da46-716e26de10d6@web.de (mailing list archive)
State New, archived
Headers show
Series [1/2] pretty: add %(describe) | expand

Commit Message

René Scharfe Feb. 14, 2021, 10:10 a.m. UTC
Allow restricting the tags used by the placeholder %(describe) with the
options match and exclude.  E.g. the following command describes the
current commit using official version tags, without those for release
candidates:

   $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/pretty-formats.txt | 13 ++++++++--
 pretty.c                         | 43 ++++++++++++++++++++++++++++++--
 t/t4205-log-pretty-formats.sh    | 16 ++++++++++++
 3 files changed, 68 insertions(+), 4 deletions(-)

--
2.30.1

Comments

Jeff King Feb. 17, 2021, 6:31 p.m. UTC | #1
On Sun, Feb 14, 2021 at 11:10:57AM +0100, René Scharfe. wrote:

> Allow restricting the tags used by the placeholder %(describe) with the
> options match and exclude.  E.g. the following command describes the
> current commit using official version tags, without those for release
> candidates:
> 
>    $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'

An interesting side effect of this series is that it allows remote users
asking for archives to fill in this data, too (by using export-subst
placeholders). That includes servers allowing "git archive --remote",
but also services like GitHub that will run git-archive on behalf of
clients.

I wonder what avenues for mischief this provides. Certainly using extra
CPU to run git-describe. But I guess also probing at otherwise hidden
refs using the match/exclude system (though since it's limited to
refs/tags/, that's pretty unlikely).

I present this mostly as an observation, not an objection. Certainly we
already have %D which can look at hidden refs. And I strongly suspect
that the server-side git-upload-archive does not respect hidden refs
when resolving object names in the first place.

Kind of an interesting thought as we extend the formatting language,
though. I generally think of them as something that is always under
control of caller, but export-subst's $Format$ will come from the repo
contents.

-Peff
René Scharfe Feb. 28, 2021, 11:22 a.m. UTC | #2
Am 17.02.21 um 19:38 schrieb Junio C Hamano:
> I wonder, in addition to "match" and "exclude", if we want to allow
> "always" as well.

I added "match" because that describe option is used by GIT-VERSION-GEN,
so I imagine it's generally useful for version names in spec files.  Not
sure why I misspelled it "merge" in the subject, though. o_O

"exclude" is not in there, but I threw it in anyway, as the example in
77d21f29ea (describe: teach describe negative pattern matches,
2017-01-18) made sense to me.  It complements "match" nicely.

I had "always" in proof-of-concept patch because I hadn't decided what
to do with commits that git describe doesn't find a description for,
and wanted to check the full output of git log --pretty='%(describe)'.
For a release tarball of a repo that lacks tags it would be easier to
use %h instead of %(describe:always) -- or tag the release.

That's why I didn't include "always" in the latest patches, but if it
turns out to be useful for someone then I wouldn't them adding it.

> Also, looking further into the future, I wonder if we should aim to
> eventually unify %h and %H as well as %(describe) into one,
> i.e. various ways to spell a commit object name, given that there is
> a separate effort underway to unify pretty and for-each-ref format
> strings.
>
> E.g. %(objectname) is the same as %H, and %(objectname:short) is the
> same as %h, so this might be %(objectname:describe), or something
> along the line.

According to the glossary and object name is:

        The unique identifier of an object.  The
        object name is usually represented by a 40 character
        hexadecimal string.  Also colloquially called SHA-1.

And that's how I understand it as well.  The object layer with it's
hashes on one hand and descriptions based on refs and commit relations
on the other are separate things in my mind.  %(describe) falling back
to %h when :always is given makes sense to me; %(objectname) "falling
forward" to show describe output feels like a layering violation.

René
René Scharfe Feb. 28, 2021, 11:22 a.m. UTC | #3
Am 17.02.21 um 19:31 schrieb Jeff King:
> On Sun, Feb 14, 2021 at 11:10:57AM +0100, René Scharfe. wrote:
>
>> Allow restricting the tags used by the placeholder %(describe) with the
>> options match and exclude.  E.g. the following command describes the
>> current commit using official version tags, without those for release
>> candidates:
>>
>>    $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'
>
> An interesting side effect of this series is that it allows remote users
> asking for archives to fill in this data, too (by using export-subst
> placeholders). That includes servers allowing "git archive --remote",
> but also services like GitHub that will run git-archive on behalf of
> clients.
>
> I wonder what avenues for mischief this provides. Certainly using extra
> CPU to run git-describe.

A repository can contain millions of files, each file can contain
millions of $Format:...$ sequences and each of them can contain millions
of %(describe) placeholders.  Each of them could have different match or
exclude args to prevent caching.  Allowing a single request to cause
trillions of calls of git describe sounds excessive.  Let's limit this.

-- >8 --
Subject: [PATCH] archive: expand only a single %(describe) per archive

Every %(describe) placeholder in $Format:...$ strings in files with the
attribute export-subst is expanded by calling git describe.  This can
potentially result in a lot of such calls per archive.  That's OK for
local repositories under control of the user of git archive, but could
be a problem for hosted repositories.

Expand only a single %(describe) placeholder per archive for now to
avoid denial-of-service attacks.  We can make this limit configurable
later if needed, but let's start out simple.

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/gitattributes.txt |  3 ++-
 archive.c                       | 16 ++++++++++------
 archive.h                       |  2 ++
 pretty.c                        |  8 ++++++++
 pretty.h                        |  5 +++++
 t/t5001-archive-attr.sh         | 14 ++++++++++++++
 6 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index e84e104f93..0a60472bb5 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -1174,7 +1174,8 @@ tag then no replacement will be done.  The placeholders are the same
 as those for the option `--pretty=format:` of linkgit:git-log[1],
 except that they need to be wrapped like this: `$Format:PLACEHOLDERS$`
 in the file.  E.g. the string `$Format:%H$` will be replaced by the
-commit hash.
+commit hash.  However, only one `%(describe)` placeholder is expanded
+per archive to avoid denial-of-service attacks.


 Packing objects
diff --git a/archive.c b/archive.c
index 5919d9e505..2dd2236ae0 100644
--- a/archive.c
+++ b/archive.c
@@ -37,13 +37,10 @@ void init_archivers(void)

 static void format_subst(const struct commit *commit,
 			 const char *src, size_t len,
-			 struct strbuf *buf)
+			 struct strbuf *buf, struct pretty_print_context *ctx)
 {
 	char *to_free = NULL;
 	struct strbuf fmt = STRBUF_INIT;
-	struct pretty_print_context ctx = {0};
-	ctx.date_mode.type = DATE_NORMAL;
-	ctx.abbrev = DEFAULT_ABBREV;

 	if (src == buf->buf)
 		to_free = strbuf_detach(buf, NULL);
@@ -61,7 +58,7 @@ static void format_subst(const struct commit *commit,
 		strbuf_add(&fmt, b + 8, c - b - 8);

 		strbuf_add(buf, src, b - src);
-		format_commit_message(commit, fmt.buf, buf, &ctx);
+		format_commit_message(commit, fmt.buf, buf, ctx);
 		len -= c + 1 - src;
 		src  = c + 1;
 	}
@@ -94,7 +91,7 @@ static void *object_file_to_archive(const struct archiver_args *args,
 		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
 		convert_to_working_tree(args->repo->index, path, buf.buf, buf.len, &buf, &meta);
 		if (commit)
-			format_subst(commit, buf.buf, buf.len, &buf);
+			format_subst(commit, buf.buf, buf.len, &buf, args->pretty_ctx);
 		buffer = strbuf_detach(&buf, &size);
 		*sizep = size;
 	}
@@ -633,12 +630,19 @@ int write_archive(int argc, const char **argv, const char *prefix,
 		  const char *name_hint, int remote)
 {
 	const struct archiver *ar = NULL;
+	struct pretty_print_describe_status describe_status = {0};
+	struct pretty_print_context ctx = {0};
 	struct archiver_args args;
 	int rc;

 	git_config_get_bool("uploadarchive.allowunreachable", &remote_allow_unreachable);
 	git_config(git_default_config, NULL);

+	describe_status.max_invocations = 1;
+	ctx.date_mode.type = DATE_NORMAL;
+	ctx.abbrev = DEFAULT_ABBREV;
+	ctx.describe_status = &describe_status;
+	args.pretty_ctx = &ctx;
 	args.repo = repo;
 	args.prefix = prefix;
 	string_list_init(&args.extra_files, 1);
diff --git a/archive.h b/archive.h
index 33551b7ee1..49fab71aaf 100644
--- a/archive.h
+++ b/archive.h
@@ -5,6 +5,7 @@
 #include "pathspec.h"

 struct repository;
+struct pretty_print_context;

 struct archiver_args {
 	struct repository *repo;
@@ -22,6 +23,7 @@ struct archiver_args {
 	unsigned int convert : 1;
 	int compression_level;
 	struct string_list extra_files;
+	struct pretty_print_context *pretty_ctx;
 };

 /* main api */
diff --git a/pretty.c b/pretty.c
index c612d2ac9b..032e89cd4e 100644
--- a/pretty.c
+++ b/pretty.c
@@ -1247,6 +1247,14 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		struct child_process cmd = CHILD_PROCESS_INIT;
 		struct strbuf out = STRBUF_INIT;
 		struct strbuf err = STRBUF_INIT;
+		struct pretty_print_describe_status *describe_status;
+
+		describe_status = c->pretty_ctx->describe_status;
+		if (describe_status) {
+			if (!describe_status->max_invocations)
+				return 0;
+			describe_status->max_invocations--;
+		}

 		cmd.git_cmd = 1;
 		strvec_push(&cmd.args, "describe");
diff --git a/pretty.h b/pretty.h
index 7ce6c0b437..27b15947ff 100644
--- a/pretty.h
+++ b/pretty.h
@@ -23,6 +23,10 @@ enum cmit_fmt {
 	CMIT_FMT_UNSPECIFIED
 };

+struct pretty_print_describe_status {
+	unsigned int max_invocations;
+};
+
 struct pretty_print_context {
 	/*
 	 * Callers should tweak these to change the behavior of pp_* functions.
@@ -44,6 +48,7 @@ struct pretty_print_context {
 	int color;
 	struct ident_split *from_ident;
 	unsigned encode_email_headers:1;
+	struct pretty_print_describe_status *describe_status;

 	/*
 	 * Fields below here are manipulated internally by pp_* functions and
diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh
index e9aa97117a..1c9ce3956b 100755
--- a/t/t5001-archive-attr.sh
+++ b/t/t5001-archive-attr.sh
@@ -128,4 +128,18 @@ test_expect_success 'export-subst' '
 	test_cmp substfile2 archive/substfile2
 '

+test_expect_success 'export-subst expands %(describe) once' '
+	echo "\$Format:%(describe)\$" >substfile3 &&
+	echo "\$Format:%(describe)\$" >>substfile3 &&
+	echo "\$Format:%(describe)${LF}%(describe)\$" >substfile4 &&
+	git add substfile[34] &&
+	git commit -m export-subst-describe &&
+	git tag -m export-subst-describe export-subst-describe &&
+	git archive HEAD >archive-describe.tar &&
+	extract_tar_to_dir archive-describe &&
+	desc=$(git describe) &&
+	grep -F "$desc" archive-describe/substfile[34] >substituted &&
+	test_line_count = 1 substituted
+'
+
 test_done
--
2.30.1
Ævar Arnfjörð Bjarmason Feb. 28, 2021, 3:41 p.m. UTC | #4
On Sun, Feb 28 2021, René Scharfe. wrote:

> Am 17.02.21 um 19:31 schrieb Jeff King:
>> On Sun, Feb 14, 2021 at 11:10:57AM +0100, René Scharfe. wrote:
>>
>>> Allow restricting the tags used by the placeholder %(describe) with the
>>> options match and exclude.  E.g. the following command describes the
>>> current commit using official version tags, without those for release
>>> candidates:
>>>
>>>    $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'
>>
>> An interesting side effect of this series is that it allows remote users
>> asking for archives to fill in this data, too (by using export-subst
>> placeholders). That includes servers allowing "git archive --remote",
>> but also services like GitHub that will run git-archive on behalf of
>> clients.
>>
>> I wonder what avenues for mischief this provides. Certainly using extra
>> CPU to run git-describe.
>
> A repository can contain millions of files, each file can contain
> millions of $Format:...$ sequences and each of them can contain millions
> of %(describe) placeholders.  Each of them could have different match or
> exclude args to prevent caching.  Allowing a single request to cause
> trillions of calls of git describe sounds excessive.  Let's limit this.
>
> -- >8 --
> Subject: [PATCH] archive: expand only a single %(describe) per archive
>
> Every %(describe) placeholder in $Format:...$ strings in files with the
> attribute export-subst is expanded by calling git describe.  This can
> potentially result in a lot of such calls per archive.  That's OK for
> local repositories under control of the user of git archive, but could
> be a problem for hosted repositories.
>
> Expand only a single %(describe) placeholder per archive for now to
> avoid denial-of-service attacks.  We can make this limit configurable
> later if needed, but let's start out simple.
>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  Documentation/gitattributes.txt |  3 ++-
>  archive.c                       | 16 ++++++++++------
>  archive.h                       |  2 ++
>  pretty.c                        |  8 ++++++++
>  pretty.h                        |  5 +++++
>  t/t5001-archive-attr.sh         | 14 ++++++++++++++
>  6 files changed, 41 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index e84e104f93..0a60472bb5 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -1174,7 +1174,8 @@ tag then no replacement will be done.  The placeholders are the same
>  as those for the option `--pretty=format:` of linkgit:git-log[1],
>  except that they need to be wrapped like this: `$Format:PLACEHOLDERS$`
>  in the file.  E.g. the string `$Format:%H$` will be replaced by the
> -commit hash.
> +commit hash.  However, only one `%(describe)` placeholder is expanded
> +per archive to avoid denial-of-service attacks.
>
>
>  Packing objects
> diff --git a/archive.c b/archive.c
> index 5919d9e505..2dd2236ae0 100644
> --- a/archive.c
> +++ b/archive.c
> @@ -37,13 +37,10 @@ void init_archivers(void)
>
>  static void format_subst(const struct commit *commit,
>  			 const char *src, size_t len,
> -			 struct strbuf *buf)
> +			 struct strbuf *buf, struct pretty_print_context *ctx)
>  {
>  	char *to_free = NULL;
>  	struct strbuf fmt = STRBUF_INIT;
> -	struct pretty_print_context ctx = {0};
> -	ctx.date_mode.type = DATE_NORMAL;
> -	ctx.abbrev = DEFAULT_ABBREV;
>
>  	if (src == buf->buf)
>  		to_free = strbuf_detach(buf, NULL);
> @@ -61,7 +58,7 @@ static void format_subst(const struct commit *commit,
>  		strbuf_add(&fmt, b + 8, c - b - 8);
>
>  		strbuf_add(buf, src, b - src);
> -		format_commit_message(commit, fmt.buf, buf, &ctx);
> +		format_commit_message(commit, fmt.buf, buf, ctx);
>  		len -= c + 1 - src;
>  		src  = c + 1;
>  	}
> @@ -94,7 +91,7 @@ static void *object_file_to_archive(const struct archiver_args *args,
>  		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
>  		convert_to_working_tree(args->repo->index, path, buf.buf, buf.len, &buf, &meta);
>  		if (commit)
> -			format_subst(commit, buf.buf, buf.len, &buf);
> +			format_subst(commit, buf.buf, buf.len, &buf, args->pretty_ctx);
>  		buffer = strbuf_detach(&buf, &size);
>  		*sizep = size;
>  	}
> @@ -633,12 +630,19 @@ int write_archive(int argc, const char **argv, const char *prefix,
>  		  const char *name_hint, int remote)
>  {
>  	const struct archiver *ar = NULL;
> +	struct pretty_print_describe_status describe_status = {0};
> +	struct pretty_print_context ctx = {0};
>  	struct archiver_args args;
>  	int rc;
>
>  	git_config_get_bool("uploadarchive.allowunreachable", &remote_allow_unreachable);
>  	git_config(git_default_config, NULL);
>
> +	describe_status.max_invocations = 1;
> +	ctx.date_mode.type = DATE_NORMAL;
> +	ctx.abbrev = DEFAULT_ABBREV;
> +	ctx.describe_status = &describe_status;
> +	args.pretty_ctx = &ctx;
>  	args.repo = repo;
>  	args.prefix = prefix;
>  	string_list_init(&args.extra_files, 1);
> diff --git a/archive.h b/archive.h
> index 33551b7ee1..49fab71aaf 100644
> --- a/archive.h
> +++ b/archive.h
> @@ -5,6 +5,7 @@
>  #include "pathspec.h"
>
>  struct repository;
> +struct pretty_print_context;
>
>  struct archiver_args {
>  	struct repository *repo;
> @@ -22,6 +23,7 @@ struct archiver_args {
>  	unsigned int convert : 1;
>  	int compression_level;
>  	struct string_list extra_files;
> +	struct pretty_print_context *pretty_ctx;
>  };
>
>  /* main api */
> diff --git a/pretty.c b/pretty.c
> index c612d2ac9b..032e89cd4e 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -1247,6 +1247,14 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
>  		struct child_process cmd = CHILD_PROCESS_INIT;
>  		struct strbuf out = STRBUF_INIT;
>  		struct strbuf err = STRBUF_INIT;
> +		struct pretty_print_describe_status *describe_status;
> +
> +		describe_status = c->pretty_ctx->describe_status;
> +		if (describe_status) {
> +			if (!describe_status->max_invocations)
> +				return 0;
> +			describe_status->max_invocations--;
> +		}
>
>  		cmd.git_cmd = 1;
>  		strvec_push(&cmd.args, "describe");
> diff --git a/pretty.h b/pretty.h
> index 7ce6c0b437..27b15947ff 100644
> --- a/pretty.h
> +++ b/pretty.h
> @@ -23,6 +23,10 @@ enum cmit_fmt {
>  	CMIT_FMT_UNSPECIFIED
>  };
>
> +struct pretty_print_describe_status {
> +	unsigned int max_invocations;
> +};
> +
>  struct pretty_print_context {
>  	/*
>  	 * Callers should tweak these to change the behavior of pp_* functions.
> @@ -44,6 +48,7 @@ struct pretty_print_context {
>  	int color;
>  	struct ident_split *from_ident;
>  	unsigned encode_email_headers:1;
> +	struct pretty_print_describe_status *describe_status;
>
>  	/*
>  	 * Fields below here are manipulated internally by pp_* functions and
> diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh
> index e9aa97117a..1c9ce3956b 100755
> --- a/t/t5001-archive-attr.sh
> +++ b/t/t5001-archive-attr.sh
> @@ -128,4 +128,18 @@ test_expect_success 'export-subst' '
>  	test_cmp substfile2 archive/substfile2
>  '
>
> +test_expect_success 'export-subst expands %(describe) once' '
> +	echo "\$Format:%(describe)\$" >substfile3 &&
> +	echo "\$Format:%(describe)\$" >>substfile3 &&
> +	echo "\$Format:%(describe)${LF}%(describe)\$" >substfile4 &&
> +	git add substfile[34] &&
> +	git commit -m export-subst-describe &&
> +	git tag -m export-subst-describe export-subst-describe &&
> +	git archive HEAD >archive-describe.tar &&
> +	extract_tar_to_dir archive-describe &&
> +	desc=$(git describe) &&
> +	grep -F "$desc" archive-describe/substfile[34] >substituted &&
> +	test_line_count = 1 substituted
> +'
> +
>  test_done

This whole thing seems rather backwards as a solution to the "we run it
N times" problem I mentioned in <87k0r7a4sr.fsf@evledraar.gmail.com>.

Instead of taking the trouble of putting a limit in the
pretty_print_context so we don't call it N times for the same commit,
why not just put the strbuf with the result in that same struct?

Then you can have it millions of times, and it won't be any more
expensive than the other existing %(format) specifiers (actually cheaper
than most).

Or is there some edge case I'm missing here where "git archive" can
either be fed N commits and we share the context, or we share the
context across formatting different revisions in some cases?
René Scharfe March 2, 2021, 4 p.m. UTC | #5
Am 28.02.21 um 16:41 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sun, Feb 28 2021, René Scharfe. wrote:
>
>> Am 17.02.21 um 19:31 schrieb Jeff King:
>>> On Sun, Feb 14, 2021 at 11:10:57AM +0100, René Scharfe. wrote:
>>>
>>>> Allow restricting the tags used by the placeholder %(describe) with the
>>>> options match and exclude.  E.g. the following command describes the
>>>> current commit using official version tags, without those for release
>>>> candidates:
>>>>
>>>>    $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'
>>>
>>> An interesting side effect of this series is that it allows remote users
>>> asking for archives to fill in this data, too (by using export-subst
>>> placeholders). That includes servers allowing "git archive --remote",
>>> but also services like GitHub that will run git-archive on behalf of
>>> clients.
>>>
>>> I wonder what avenues for mischief this provides. Certainly using extra
>>> CPU to run git-describe.
>>
>> A repository can contain millions of files, each file can contain
>> millions of $Format:...$ sequences and each of them can contain millions
>> of %(describe) placeholders.  Each of them could have different match or
>> exclude args to prevent caching.  Allowing a single request to cause
>> trillions of calls of git describe sounds excessive.  Let's limit this.
>>
>> -- >8 --
>> Subject: [PATCH] archive: expand only a single %(describe) per archive
>>
>> Every %(describe) placeholder in $Format:...$ strings in files with the
>> attribute export-subst is expanded by calling git describe.  This can
>> potentially result in a lot of such calls per archive.  That's OK for
>> local repositories under control of the user of git archive, but could
>> be a problem for hosted repositories.
>>
>> Expand only a single %(describe) placeholder per archive for now to
>> avoid denial-of-service attacks.  We can make this limit configurable
>> later if needed, but let's start out simple.
>>
>> Reported-by: Jeff King <peff@peff.net>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>> ---
>>  Documentation/gitattributes.txt |  3 ++-
>>  archive.c                       | 16 ++++++++++------
>>  archive.h                       |  2 ++
>>  pretty.c                        |  8 ++++++++
>>  pretty.h                        |  5 +++++
>>  t/t5001-archive-attr.sh         | 14 ++++++++++++++
>>  6 files changed, 41 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index e84e104f93..0a60472bb5 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -1174,7 +1174,8 @@ tag then no replacement will be done.  The placeholders are the same
>>  as those for the option `--pretty=format:` of linkgit:git-log[1],
>>  except that they need to be wrapped like this: `$Format:PLACEHOLDERS$`
>>  in the file.  E.g. the string `$Format:%H$` will be replaced by the
>> -commit hash.
>> +commit hash.  However, only one `%(describe)` placeholder is expanded
>> +per archive to avoid denial-of-service attacks.
>>
>>
>>  Packing objects
>> diff --git a/archive.c b/archive.c
>> index 5919d9e505..2dd2236ae0 100644
>> --- a/archive.c
>> +++ b/archive.c
>> @@ -37,13 +37,10 @@ void init_archivers(void)
>>
>>  static void format_subst(const struct commit *commit,
>>  			 const char *src, size_t len,
>> -			 struct strbuf *buf)
>> +			 struct strbuf *buf, struct pretty_print_context *ctx)
>>  {
>>  	char *to_free = NULL;
>>  	struct strbuf fmt = STRBUF_INIT;
>> -	struct pretty_print_context ctx = {0};
>> -	ctx.date_mode.type = DATE_NORMAL;
>> -	ctx.abbrev = DEFAULT_ABBREV;
>>
>>  	if (src == buf->buf)
>>  		to_free = strbuf_detach(buf, NULL);
>> @@ -61,7 +58,7 @@ static void format_subst(const struct commit *commit,
>>  		strbuf_add(&fmt, b + 8, c - b - 8);
>>
>>  		strbuf_add(buf, src, b - src);
>> -		format_commit_message(commit, fmt.buf, buf, &ctx);
>> +		format_commit_message(commit, fmt.buf, buf, ctx);
>>  		len -= c + 1 - src;
>>  		src  = c + 1;
>>  	}
>> @@ -94,7 +91,7 @@ static void *object_file_to_archive(const struct archiver_args *args,
>>  		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
>>  		convert_to_working_tree(args->repo->index, path, buf.buf, buf.len, &buf, &meta);
>>  		if (commit)
>> -			format_subst(commit, buf.buf, buf.len, &buf);
>> +			format_subst(commit, buf.buf, buf.len, &buf, args->pretty_ctx);
>>  		buffer = strbuf_detach(&buf, &size);
>>  		*sizep = size;
>>  	}
>> @@ -633,12 +630,19 @@ int write_archive(int argc, const char **argv, const char *prefix,
>>  		  const char *name_hint, int remote)
>>  {
>>  	const struct archiver *ar = NULL;
>> +	struct pretty_print_describe_status describe_status = {0};
>> +	struct pretty_print_context ctx = {0};
>>  	struct archiver_args args;
>>  	int rc;
>>
>>  	git_config_get_bool("uploadarchive.allowunreachable", &remote_allow_unreachable);
>>  	git_config(git_default_config, NULL);
>>
>> +	describe_status.max_invocations = 1;
>> +	ctx.date_mode.type = DATE_NORMAL;
>> +	ctx.abbrev = DEFAULT_ABBREV;
>> +	ctx.describe_status = &describe_status;
>> +	args.pretty_ctx = &ctx;
>>  	args.repo = repo;
>>  	args.prefix = prefix;
>>  	string_list_init(&args.extra_files, 1);
>> diff --git a/archive.h b/archive.h
>> index 33551b7ee1..49fab71aaf 100644
>> --- a/archive.h
>> +++ b/archive.h
>> @@ -5,6 +5,7 @@
>>  #include "pathspec.h"
>>
>>  struct repository;
>> +struct pretty_print_context;
>>
>>  struct archiver_args {
>>  	struct repository *repo;
>> @@ -22,6 +23,7 @@ struct archiver_args {
>>  	unsigned int convert : 1;
>>  	int compression_level;
>>  	struct string_list extra_files;
>> +	struct pretty_print_context *pretty_ctx;
>>  };
>>
>>  /* main api */
>> diff --git a/pretty.c b/pretty.c
>> index c612d2ac9b..032e89cd4e 100644
>> --- a/pretty.c
>> +++ b/pretty.c
>> @@ -1247,6 +1247,14 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
>>  		struct child_process cmd = CHILD_PROCESS_INIT;
>>  		struct strbuf out = STRBUF_INIT;
>>  		struct strbuf err = STRBUF_INIT;
>> +		struct pretty_print_describe_status *describe_status;
>> +
>> +		describe_status = c->pretty_ctx->describe_status;
>> +		if (describe_status) {
>> +			if (!describe_status->max_invocations)
>> +				return 0;
>> +			describe_status->max_invocations--;
>> +		}
>>
>>  		cmd.git_cmd = 1;
>>  		strvec_push(&cmd.args, "describe");
>> diff --git a/pretty.h b/pretty.h
>> index 7ce6c0b437..27b15947ff 100644
>> --- a/pretty.h
>> +++ b/pretty.h
>> @@ -23,6 +23,10 @@ enum cmit_fmt {
>>  	CMIT_FMT_UNSPECIFIED
>>  };
>>
>> +struct pretty_print_describe_status {
>> +	unsigned int max_invocations;
>> +};
>> +
>>  struct pretty_print_context {
>>  	/*
>>  	 * Callers should tweak these to change the behavior of pp_* functions.
>> @@ -44,6 +48,7 @@ struct pretty_print_context {
>>  	int color;
>>  	struct ident_split *from_ident;
>>  	unsigned encode_email_headers:1;
>> +	struct pretty_print_describe_status *describe_status;
>>
>>  	/*
>>  	 * Fields below here are manipulated internally by pp_* functions and
>> diff --git a/t/t5001-archive-attr.sh b/t/t5001-archive-attr.sh
>> index e9aa97117a..1c9ce3956b 100755
>> --- a/t/t5001-archive-attr.sh
>> +++ b/t/t5001-archive-attr.sh
>> @@ -128,4 +128,18 @@ test_expect_success 'export-subst' '
>>  	test_cmp substfile2 archive/substfile2
>>  '
>>
>> +test_expect_success 'export-subst expands %(describe) once' '
>> +	echo "\$Format:%(describe)\$" >substfile3 &&
>> +	echo "\$Format:%(describe)\$" >>substfile3 &&
>> +	echo "\$Format:%(describe)${LF}%(describe)\$" >substfile4 &&
>> +	git add substfile[34] &&
>> +	git commit -m export-subst-describe &&
>> +	git tag -m export-subst-describe export-subst-describe &&
>> +	git archive HEAD >archive-describe.tar &&
>> +	extract_tar_to_dir archive-describe &&
>> +	desc=$(git describe) &&
>> +	grep -F "$desc" archive-describe/substfile[34] >substituted &&
>> +	test_line_count = 1 substituted
>> +'
>> +
>>  test_done
>
> This whole thing seems rather backwards as a solution to the "we run it
> N times" problem I mentioned in <87k0r7a4sr.fsf@evledraar.gmail.com>.

In the referenced message you pointed out that using %(describe) more
than once incurs the cost of a call to "git describe" each time and
asked for a way to avoid that.

This patch here prevents the second and later calls for "git archive".
It's not intended to speed up expanding repeated placeholders, but
rather to reject maliciously crafted placeholders even if they are not
the same.

So does that mean you have the requirement to use %(describe) more
than once in the same archive, and this DoS protection would be too
strict for you?  In that case we'd need a way to increase the limit,
e.g. a configuration variable or command line option.  Otherwise I
don't see the connection between your message and this patch.

Side note: Keeping the "git describe" running and feeding it commits
one by one through a pipe (with a new --stdin option, I suppose) as
mentioned in <66720982-04d6-3096-9ea2-ea5bc3fcd121@web.de> would
speed up expanding repeated placeholders as well.

> Instead of taking the trouble of putting a limit in the
> pretty_print_context so we don't call it N times for the same commit,
> why not just put the strbuf with the result in that same struct?
>
> Then you can have it millions of times, and it won't be any more
> expensive than the other existing %(format) specifiers (actually cheaper
> than most).

Each %(describe) placeholder can have a unique set of match or exclude
arguments.  Caching them all would increase the strength of a DoS
attack.  Caching a few of them would be OK, but is ineffective in
reducing the strength of the attack.

> Or is there some edge case I'm missing here where "git archive" can
> either be fed N commits and we share the context, or we share the
> context across formatting different revisions in some cases?

git archive currently works only on a single commit.

René
René Scharfe March 2, 2021, 4 p.m. UTC | #6
Am 01.03.21 um 18:54 schrieb Junio C Hamano:
> René Scharfe. <l.s.r@web.de> writes:
>
>> Am 17.02.21 um 19:31 schrieb Jeff King:
>>> On Sun, Feb 14, 2021 at 11:10:57AM +0100, René Scharfe. wrote:
>>>
>>>> Allow restricting the tags used by the placeholder %(describe) with the
>>>> options match and exclude.  E.g. the following command describes the
>>>> current commit using official version tags, without those for release
>>>> candidates:
>>>>
>>>>    $ git log -1 --format='%(describe:match=v[0-9]*,exclude=*rc*)'
>>>
>>> An interesting side effect of this series is that it allows remote users
>>> asking for archives to fill in this data, too (by using export-subst
>>> placeholders). That includes servers allowing "git archive --remote",
>>> but also services like GitHub that will run git-archive on behalf of
>>> clients.
>>>
>>> I wonder what avenues for mischief this provides. Certainly using extra
>>> CPU to run git-describe.
>>
>> A repository can contain millions of files, each file can contain
>> millions of $Format:...$ sequences and each of them can contain millions
>> of %(describe) placeholders.  Each of them could have different match or
>> exclude args to prevent caching.  Allowing a single request to cause
>> trillions of calls of git describe sounds excessive.  Let's limit this.
>
> An invocation of "git archive" would have to deal with a single
> commit, no?  I wonder if it is a more fruitful direction to go to
> teach format_subst() to "cache" the mapping from <placeholders> to
> <resulting-string> and reuse.

Yes, git archive only works on a single commit.  Caching cannot help
against a DoS attack using describe placeholders with different match
or exclude arguments.

René
René Scharfe March 6, 2021, 4:18 p.m. UTC | #7
Am 02.03.21 um 17:00 schrieb René Scharfe.:
> Am 28.02.21 um 16:41 schrieb Ævar Arnfjörð Bjarmason:
>> Instead of taking the trouble of putting a limit in the
>> pretty_print_context so we don't call it N times for the same commit,
>> why not just put the strbuf with the result in that same struct?
>>
>> Then you can have it millions of times, and it won't be any more
>> expensive than the other existing %(format) specifiers (actually cheaper
>> than most).
>
> Each %(describe) placeholder can have a unique set of match or exclude
> arguments.  Caching them all would increase the strength of a DoS
> attack.  Caching a few of them would be OK, but is ineffective in
> reducing the strength of the attack.

The script at the bottom creates archives that illustrate the issue. On
a repo generated with parameter 10 (10 files with 10 $Format:...$ with
10 %(describe) placeholders, so 1000 total), I get the following number
with v2.30.1, which ignores %(describe):

Benchmark #1: git archive HEAD
  Time (mean ± σ):       2.2 ms ±   0.2 ms    [User: 0.9 ms, System: 1.0 ms]
  Range (min … max):     1.8 ms …   2.8 ms    705 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

The version in next expands all placeholders and takes three orders of
magnitude longer:

Benchmark #1: git archive HEAD
  Time (mean ± σ):      2.300 s ±  0.003 s    [User: 819.0 ms, System: 1200.0 ms]
  Range (min … max):    2.293 s …  2.305 s    10 runs

The proposed patch to expand only a single placeholder gets the runtime
back under control:

Benchmark #1: git archive HEAD
  Time (mean ± σ):       4.7 ms ±   0.3 ms    [User: 1.8 ms, System: 2.2 ms]
  Range (min … max):     4.2 ms …   7.0 ms    451 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

Using parameter 100 takes about a second to create the repo, but the git
archive version in next already needs longer to tar it up than I'm
willing to wait.


#!/bin/sh
n=$1
mkdir $n
cd $n
git init
for i in $(seq $n)
do
	awk -v i=$i -v n=$n 'END {
		for (j = 0; j < n; j++) {
			print "$Format:"
			for (k = 0; k < n; k++) {
				print "%(describe:exclude=x-" i "-" j "-" k ")"
			}
			print "$"
		}
	}' </dev/null >"file$i"
done
git add file*
echo "file* export-subst" >.gitattributes
git add .gitattributes
git commit -m initial
for tagno in $(seq $n)
do
	git tag -m "$tagno" "tag$tagno"
done
diff mbox series

Patch

diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt
index bb8c05bc21..231010e6ef 100644
--- a/Documentation/pretty-formats.txt
+++ b/Documentation/pretty-formats.txt
@@ -208,8 +208,17 @@  The placeholders are:
 '%cs':: committer date, short format (`YYYY-MM-DD`)
 '%d':: ref names, like the --decorate option of linkgit:git-log[1]
 '%D':: ref names without the " (", ")" wrapping.
-'%(describe)':: human-readable name, like linkgit:git-describe[1];
-		empty string for undescribable commits
+'%(describe[:options])':: human-readable name, like
+			  linkgit:git-describe[1]; empty string for
+			  undescribable commits.  The `describe` string
+			  may be followed by a colon and zero or more
+			  comma-separated options.
++
+** 'match=<pattern>': Only consider tags matching the given
+   `glob(7)` pattern, excluding the "refs/tags/" prefix.
+** 'exclude=<pattern>': Do not consider tags matching the given
+   `glob(7)` pattern, excluding the "refs/tags/" prefix.
+
 '%S':: ref name given on the command line by which the commit was reached
        (like `git log --source`), only works with `git log`
 '%e':: encoding
diff --git a/pretty.c b/pretty.c
index a0c427fb61..c612d2ac9b 100644
--- a/pretty.c
+++ b/pretty.c
@@ -1150,6 +1150,34 @@  static int format_trailer_match_cb(const struct strbuf *key, void *ud)
 	return 0;
 }

+static size_t parse_describe_args(const char *start, struct strvec *args)
+{
+	const char *options[] = { "match", "exclude" };
+	const char *arg = start;
+
+	for (;;) {
+		const char *matched = NULL;
+		const char *argval;
+		size_t arglen = 0;
+		int i;
+
+		for (i = 0; i < ARRAY_SIZE(options); i++) {
+			if (match_placeholder_arg_value(arg, options[i], &arg,
+							&argval, &arglen)) {
+				matched = options[i];
+				break;
+			}
+		}
+		if (!matched)
+			break;
+
+		if (!arglen)
+			return 0;
+		strvec_pushf(args, "--%s=%.*s", matched, (int)arglen, argval);
+	}
+	return arg - start;
+}
+
 static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 				const char *placeholder,
 				void *context)
@@ -1215,20 +1243,31 @@  static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		return parse_padding_placeholder(placeholder, c);
 	}

-	if (skip_prefix(placeholder, "(describe)", &arg)) {
+	if (skip_prefix(placeholder, "(describe", &arg)) {
 		struct child_process cmd = CHILD_PROCESS_INIT;
 		struct strbuf out = STRBUF_INIT;
 		struct strbuf err = STRBUF_INIT;

 		cmd.git_cmd = 1;
 		strvec_push(&cmd.args, "describe");
+
+		if (*arg == ':') {
+			arg++;
+			arg += parse_describe_args(arg, &cmd.args);
+		}
+
+		if (*arg != ')') {
+			child_process_clear(&cmd);
+			return 0;
+		}
+
 		strvec_push(&cmd.args, oid_to_hex(&commit->object.oid));
 		pipe_command(&cmd, NULL, 0, &out, 0, &err, 0);
 		strbuf_rtrim(&out);
 		strbuf_addbuf(sb, &out);
 		strbuf_release(&out);
 		strbuf_release(&err);
-		return arg - placeholder;
+		return arg - placeholder + 1;
 	}

 	/* these depend on the commit */
diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh
index 5a44fa447d..7e36706212 100755
--- a/t/t4205-log-pretty-formats.sh
+++ b/t/t4205-log-pretty-formats.sh
@@ -972,4 +972,20 @@  test_expect_success '%(describe) vs git describe' '
 	test_must_be_empty err
 '

+test_expect_success '%(describe:match=...) vs git describe --match ...' '
+	test_when_finished "git tag -d tag-match" &&
+	git tag -a -m tagged tag-match&&
+	git describe --match "*-match" >expect &&
+	git log -1 --format="%(describe:match=*-match)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '%(describe:exclude=...) vs git describe --exclude ...' '
+	test_when_finished "git tag -d tag-exclude" &&
+	git tag -a -m tagged tag-exclude &&
+	git describe --exclude "*-exclude" >expect &&
+	git log -1 --format="%(describe:exclude=*-exclude)" >actual &&
+	test_cmp expect actual
+'
+
 test_done