am: don't pass strvec to apply_parse_options()

Message ID	baf93e4a-7f05-857c-e551-09675496c03c@web.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Message-ID: <baf93e4a-7f05-857c-e551-09675496c03c@web.de> Date: Tue, 13 Dec 2022 07:47:59 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.5.1 Content-Language: en-US To: Git List <git@vger.kernel.org> Cc: Junio C Hamano <gitster@pobox.com> From: =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de> Subject: [PATCH] am: don't pass strvec to apply_parse_options() Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable UI-OutboundReport: notjunk:1;M01:P0:ysCJK4c2YeU=;NDUwupsDmSwyoLf2exD+gM2lyD4 mPLmZSqGqq61r2LkebGUytJWw4blp6I8YCd4A4IIEGW9QT+bUf+grliDYRiYKQL+r+xa/COM4 oair7crvpF2BP584f25FW+/7bzdLUKuljt5+XjIGqqtPgQQxmg5JCi6tfPYNhBSV5Duiwr8YE qBKaueAIX2hrbFVdJODdIAoYTjw1Xsyud3LQc+8fEMEIeFoiGpaF8o4HWos6OGzBQ34LO4kj3 TmrSv0iugx4Xp1F4KRQjfZNQ7/7rMLsgn3SQ8JE3n2Tf4yDG+/7vlyZuX9HMuk8l1o18W80VR Jjqs86APbIe2lHoEJ4BcBN134fFR0gaopHR6jTsHWOzSmi+Esa1sEwHLBnPYUKfv2gfr0rDHk Os0p3uK5Ht8mBItxlVbUX3Z16v9wbZ4e+BbwQ7kaFunx8RhYtsfdxru7WULWvzosGyoEenQlc 2rUPh5+0nsomiY11OEzUK1Uv5U0FrG8zlpQykyafLpMOW8uk0N+ne/SNUelpYom1cA0tz8sko xfO+Aip2Nu0CMSwVad9fjpltGV0dIxccx1r1RTKffYkfOAHswlBIjM/BYrmHFrLImwGTe0+82 Vhz/lG9HWXxUJyRE96hk9j6j3hcG+wqLUhcjPGU51Wk/kfo9MWb1Q3RP5N5Ha/S+/rfxSDt/H EzmLozu5MJLjWvtlFjdYv7LnqLl2f12Wa4nbmcS/n9IylCXm6OIEyuPLtl9GVP735YMAyX9sl vsdeh/KAn9R+Qc+DAQVhN242VCVK5EFeQJfBXLWt0b0p0eqKdVj7e0/cHjubmfD4RLxSMuxI7 8vjYWKz+0boMDxZ+Oh+56iLnwsQAFiLgXt6nuz9zJusG4P+KYFQK3Xr+6tw+5Z6hOVz2IZMbP YdFQTtXOCngrd/8/ehNq46e6lGrRqEKznJ5Sf2vVc2xhDqDFw7hY8ytZ3nkMFw3UDKzMU/IFs ohuXFQ== Precedence: bulk
Series	am: don't pass strvec to apply_parse_options() \| expand am: don't pass strvec to apply_parse_options()

René Scharfe Dec. 13, 2022, 6:47 a.m. UTC

apply_parse_options() passes the array of argument strings to
parse_options(), which removes recognized options.  The removed strings
are not freed, though.

Make a copy of the strvec to pass to the function to retain the pointers
of its strings, so we release them all at the end.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 builtin/am.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--
2.38.2

Ævar Arnfjörð Bjarmason Dec. 13, 2022, 8:37 a.m. UTC | #1

On Tue, Dec 13 2022, René Scharfe wrote:

> apply_parse_options() passes the array of argument strings to
> parse_options(), which removes recognized options.  The removed strings
> are not freed, though.
>
> Make a copy of the strvec to pass to the function to retain the pointers
> of its strings, so we release them all at the end.
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  builtin/am.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/am.c b/builtin/am.c
> index 30c9b3a9cd..dddf1b9af0 100644
> --- a/builtin/am.c
> +++ b/builtin/am.c
> @@ -1476,6 +1476,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>  	int res, opts_left;
>  	int force_apply = 0;
>  	int options = 0;
> +	const char **apply_argv;
>
>  	if (init_apply_state(&apply_state, the_repository, NULL))
>  		BUG("init_apply_state() failed");
> @@ -1483,7 +1484,15 @@ static int run_apply(const struct am_state *state, const char *index_file)
>  	strvec_push(&apply_opts, "apply");
>  	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>
> -	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
> +	/*
> +	 * Build a copy that apply_parse_options() can rearrange.
> +	 * apply_opts.v keeps referencing the allocated strings for
> +	 * strvec_clear() to release.
> +	 */
> +	ALLOC_ARRAY(apply_argv, apply_opts.nr);
> +	COPY_ARRAY(apply_argv, apply_opts.v, apply_opts.nr);
> +
> +	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
>  					&apply_state, &force_apply, &options,
>  					NULL);
>
> @@ -1513,6 +1522,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>  	strvec_clear(&apply_paths);
>  	strvec_clear(&apply_opts);
>  	clear_apply_state(&apply_state);
> +	free(apply_argv);
>
>  	if (res)
>  		return res;

I don't mind this going in, but it really feels like a bit of a dirty
hack.

We have widespread leaks all over the place due to this
idiom. I.e. parse_options() and a couple of other APIs expect that they
can munge the "argv", which is fine if it arrives via main(), but not if
we're editing our own strvecs.

I think less of a hack is to teach the eventual parse_options() that
when it munges it it should free() it. I did that for the revisions API
in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
need free()-ing, 2022-08-02).

What do you think?

René Scharfe Dec. 13, 2022, 6:31 p.m. UTC | #2

Am 13.12.22 um 09:37 schrieb Ævar Arnfjörð Bjarmason:
>
> On Tue, Dec 13 2022, René Scharfe wrote:
>
>> apply_parse_options() passes the array of argument strings to
>> parse_options(), which removes recognized options.  The removed strings
>> are not freed, though.
>>
>> Make a copy of the strvec to pass to the function to retain the pointers
>> of its strings, so we release them all at the end.
>>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>> ---
>>  builtin/am.c | 12 +++++++++++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/builtin/am.c b/builtin/am.c
>> index 30c9b3a9cd..dddf1b9af0 100644
>> --- a/builtin/am.c
>> +++ b/builtin/am.c
>> @@ -1476,6 +1476,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>  	int res, opts_left;
>>  	int force_apply = 0;
>>  	int options = 0;
>> +	const char **apply_argv;
>>
>>  	if (init_apply_state(&apply_state, the_repository, NULL))
>>  		BUG("init_apply_state() failed");
>> @@ -1483,7 +1484,15 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>  	strvec_push(&apply_opts, "apply");
>>  	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>>
>> -	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
>> +	/*
>> +	 * Build a copy that apply_parse_options() can rearrange.
>> +	 * apply_opts.v keeps referencing the allocated strings for
>> +	 * strvec_clear() to release.
>> +	 */
>> +	ALLOC_ARRAY(apply_argv, apply_opts.nr);
>> +	COPY_ARRAY(apply_argv, apply_opts.v, apply_opts.nr);
>> +
>> +	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
>>  					&apply_state, &force_apply, &options,
>>  					NULL);
>>
>> @@ -1513,6 +1522,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>  	strvec_clear(&apply_paths);
>>  	strvec_clear(&apply_opts);
>>  	clear_apply_state(&apply_state);
>> +	free(apply_argv);
>>
>>  	if (res)
>>  		return res;
>
> I don't mind this going in, but it really feels like a bit of a dirty
> hack.
>
> We have widespread leaks all over the place due to this
> idiom. I.e. parse_options() and a couple of other APIs expect that they
> can munge the "argv", which is fine if it arrives via main(), but not if
> we're editing our own strvecs.

Where?  A quick "git grep 'parse_options.*nr'" turns up only this place
as one that passes a strvec to parse_options.

> I think less of a hack is to teach the eventual parse_options() that
> when it munges it it should free() it. I did that for the revisions API
> in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
> need free()-ing, 2022-08-02).
>
> What do you think?

Generating string lists and then parsing them is weird.  When calls have
to cross a process boundary then we have no choice, but in-process we
shouldn't have to lower our request to an intermediate text format.  git
am does it anyway because it writes its options to a file and reads them
back when it resumes with --continue, IIUC.

I hope that is and will be the only place that uses parse_options() with
a strvec -- and then we don't have to change that function.

If this pattern is used more widely then we could package the copying
done by this patch somehow, e.g. by adding a strvec_parse_options()
that wraps the real thing.

If we have to change parse_options() at all then I'd prefer it to not
free() anything (to keep it usable with main()'s parameters), but to
reorder in a non-destructive way.  That would mean keeping the NULL
sentinel where it is, and making sure all callers use only the returned
argc to determine which arguments parse_options() didn't recognize.

René

Ævar Arnfjörð Bjarmason Dec. 14, 2022, 8:44 a.m. UTC | #3

On Tue, Dec 13 2022, René Scharfe wrote:

> Am 13.12.22 um 09:37 schrieb Ævar Arnfjörð Bjarmason:
>>
>> On Tue, Dec 13 2022, René Scharfe wrote:
>>
>>> apply_parse_options() passes the array of argument strings to
>>> parse_options(), which removes recognized options.  The removed strings
>>> are not freed, though.
>>>
>>> Make a copy of the strvec to pass to the function to retain the pointers
>>> of its strings, so we release them all at the end.
>>>
>>> Signed-off-by: René Scharfe <l.s.r@web.de>
>>> ---
>>>  builtin/am.c | 12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/builtin/am.c b/builtin/am.c
>>> index 30c9b3a9cd..dddf1b9af0 100644
>>> --- a/builtin/am.c
>>> +++ b/builtin/am.c
>>> @@ -1476,6 +1476,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	int res, opts_left;
>>>  	int force_apply = 0;
>>>  	int options = 0;
>>> +	const char **apply_argv;
>>>
>>>  	if (init_apply_state(&apply_state, the_repository, NULL))
>>>  		BUG("init_apply_state() failed");
>>> @@ -1483,7 +1484,15 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	strvec_push(&apply_opts, "apply");
>>>  	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>>>
>>> -	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
>>> +	/*
>>> +	 * Build a copy that apply_parse_options() can rearrange.
>>> +	 * apply_opts.v keeps referencing the allocated strings for
>>> +	 * strvec_clear() to release.
>>> +	 */
>>> +	ALLOC_ARRAY(apply_argv, apply_opts.nr);
>>> +	COPY_ARRAY(apply_argv, apply_opts.v, apply_opts.nr);
>>> +
>>> +	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
>>>  					&apply_state, &force_apply, &options,
>>>  					NULL);
>>>
>>> @@ -1513,6 +1522,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	strvec_clear(&apply_paths);
>>>  	strvec_clear(&apply_opts);
>>>  	clear_apply_state(&apply_state);
>>> +	free(apply_argv);
>>>
>>>  	if (res)
>>>  		return res;
>>
>> I don't mind this going in, but it really feels like a bit of a dirty
>> hack.
>>
>> We have widespread leaks all over the place due to this
>> idiom. I.e. parse_options() and a couple of other APIs expect that they
>> can munge the "argv", which is fine if it arrives via main(), but not if
>> we're editing our own strvecs.
>
> Where?  A quick "git grep 'parse_options.*nr'" turns up only this place
> as one that passes a strvec to parse_options.

Sorry about the vagueness, this was from memory and I didn't have a full
list handy.

First, that grep isn't going to give you what you want, since it only
catches cases where we're passing the "strvec" to the "parse_options()"
directly.

But if you look at e.g. how cmd_stash() invokes push_stash() or
cmd_describe() invoking cmd_name_rev() you'll see callers where we're
potentially losing the strvec due to parse_options().

"Potentially" because in both those cases we call that API, but in the
latter case we've got a missing strvec_clear(), so maybe that's all
that's needed (it doesn't always munge the argv).

>> I think less of a hack is to teach the eventual parse_options() that
>> when it munges it it should free() it. I did that for the revisions API
>> in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
>> need free()-ing, 2022-08-02).
>>
>> What do you think?
>
> Generating string lists and then parsing them is weird.  When calls have
> to cross a process boundary then we have no choice, but in-process we
> shouldn't have to lower our request to an intermediate text format.  git
> am does it anyway because it writes its options to a file and reads them
> back when it resumes with --continue, IIUC.

I agree, but making all those use a nicer API would be a much bigger
change.

> I hope that is and will be the only place that uses parse_options() with
> a strvec -- and then we don't have to change that function.
>
> If this pattern is used more widely then we could package the copying
> done by this patch somehow, e.g. by adding a strvec_parse_options()
> that wraps the real thing.

The reason I'm encouraging you to look at some of the other strvec leaks
is to see if you can think of a pattern that fits these various leaky
callers.

So e.g. a strvec_parse_options() won't work for the ones noted above
where we've lost track of it being a "struct strvec" in the first place.

I have a local version somewhere where I did (pseudocode):

	struct strvec vec = STRVEC_INIT;
	struct string_list to_free = STRING_LIST_INIT_DUP;

        // populate vec...
        for (size_t i = 0; i < vec.nr; i++)
		string_list_append_nodup(&to_free, v[i]):
	// call parse_options(), or otherwise munge "vec"...
	free(strvec_detach(&vec));
        string_list_clear(&to_free, 0);

I.e. you snapshot the members of the "vec" before it's munged, and
free() those via a "struct string_list".

Then you don't strvec_clear(), but instead free() the container, its
members being free'd by the string_list.

Here's a (not yet in-tree) patch that uses that:
https://lore.kernel.org/git/patch-10.10-81f138e460c-20221017T115544Z-avarab@gmail.com/

I think that's more reliable & general than the pattern you're adding
here.

In your version the only reason we're not segfaulting is because the
parse_options() consumes all the arguments, but that's not something you
can generally rely on with parse_options().

Also, and maybe I'm missing something, but wouldn't this be functionally
the same as your implementation:
	
	diff --git a/builtin/am.c b/builtin/am.c
	index 30c9b3a9cd7..e65262cbb21 100644
	--- a/builtin/am.c
	+++ b/builtin/am.c
	@@ -1476,11 +1476,13 @@ static int run_apply(const struct am_state *state, const char *index_file)
	 	int res, opts_left;
	 	int force_apply = 0;
	 	int options = 0;
	+	char *to_free;
	 
	 	if (init_apply_state(&apply_state, the_repository, NULL))
	 		BUG("init_apply_state() failed");
	 
	 	strvec_push(&apply_opts, "apply");
	+	to_free = (char *)apply_opts.v[0];
	 	strvec_pushv(&apply_opts, state->git_apply_opts.v);
	 
	 	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
	@@ -1489,6 +1491,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
	 
	 	if (opts_left != 0)
	 		die("unknown option passed through to git apply");
	+	free(to_free);
	 
	 	if (index_file) {
	 		apply_state.index_file = index_file;

I.e. we leak the strvec_push() of the "apply" literal, but for the rest
we append the "state->git_apply_opts.v", which we already free()
elsewhere.

So actually, aren't we really just fumbling our way towards the "nodup"
interface that the "struct string_list" has, and we should add the same
to the "struct strvec".

This seems to also fix all the leaks you fixed, but without any of the
copying:

	diff --git a/builtin/am.c b/builtin/am.c
	index 30c9b3a9cd7..691ec1d152d 100644
	--- a/builtin/am.c
	+++ b/builtin/am.c
	@@ -1471,7 +1471,7 @@ static int parse_mail_rebase(struct am_state *state, const char *mail)
	 static int run_apply(const struct am_state *state, const char *index_file)
	 {
	 	struct strvec apply_paths = STRVEC_INIT;
	-	struct strvec apply_opts = STRVEC_INIT;
	+	struct strvec apply_opts = STRVEC_INIT_NODUP;
	 	struct apply_state apply_state;
	 	int res, opts_left;
	 	int force_apply = 0;
	diff --git a/strvec.c b/strvec.c
	index 61a76ce6cb9..efdc47a5854 100644
	--- a/strvec.c
	+++ b/strvec.c
	@@ -22,7 +22,9 @@ static void strvec_push_nodup(struct strvec *array, const char *value)
	 
	 const char *strvec_push(struct strvec *array, const char *value)
	 {
	-	strvec_push_nodup(array, xstrdup(value));
	+	const char *to_push = array->nodup_strings ? value : xstrdup(value);
	+
	+	strvec_push_nodup(array, to_push);
	 	return array->v[array->nr - 1];
	 }
	 
	@@ -31,6 +33,9 @@ const char *strvec_pushf(struct strvec *array, const char *fmt, ...)
	 	va_list ap;
	 	struct strbuf v = STRBUF_INIT;
	 
	+	if (array->nodup_strings)
	+		BUG("cannot strvec_pushf() on a 'nodup' strvec");
	+
	 	va_start(ap, fmt);
	 	strbuf_vaddf(&v, fmt, ap);
	 	va_end(ap);
	@@ -67,6 +72,9 @@ void strvec_pop(struct strvec *array)
	 
	 void strvec_split(struct strvec *array, const char *to_split)
	 {
	+	if (array->nodup_strings)
	+		BUG("cannot strvec_split() on a 'nodup' strvec");
	+
	 	while (isspace(*to_split))
	 		to_split++;
	 	for (;;) {
	@@ -90,7 +98,7 @@ void strvec_clear(struct strvec *array)
	 	if (array->v != empty_strvec) {
	 		int i;
	 		for (i = 0; i < array->nr; i++)
	-			free((char *)array->v[i]);
	+			free(array->nodup_strings ? NULL : (char *)array->v[i]);
	 		free(array->v);
	 	}
	 	strvec_init(array);
	diff --git a/strvec.h b/strvec.h
	index 9f55c8766ba..ac6e2c04cea 100644
	--- a/strvec.h
	+++ b/strvec.h
	@@ -31,12 +31,18 @@ struct strvec {
	 	const char **v;
	 	size_t nr;
	 	size_t alloc;
	+	unsigned int nodup_strings:1;
	 };
	 
	 #define STRVEC_INIT { \
	 	.v = empty_strvec, \
	 }
	 
	+#define STRVEC_INIT_NODUP { \
	+	.v = empty_strvec, \
	+	.nodup_strings = 1, \
	+}
	+
	 /**
	  * Initialize an array. This is no different than assigning from
	  * `STRVEC_INIT`.

In any case, for this change (and leak fixes in general), could you
please also mark the now-passing leak-free tests. This fails after your
patch, but passes before.

The failure is a "good" one, as they're now leak-free, but should be
marked as such:

	GIT_TEST_PASSING_SANITIZE_LEAK=check GIT_TEST_SANITIZE_LEAK_LOG=true \
	make SANITIZE=leak test T="t4256-am-format-flowed.sh t0023-crlf-am.sh t4254-am-corrupt.sh t4257-am-interactive.sh t5403-post-checkout-hook.sh t4152-am-subjects.sh"

> If we have to change parse_options() at all then I'd prefer it to not
> free() anything (to keep it usable with main()'s parameters),...

I'm suggesting passing a flag to it to have it free() things it
NULL's.

Maybe that's a bad idea, but I don't see the incompatibility with
main(), for those we just wouldn't pass the flag.

It does have the inherent limitation that you couldn't mix strvec and
non-strvec parameters, i.e. if some of your elements are dup'd but
others aren't you'd need to arrange to have them all dup'd.

But I don't think that's an issue in practice, either we pass the fully
dup'd version, or main() parameters.

> but to
> reorder in a non-destructive way.  That would mean keeping the NULL
> sentinel where it is, and making sure all callers use only the returned
> argc to determine which arguments parse_options() didn't recognize.

I think this would work if parse_options() wasn't clobbering those
elements in some cases, and replacing them with new ones, but doesn't it
do that e.g. in parse_options_step()?

It also seems like a much more fragile change to try to ensure that
nothing uses the "argv" again, but only looks at the updated
"argc".

Both that & adding a "const" to it would be a huge change across the
codebase, so I'd like to avoid them, but at least the "const" method
would be helped along by the compiler.

René Scharfe Dec. 17, 2022, 12:46 p.m. UTC | #4

Am 14.12.22 um 09:44 schrieb Ævar Arnfjörð Bjarmason:
>
> On Tue, Dec 13 2022, René Scharfe wrote:
>
>> Am 13.12.22 um 09:37 schrieb Ævar Arnfjörð Bjarmason:
>>>
>>> On Tue, Dec 13 2022, René Scharfe wrote:
>>>
>>>> apply_parse_options() passes the array of argument strings to
>>>> parse_options(), which removes recognized options.  The removed strings
>>>> are not freed, though.
>>>>
>>>> Make a copy of the strvec to pass to the function to retain the pointers
>>>> of its strings, so we release them all at the end.
>>>>
>>>> Signed-off-by: René Scharfe <l.s.r@web.de>
>>>> ---
>>>>  builtin/am.c | 12 +++++++++++-
>>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/builtin/am.c b/builtin/am.c
>>>> index 30c9b3a9cd..dddf1b9af0 100644
>>>> --- a/builtin/am.c
>>>> +++ b/builtin/am.c
>>>> @@ -1476,6 +1476,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>>  	int res, opts_left;
>>>>  	int force_apply = 0;
>>>>  	int options = 0;
>>>> +	const char **apply_argv;
>>>>
>>>>  	if (init_apply_state(&apply_state, the_repository, NULL))
>>>>  		BUG("init_apply_state() failed");
>>>> @@ -1483,7 +1484,15 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>>  	strvec_push(&apply_opts, "apply");
>>>>  	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>>>>
>>>> -	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
>>>> +	/*
>>>> +	 * Build a copy that apply_parse_options() can rearrange.
>>>> +	 * apply_opts.v keeps referencing the allocated strings for
>>>> +	 * strvec_clear() to release.
>>>> +	 */
>>>> +	ALLOC_ARRAY(apply_argv, apply_opts.nr);
>>>> +	COPY_ARRAY(apply_argv, apply_opts.v, apply_opts.nr);
>>>> +
>>>> +	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
>>>>  					&apply_state, &force_apply, &options,
>>>>  					NULL);
>>>>
>>>> @@ -1513,6 +1522,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>>  	strvec_clear(&apply_paths);
>>>>  	strvec_clear(&apply_opts);
>>>>  	clear_apply_state(&apply_state);
>>>> +	free(apply_argv);
>>>>
>>>>  	if (res)
>>>>  		return res;
>>>
>>> I don't mind this going in, but it really feels like a bit of a dirty
>>> hack.

What's so dirty about it, by the way?

>>>
>>> We have widespread leaks all over the place due to this
>>> idiom. I.e. parse_options() and a couple of other APIs expect that they
>>> can munge the "argv", which is fine if it arrives via main(), but not if
>>> we're editing our own strvecs.
>>
>> Where?  A quick "git grep 'parse_options.*nr'" turns up only this place
>> as one that passes a strvec to parse_options.
>
> Sorry about the vagueness, this was from memory and I didn't have a full
> list handy.
>
> First, that grep isn't going to give you what you want, since it only
> catches cases where we're passing the "strvec" to the "parse_options()"
> directly.
>
> But if you look at e.g. how cmd_stash() invokes push_stash() or
> cmd_describe() invoking cmd_name_rev() you'll see callers where we're
> potentially losing the strvec due to parse_options().
>
> "Potentially" because in both those cases we call that API, but in the
> latter case we've got a missing strvec_clear(), so maybe that's all
> that's needed (it doesn't always munge the argv).

OK.

>>> I think less of a hack is to teach the eventual parse_options() that
>>> when it munges it it should free() it. I did that for the revisions API
>>> in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
>>> need free()-ing, 2022-08-02).
>>>
>>> What do you think?
>>
>> Generating string lists and then parsing them is weird.  When calls have
>> to cross a process boundary then we have no choice, but in-process we
>> shouldn't have to lower our request to an intermediate text format.  git
>> am does it anyway because it writes its options to a file and reads them
>> back when it resumes with --continue, IIUC.
>
> I agree, but making all those use a nicer API would be a much bigger
> change.

Indeed.

>> I hope that is and will be the only place that uses parse_options() with
>> a strvec -- and then we don't have to change that function.
>>
>> If this pattern is used more widely then we could package the copying
>> done by this patch somehow, e.g. by adding a strvec_parse_options()
>> that wraps the real thing.
>
> The reason I'm encouraging you to look at some of the other strvec leaks
> is to see if you can think of a pattern that fits these various leaky
> callers.
>
> So e.g. a strvec_parse_options() won't work for the ones noted above
> where we've lost track of it being a "struct strvec" in the first place.

Right.  It would work if we used a strvec in place of *every* argc/argv
pair, but that seems a but heavy a change.

> I have a local version somewhere where I did (pseudocode):
>
> 	struct strvec vec = STRVEC_INIT;
> 	struct string_list to_free = STRING_LIST_INIT_DUP;
>
>         // populate vec...
>         for (size_t i = 0; i < vec.nr; i++)
> 		string_list_append_nodup(&to_free, v[i]):
> 	// call parse_options(), or otherwise munge "vec"...
> 	free(strvec_detach(&vec));
>         string_list_clear(&to_free, 0);
>
> I.e. you snapshot the members of the "vec" before it's munged, and
> free() those via a "struct string_list".
>
> Then you don't strvec_clear(), but instead free() the container, its
> members being free'd by the string_list.

So the same as my patch, just with a string_list instead of a string
array and ownership juggling.

> Here's a (not yet in-tree) patch that uses that:
> https://lore.kernel.org/git/patch-10.10-81f138e460c-20221017T115544Z-avarab@gmail.com/
>
> I think that's more reliable & general than the pattern you're adding
> here.

How so?

> In your version the only reason we're not segfaulting is because the
> parse_options() consumes all the arguments, but that's not something you
> can generally rely on with parse_options().

Could you please elaborate?  What do you mean with "consume"?  Or could
you give an example of a segfault do to partial consumption?

I see one caveat: We should keep a NULL sentinel at the end when we
hand the copy to arbitrary functions, as some of them don't even look
at argc.  Is that what you meant?

> Also, and maybe I'm missing something, but wouldn't this be functionally
> the same as your implementation:
>
> 	diff --git a/builtin/am.c b/builtin/am.c
> 	index 30c9b3a9cd7..e65262cbb21 100644
> 	--- a/builtin/am.c
> 	+++ b/builtin/am.c
> 	@@ -1476,11 +1476,13 @@ static int run_apply(const struct am_state *state, const char *index_file)
> 	 	int res, opts_left;
> 	 	int force_apply = 0;
> 	 	int options = 0;
> 	+	char *to_free;
>
> 	 	if (init_apply_state(&apply_state, the_repository, NULL))
> 	 		BUG("init_apply_state() failed");
>
> 	 	strvec_push(&apply_opts, "apply");
> 	+	to_free = (char *)apply_opts.v[0];
> 	 	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>
> 	 	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
> 	@@ -1489,6 +1491,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>
> 	 	if (opts_left != 0)
> 	 		die("unknown option passed through to git apply");
> 	+	free(to_free);
>
> 	 	if (index_file) {
> 	 		apply_state.index_file = index_file;
>
> I.e. we leak the strvec_push() of the "apply" literal, but for the rest
> we append the "state->git_apply_opts.v", which we already free()
> elsewhere.

state->git_apply_opts is released elsewhere, but the copies of its elements
in apply_opts are leaked here.

> So actually, aren't we really just fumbling our way towards the "nodup"
> interface that the "struct string_list" has, and we should add the same
> to the "struct strvec".
>
> This seems to also fix all the leaks you fixed, but without any of the
> copying:
>
> 	diff --git a/builtin/am.c b/builtin/am.c
> 	index 30c9b3a9cd7..691ec1d152d 100644
> 	--- a/builtin/am.c
> 	+++ b/builtin/am.c
> 	@@ -1471,7 +1471,7 @@ static int parse_mail_rebase(struct am_state *state, const char *mail)
> 	 static int run_apply(const struct am_state *state, const char *index_file)
> 	 {
> 	 	struct strvec apply_paths = STRVEC_INIT;
> 	-	struct strvec apply_opts = STRVEC_INIT;
> 	+	struct strvec apply_opts = STRVEC_INIT_NODUP;
> 	 	struct apply_state apply_state;
> 	 	int res, opts_left;
> 	 	int force_apply = 0;
> 	diff --git a/strvec.c b/strvec.c
> 	index 61a76ce6cb9..efdc47a5854 100644
> 	--- a/strvec.c
> 	+++ b/strvec.c
> 	@@ -22,7 +22,9 @@ static void strvec_push_nodup(struct strvec *array, const char *value)
>
> 	 const char *strvec_push(struct strvec *array, const char *value)
> 	 {
> 	-	strvec_push_nodup(array, xstrdup(value));
> 	+	const char *to_push = array->nodup_strings ? value : xstrdup(value);
> 	+
> 	+	strvec_push_nodup(array, to_push);
> 	 	return array->v[array->nr - 1];
> 	 }
>
> 	@@ -31,6 +33,9 @@ const char *strvec_pushf(struct strvec *array, const char *fmt, ...)
> 	 	va_list ap;
> 	 	struct strbuf v = STRBUF_INIT;
>
> 	+	if (array->nodup_strings)
> 	+		BUG("cannot strvec_pushf() on a 'nodup' strvec");
> 	+
> 	 	va_start(ap, fmt);
> 	 	strbuf_vaddf(&v, fmt, ap);
> 	 	va_end(ap);
> 	@@ -67,6 +72,9 @@ void strvec_pop(struct strvec *array)
>
> 	 void strvec_split(struct strvec *array, const char *to_split)
> 	 {
> 	+	if (array->nodup_strings)
> 	+		BUG("cannot strvec_split() on a 'nodup' strvec");
> 	+
> 	 	while (isspace(*to_split))
> 	 		to_split++;
> 	 	for (;;) {
> 	@@ -90,7 +98,7 @@ void strvec_clear(struct strvec *array)
> 	 	if (array->v != empty_strvec) {
> 	 		int i;
> 	 		for (i = 0; i < array->nr; i++)
> 	-			free((char *)array->v[i]);
> 	+			free(array->nodup_strings ? NULL : (char *)array->v[i]);
> 	 		free(array->v);
> 	 	}
> 	 	strvec_init(array);
> 	diff --git a/strvec.h b/strvec.h
> 	index 9f55c8766ba..ac6e2c04cea 100644
> 	--- a/strvec.h
> 	+++ b/strvec.h
> 	@@ -31,12 +31,18 @@ struct strvec {
> 	 	const char **v;
> 	 	size_t nr;
> 	 	size_t alloc;
> 	+	unsigned int nodup_strings:1;
> 	 };
>
> 	 #define STRVEC_INIT { \
> 	 	.v = empty_strvec, \
> 	 }
>
> 	+#define STRVEC_INIT_NODUP { \
> 	+	.v = empty_strvec, \
> 	+	.nodup_strings = 1, \
> 	+}
> 	+
> 	 /**
> 	  * Initialize an array. This is no different than assigning from
> 	  * `STRVEC_INIT`.

We can do it, but it's quite complicated overall.  strvec is with its
"duplicate everything" stance is easy to understand and reason about.
Adding a nodup variant will remove that feature.

> In any case, for this change (and leak fixes in general), could you
> please also mark the now-passing leak-free tests. This fails after your
> patch, but passes before.
>
> The failure is a "good" one, as they're now leak-free, but should be
> marked as such:
>
> 	GIT_TEST_PASSING_SANITIZE_LEAK=check GIT_TEST_SANITIZE_LEAK_LOG=true \
> 	make SANITIZE=leak test T="t4256-am-format-flowed.sh t0023-crlf-am.sh t4254-am-corrupt.sh t4257-am-interactive.sh t5403-post-checkout-hook.sh t4152-am-subjects.sh"
>

Well, it might be my laziness talking, but can we avoid that?  There are
lots of leaks present in tests and plugging the last one for a particular
test script holds no particular significance.  Wouldn't it be better to
update the annotations at the end of the release cycle, similar to
translations?  Or automate it via some kind of CI job?

In any case: Requiring to run the test suite with LeakSanitizer for leak
fixes would be a hurdle that I'd jump only reluctantly.  (LeakSanitizer
is not supported on my development system, an M1 Mac.)

>> If we have to change parse_options() at all then I'd prefer it to not
>> free() anything (to keep it usable with main()'s parameters),...
>
> I'm suggesting passing a flag to it to have it free() things it
> NULL's.
>
> Maybe that's a bad idea, but I don't see the incompatibility with
> main(), for those we just wouldn't pass the flag.
>
> It does have the inherent limitation that you couldn't mix strvec and
> non-strvec parameters, i.e. if some of your elements are dup'd but
> others aren't you'd need to arrange to have them all dup'd.
>
> But I don't think that's an issue in practice, either we pass the fully
> dup'd version, or main() parameters.

Yes, but your examples at the top show that sometimes the parse_options()
call is hidden by several layers of other functions.  We better have a
more general solution.

>> but to
>> reorder in a non-destructive way.  That would mean keeping the NULL
>> sentinel where it is, and making sure all callers use only the returned
>> argc to determine which arguments parse_options() didn't recognize.
>
> I think this would work if parse_options() wasn't clobbering those
> elements in some cases, and replacing them with new ones, but doesn't it
> do that e.g. in parse_options_step()?

You mean the "fake a short option thing", right?  That would have to be
addressed somehow.

> It also seems like a much more fragile change to try to ensure that
> nothing uses the "argv" again, but only looks at the updated
> "argc".
>
> Both that & adding a "const" to it would be a huge change across the
> codebase, so I'd like to avoid them, but at least the "const" method
> would be helped along by the compiler.

Agreed.

René

Jeff King Dec. 17, 2022, 1:24 p.m. UTC | #5

On Tue, Dec 13, 2022 at 07:31:13PM +0100, René Scharfe wrote:

> > I think less of a hack is to teach the eventual parse_options() that
> > when it munges it it should free() it. I did that for the revisions API
> > in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
> > need free()-ing, 2022-08-02).
> >
> > What do you think?
> 
> Generating string lists and then parsing them is weird.  When calls have
> to cross a process boundary then we have no choice, but in-process we
> shouldn't have to lower our request to an intermediate text format.  git
> am does it anyway because it writes its options to a file and reads them
> back when it resumes with --continue, IIUC.

The argument has been made[1] in the past that the public API for the
revision machinery is not poking bits in the rev_info struct yourself,
but passing the appropriate options to setup_revisions().

And I can see the point; if option "--foo" requires twiddling both
revs.foo and revs.bar, then we have no way to enforce that callers have
remembered to do both. But if the only code which handles this is the
parser for "--foo", then we're OK.

In a non-C language, we'd probably mark "foo" and "bar" as private and
provide a method to enable the option. We could provide a function, but
we'd have no compiler help to ensure that it was used over fiddling the
bits. Possibly calling them "foo_private" would be sufficient to make
people think twice about tweaking them, though.

[1] Apologies for the passive voice; I didn't want to muddle the
    paragraph by discussing who and when. But I know this is an argument
    Junio has made; I don't know if he still stands by it these days
    (there is quite a lot of field-twiddling of rev_info by now). I'm
    not sure to what degree I agree with it myself, but I thought it was
    worth mentioning here.

> If we have to change parse_options() at all then I'd prefer it to not
> free() anything (to keep it usable with main()'s parameters), but to
> reorder in a non-destructive way.  That would mean keeping the NULL
> sentinel where it is, and making sure all callers use only the returned
> argc to determine which arguments parse_options() didn't recognize.

My findings with -Wunused-parameter show that there are quite a lot of
places that take argv/argc but assume the NULL-termination of the argv.

If we are just re-ordering argv, though, it feels like this could still
work with a strvec. Right now a strvec with "nr" items will free items 0
through nr-1, assuming that v[nr] is its NULL invariant. But it could
free v[nr], too, in case the NULL was swapped into an earlier position.

It's a little weird already, because that swap has violated the
invariant, so trying to strvec_push() onto it would cause confusing
results. But if the general use case is to pass the strvec to
parse_options(), get reordered, and then clear() it, it should work. If
you wanted to get really fancy, push() et al could double-check the
invariant and BUG().

-Peff

René Scharfe Dec. 17, 2022, 4:07 p.m. UTC | #6

Am 17.12.22 um 14:24 schrieb Jeff King:
> On Tue, Dec 13, 2022 at 07:31:13PM +0100, René Scharfe wrote:
>
>>> I think less of a hack is to teach the eventual parse_options() that
>>> when it munges it it should free() it. I did that for the revisions API
>>> in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
>>> need free()-ing, 2022-08-02).
>>>
>>> What do you think?
>>
>> Generating string lists and then parsing them is weird.  When calls have
>> to cross a process boundary then we have no choice, but in-process we
>> shouldn't have to lower our request to an intermediate text format.  git
>> am does it anyway because it writes its options to a file and reads them
>> back when it resumes with --continue, IIUC.
>
> The argument has been made[1] in the past that the public API for the
> revision machinery is not poking bits in the rev_info struct yourself,
> but passing the appropriate options to setup_revisions().

My point was that we are stuck with the way it's done in git am
specifically because of its on-disk parameter store.  It forces us to
be able to handle a parameter list that we allocate ourselves instead
of being given one by the OS like argc/argv in main().  And this
justifies a bespoke solution instead of changing parse_options().

In the meantime Ævar showed enough examples to convince me that a more
general solution is indeed needed, but I still think we should keep
parse_options() unchanged and add something like strvec_clone_nodup()
instead.

Regarding using textual interfaces internally in general: It's the
easiest method, as we have to provide it for users anyway.  It's not
simple, though -- generating strings and managing their ownership adds
overhead like this patch, which we wouldn't have in an struct
interface.

> And I can see the point; if option "--foo" requires twiddling both
> revs.foo and revs.bar, then we have no way to enforce that callers have
> remembered to do both. But if the only code which handles this is the
> parser for "--foo", then we're OK.

Depends on the specifics.  Perhaps .bar is redundant and could be
inferred from .foo.  Or they could be replaced by an enum.

> In a non-C language, we'd probably mark "foo" and "bar" as private and
> provide a method to enable the option. We could provide a function, but
> we'd have no compiler help to ensure that it was used over fiddling the
> bits. Possibly calling them "foo_private" would be sufficient to make
> people think twice about tweaking them, though.

Or normalize the struct to avoid dependencies between fields.

> [1] Apologies for the passive voice; I didn't want to muddle the
>     paragraph by discussing who and when. But I know this is an argument
>     Junio has made; I don't know if he still stands by it these days
>     (there is quite a lot of field-twiddling of rev_info by now). I'm
>     not sure to what degree I agree with it myself, but I thought it was
>     worth mentioning here.
>
>> If we have to change parse_options() at all then I'd prefer it to not
>> free() anything (to keep it usable with main()'s parameters), but to
>> reorder in a non-destructive way.  That would mean keeping the NULL
>> sentinel where it is, and making sure all callers use only the returned
>> argc to determine which arguments parse_options() didn't recognize.
>
> My findings with -Wunused-parameter show that there are quite a lot of
> places that take argv/argc but assume the NULL-termination of the argv.

Right.

> If we are just re-ordering argv, though, it feels like this could still
> work with a strvec. Right now a strvec with "nr" items will free items 0
> through nr-1, assuming that v[nr] is its NULL invariant. But it could
> free v[nr], too, in case the NULL was swapped into an earlier position.
>
> It's a little weird already, because that swap has violated the
> invariant, so trying to strvec_push() onto it would cause confusing
> results. But if the general use case is to pass the strvec to
> parse_options(), get reordered, and then clear() it, it should work. If
> you wanted to get really fancy, push() et al could double-check the
> invariant and BUG().

Yes, parse_options() and strvec are not fitting perfectly.  Changing the
former to reorder the elements and keeping them all would require
checking that all callers use the return value.  Feels like a lot of work.

A variant that takes a strvec and removes and frees parsed items from it
would be clean, but would require refactoring indirect callers like
apply_parse_options() users.  Busywork.

Making a shallow copy to give to parse_options() in callers that currently
pass a strvec directly or indirectly seems like the simplest solution to
me for now.

René

Jeff King Dec. 17, 2022, 9:53 p.m. UTC | #7

On Sat, Dec 17, 2022 at 05:07:12PM +0100, René Scharfe wrote:

> > If we are just re-ordering argv, though, it feels like this could still
> > work with a strvec. Right now a strvec with "nr" items will free items 0
> > through nr-1, assuming that v[nr] is its NULL invariant. But it could
> > free v[nr], too, in case the NULL was swapped into an earlier position.
> >
> > It's a little weird already, because that swap has violated the
> > invariant, so trying to strvec_push() onto it would cause confusing
> > results. But if the general use case is to pass the strvec to
> > parse_options(), get reordered, and then clear() it, it should work. If
> > you wanted to get really fancy, push() et al could double-check the
> > invariant and BUG().
> 
> Yes, parse_options() and strvec are not fitting perfectly.  Changing the
> former to reorder the elements and keeping them all would require
> checking that all callers use the return value.  Feels like a lot of work.

I think we're already munging the strvec arrays in the option parser,
though. I'm just suggesting that parse_options() swap arguments to the
end instead of overwriting a NULL (actually, I'm not even sure it
doesn't do that already), and strvec_clear() checking the final element.
The end state is not necessarily safe, but it's no worse than what we
have today.

That said...

> A variant that takes a strvec and removes and frees parsed items from it
> would be clean, but would require refactoring indirect callers like
> apply_parse_options() users.  Busywork.
> 
> Making a shallow copy to give to parse_options() in callers that currently
> pass a strvec directly or indirectly seems like the simplest solution to
> me for now.

Yes, I thought your original patch actually got to the root of the
problem. strvec owns the array and its elements, and parse-options wants
to munge the array itself (but not the elements). Making a shallow copy
is eliminates the conflict over ownership.

-Peff

Junio C Hamano Dec. 18, 2022, 2:42 a.m. UTC | #8

Jeff King <peff@peff.net> writes:

>> Making a shallow copy to give to parse_options() in callers that currently
>> pass a strvec directly or indirectly seems like the simplest solution to
>> me for now.
>
> Yes, I thought your original patch actually got to the root of the
> problem. strvec owns the array and its elements, and parse-options wants
> to munge the array itself (but not the elements). Making a shallow copy
> is eliminates the conflict over ownership.

Yup, it did look very sensible to me.

Junio C Hamano Dec. 20, 2022, 1:29 a.m. UTC | #9

René Scharfe <l.s.r@web.de> writes:

> Depends on the specifics.  Perhaps .bar is redundant and could be
> inferred from .foo.  Or they could be replaced by an enum.
> ...
> Or normalize the struct to avoid dependencies between fields.

I think the example Peff brought up was mostly from the revisions
API, where there are two vastly different traversal behaviour
(i.e. limited and !limited).  With .limited bit set, we first
fully enumerate the potential commits, before showing a single
entry in the output, and without, we can stream the output.
In general, we prefer !limited traversal because that would give us
better interactive latency, but some options only can work with
the limited bit set (e.g. --ancestry-path, --cherry).

We _could_ probably have do without the .limited bit and instead
wrote each and every check we currently make to the revs->limited
bit as something like 

#define is_limited(revs) (revs->cherry_mark || \
			  revs->ancestry_path || \
			  ...)

so that the "struct members" API users do not have to know that they
have to set the .limit bit when they set .cherry_mark bit.  There
are many others even in the revisions API alone.

IOW, yes, this depends on the specifics, and "normalize" is easier
said than done, unfortunately.

am: don't pass strvec to apply_parse_options()

Commit Message

Comments

Patch