mbox series

[RFC,0/5] strvec: add a "nodup" mode, fix memory leaks

Message ID RFC-cover-0.5-00000000000-20221215T090226Z-avarab@gmail.com (mailing list archive)
Headers show
Series strvec: add a "nodup" mode, fix memory leaks | expand

Message

Ævar Arnfjörð Bjarmason Dec. 15, 2022, 9:11 a.m. UTC
This is an alternative to René's [1], his already fixes a leak in "git
am", and this could be done later, so I'm submitting it as RFC, but it
could also replace it.

I think as this series shows extending the "strvec" API to get a
feature that works like the existing "strdup_strings" that the "struct
string_list" has can make memory management much simpler.

The 4/5 here shows cases where we were leaking because our "v" was
clobbered, but where all the strings we were pushing to the "strvec"
were fixed strings, so we could skip xstrdup()-ing them.

The 5/5 then shows more complex cases where we have mixed-use,
i.e. most strings are fixed, but some are not. For those we use a
"struct string_list to_free = STRING_LIST_INIT_DUP", which we then
push to the "to_free" list with "string_list_append_nodup()".

This does make the API slightly more dangerous to use, as it's no
longer guaranteed that it owns all the members it points to. But as
the "struct string_list" has shown this isn't an issue in practice,
and e.g. SANITIZE=address et al are good about finding double-frees,
or frees of fixed strings.

A branch & CI for this are found at [2].

1. https://lore.kernel.org/git/baf93e4a-7f05-857c-e551-09675496c03c@web.de/
2. https://github.com/avar/git/tree/avar/add-and-use-strvec-init-nodup

Ævar Arnfjörð Bjarmason (5):
  builtin/annotate.c: simplify for strvec API
  various: add missing strvec_clear()
  strvec API: add a "STRVEC_INIT_NODUP"
  strvec API users: fix leaks by using "STRVEC_INIT_NODUP"
  strvec API users: fix more leaks by using "STRVEC_INIT_NODUP"

 builtin/am.c                  |  2 +-
 builtin/annotate.c            | 17 ++++++++---------
 builtin/describe.c            | 28 +++++++++++++++++++++-------
 builtin/stash.c               |  8 ++++++--
 builtin/upload-archive.c      | 16 ++++++++++++----
 strvec.c                      | 20 ++++++++++++++++++--
 strvec.h                      | 30 +++++++++++++++++++++++++++++-
 t/t0023-crlf-am.sh            |  1 +
 t/t4152-am-subjects.sh        |  2 ++
 t/t4254-am-corrupt.sh         |  2 ++
 t/t4256-am-format-flowed.sh   |  1 +
 t/t4257-am-interactive.sh     |  2 ++
 t/t5003-archive-zip.sh        |  1 +
 t/t5403-post-checkout-hook.sh |  1 +
 14 files changed, 105 insertions(+), 26 deletions(-)

Comments

René Scharfe Dec. 17, 2022, 12:45 p.m. UTC | #1
Am 15.12.22 um 10:11 schrieb Ævar Arnfjörð Bjarmason:
> This is an alternative to René's [1], his already fixes a leak in "git
> am", and this could be done later, so I'm submitting it as RFC, but it
> could also replace it.
>
> I think as this series shows extending the "strvec" API to get a
> feature that works like the existing "strdup_strings" that the "struct
> string_list" has can make memory management much simpler.
>
> The 4/5 here shows cases where we were leaking because our "v" was
> clobbered, but where all the strings we were pushing to the "strvec"
> were fixed strings, so we could skip xstrdup()-ing them.
>
> The 5/5 then shows more complex cases where we have mixed-use,
> i.e. most strings are fixed, but some are not. For those we use a
> "struct string_list to_free = STRING_LIST_INIT_DUP", which we then
> push to the "to_free" list with "string_list_append_nodup()".
>
> This does make the API slightly more dangerous to use, as it's no
> longer guaranteed that it owns all the members it points to. But as
> the "struct string_list" has shown this isn't an issue in practice,
> and e.g. SANITIZE=address et al are good about finding double-frees,
> or frees of fixed strings.
>
> A branch & CI for this are found at [2].
>
> 1. https://lore.kernel.org/git/baf93e4a-7f05-857c-e551-09675496c03c@web.de/
> 2. https://github.com/avar/git/tree/avar/add-and-use-strvec-init-nodup
>
> Ævar Arnfjörð Bjarmason (5):
>   builtin/annotate.c: simplify for strvec API
>   various: add missing strvec_clear()
>   strvec API: add a "STRVEC_INIT_NODUP"
>   strvec API users: fix leaks by using "STRVEC_INIT_NODUP"
>   strvec API users: fix more leaks by using "STRVEC_INIT_NODUP"
>
>  builtin/am.c                  |  2 +-
>  builtin/annotate.c            | 17 ++++++++---------
>  builtin/describe.c            | 28 +++++++++++++++++++++-------
>  builtin/stash.c               |  8 ++++++--
>  builtin/upload-archive.c      | 16 ++++++++++++----
>  strvec.c                      | 20 ++++++++++++++++++--
>  strvec.h                      | 30 +++++++++++++++++++++++++++++-
>  t/t0023-crlf-am.sh            |  1 +
>  t/t4152-am-subjects.sh        |  2 ++
>  t/t4254-am-corrupt.sh         |  2 ++
>  t/t4256-am-format-flowed.sh   |  1 +
>  t/t4257-am-interactive.sh     |  2 ++
>  t/t5003-archive-zip.sh        |  1 +
>  t/t5403-post-checkout-hook.sh |  1 +
>  14 files changed, 105 insertions(+), 26 deletions(-)
>

This complicates strvec to gain functionality that we basically already
have with REALLOC_ARRAY.  It allows for nice conversions in some cases
(patch 4), but places that require some kind of double accounting become
quite nasty (patch 5).  I'd prefer to keep strvec simple.  Avoiding
allocations isn't necessary because the number of options to parse is
low -- we just need to keep track of and release them at the end.

The problem here is that parse_options() takes a string list in the form
of an int paired with a const char ** and modifies this list in place.
That's fine if we don't care about the memory ownership of the list and
its elements, e.g. because we get it from the OS (main() style) or
accept leakage.

It's not compatible with the list being a strvec that we don't want to
leak, though.  What we need is something that translates from strvec to
the argc/argv convention.  If we allow leaks this is trivial: Just use
the .v member directly.  If we don't then it is still simple: Make a
shallow copy of the .v member, release it after use.  Demo patch below.

 builtin/am.c             |  5 ++++-
 builtin/annotate.c       |  9 ++++++++-
 builtin/describe.c       |  8 +++++++-
 builtin/stash.c          | 10 +++++++++-
 builtin/upload-archive.c | 11 +++++++++--
 strvec.c                 |  9 +++++++++
 strvec.h                 |  7 +++++++
 7 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/builtin/am.c b/builtin/am.c
index 30c9b3a9cd..a80d40f4be 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -1476,14 +1476,16 @@ static int run_apply(const struct am_state *state, const char *index_file)
 	int res, opts_left;
 	int force_apply = 0;
 	int options = 0;
+	const char **apply_argv;

 	if (init_apply_state(&apply_state, the_repository, NULL))
 		BUG("init_apply_state() failed");

 	strvec_push(&apply_opts, "apply");
 	strvec_pushv(&apply_opts, state->git_apply_opts.v);
+	apply_argv = strvec_clone_nodup(&apply_opts);

-	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
+	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
 					&apply_state, &force_apply, &options,
 					NULL);

@@ -1513,6 +1515,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
 	strvec_clear(&apply_paths);
 	strvec_clear(&apply_opts);
 	clear_apply_state(&apply_state);
+	free(apply_argv);

 	if (res)
 		return res;
diff --git a/builtin/annotate.c b/builtin/annotate.c
index 58ff977a23..7e75f0f48d 100644
--- a/builtin/annotate.c
+++ b/builtin/annotate.c
@@ -10,6 +10,8 @@
 int cmd_annotate(int argc, const char **argv, const char *prefix)
 {
 	struct strvec args = STRVEC_INIT;
+	const char **blame_argv;
+	int ret;
 	int i;

 	strvec_pushl(&args, "annotate", "-c", NULL);
@@ -17,6 +19,11 @@ int cmd_annotate(int argc, const char **argv, const char *prefix)
 	for (i = 1; i < argc; i++) {
 		strvec_push(&args, argv[i]);
 	}
+	blame_argv = strvec_clone_nodup(&args);

-	return cmd_blame(args.nr, args.v, prefix);
+	ret = cmd_blame(args.nr, blame_argv, prefix);
+
+	strvec_clear(&args);
+	free(blame_argv);
+	return ret;
 }
diff --git a/builtin/describe.c b/builtin/describe.c
index eea1e330c0..2ef2fbc0d4 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -598,6 +598,8 @@ int cmd_describe(int argc, const char **argv, const char *prefix)
 	if (contains) {
 		struct string_list_item *item;
 		struct strvec args;
+		const char **name_rev_argv;
+		int ret;

 		strvec_init(&args);
 		strvec_pushl(&args, "name-rev",
@@ -616,7 +618,11 @@ int cmd_describe(int argc, const char **argv, const char *prefix)
 			strvec_pushv(&args, argv);
 		else
 			strvec_push(&args, "HEAD");
-		return cmd_name_rev(args.nr, args.v, prefix);
+		name_rev_argv = strvec_clone_nodup(&args);
+		ret = cmd_name_rev(args.nr, name_rev_argv, prefix);
+		strvec_clear(&args);
+		free(name_rev_argv);
+		return ret;
 	}

 	hashmap_init(&names, commit_name_neq, NULL, 0);
diff --git a/builtin/stash.c b/builtin/stash.c
index bb0fd86143..864b7c573a 100644
--- a/builtin/stash.c
+++ b/builtin/stash.c
@@ -1820,6 +1820,8 @@ static int save_stash(int argc, const char **argv, const char *prefix)

 int cmd_stash(int argc, const char **argv, const char *prefix)
 {
+	int ret;
+	const char **push_stash_argv;
 	pid_t pid = getpid();
 	const char *index_file;
 	struct strvec args = STRVEC_INIT;
@@ -1861,5 +1863,11 @@ int cmd_stash(int argc, const char **argv, const char *prefix)
 	/* Assume 'stash push' */
 	strvec_push(&args, "push");
 	strvec_pushv(&args, argv);
-	return !!push_stash(args.nr, args.v, prefix, 1);
+	push_stash_argv = strvec_clone_nodup(&args);
+
+	ret = !!push_stash(args.nr, push_stash_argv, prefix, 1);
+
+	strvec_clear(&args);
+	free(push_stash_argv);
+	return ret;
 }
diff --git a/builtin/upload-archive.c b/builtin/upload-archive.c
index 945ee2b412..36ed85e10a 100644
--- a/builtin/upload-archive.c
+++ b/builtin/upload-archive.c
@@ -19,6 +19,8 @@ static const char deadchild[] =

 int cmd_upload_archive_writer(int argc, const char **argv, const char *prefix)
 {
+	int ret;
+	const char **upload_archive_argv;
 	struct strvec sent_argv = STRVEC_INIT;
 	const char *arg_cmd = "argument ";

@@ -43,10 +45,15 @@ int cmd_upload_archive_writer(int argc, const char **argv, const char *prefix)
 			die("'argument' token or flush expected");
 		strvec_push(&sent_argv, buf + strlen(arg_cmd));
 	}
+	upload_archive_argv = strvec_clone_nodup(&sent_argv);

 	/* parse all options sent by the client */
-	return write_archive(sent_argv.nr, sent_argv.v, prefix,
-			     the_repository, NULL, 1);
+	ret = write_archive(sent_argv.nr, upload_archive_argv, prefix,
+			    the_repository, NULL, 1);
+
+	strvec_clear(&sent_argv);
+	free(upload_archive_argv);
+	return ret;
 }

 __attribute__((format (printf, 1, 2)))
diff --git a/strvec.c b/strvec.c
index 61a76ce6cb..5c41c8c506 100644
--- a/strvec.c
+++ b/strvec.c
@@ -106,3 +106,12 @@ const char **strvec_detach(struct strvec *array)
 		return ret;
 	}
 }
+
+const char **strvec_clone_nodup(const struct strvec *sv)
+{
+	const char **ret;
+
+	CALLOC_ARRAY(ret, sv->nr + 1);
+	COPY_ARRAY(ret, sv->v, sv->nr);
+	return ret;
+}
diff --git a/strvec.h b/strvec.h
index 9f55c8766b..6234634b00 100644
--- a/strvec.h
+++ b/strvec.h
@@ -88,4 +88,11 @@ void strvec_clear(struct strvec *);
  */
 const char **strvec_detach(struct strvec *);

+/**
+ * Create a copy of the array that references the original strings.
+ * The caller is responsible for freeing the memory of that copy,
+ * but must not free the individual strings.
+ */
+const char **strvec_clone_nodup(const struct strvec *);
+
 #endif /* STRVEC_H */
Jeff King Dec. 17, 2022, 1:13 p.m. UTC | #2
On Thu, Dec 15, 2022 at 10:11:06AM +0100, Ævar Arnfjörð Bjarmason wrote:

> This is an alternative to René's [1], his already fixes a leak in "git
> am", and this could be done later, so I'm submitting it as RFC, but it
> could also replace it.
> 
> I think as this series shows extending the "strvec" API to get a
> feature that works like the existing "strdup_strings" that the "struct
> string_list" has can make memory management much simpler.

I know this is kind of a surface level review, but...please don't do
this. We have chased so many bugs over the years due to string-list's
"maybe this is allocated and maybe not", in both directions (accidental
leaks and double-frees).

One of the reasons I advocated for strvec in the first place is so that
it would have consistent memory management semantics, at the minor cost
of sometimes duplicating them when we don't need to.

And having a nodup form doesn't even save you from having to call
strvec_clear(); you still need to do so to avoid leaking the array
itself. It only helps in the weird parse-options case, where we don't
handle ownership of the array very well (the strvec owns it, but
parse-options wants to modify it).

> This does make the API slightly more dangerous to use, as it's no
> longer guaranteed that it owns all the members it points to. But as
> the "struct string_list" has shown this isn't an issue in practice,
> and e.g. SANITIZE=address et al are good about finding double-frees,
> or frees of fixed strings.

I would disagree that this hasn't been an issue in practice. A few
recent examples:

  - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config,
    2022-11-17)
  - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
    strings, 2022-09-08)
  - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01)

Now you could argue that those leaks might still exist if we only had a
duplicating version of string-list (after all, the problem in a leak is
an extra duplication). But IMHO it is the ambiguity and the games we
play with setting/unsetting the strdup_strings field that lead to these
errors.

And yes, leak-checking and sanitizers can sometimes find these bugs. But
that implies triggering the bug in the test suite. And it implies extra
time to track and fix them. An interface which is harder to get wrong in
the first place is preferable.

-Peff
Ævar Arnfjörð Bjarmason Dec. 19, 2022, 9:20 a.m. UTC | #3
On Sat, Dec 17 2022, Jeff King wrote:

> On Thu, Dec 15, 2022 at 10:11:06AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This is an alternative to René's [1], his already fixes a leak in "git
>> am", and this could be done later, so I'm submitting it as RFC, but it
>> could also replace it.
>> 
>> I think as this series shows extending the "strvec" API to get a
>> feature that works like the existing "strdup_strings" that the "struct
>> string_list" has can make memory management much simpler.
>
> I know this is kind of a surface level review, but...please don't do
> this. We have chased so many bugs over the years due to string-list's
> "maybe this is allocated and maybe not", in both directions (accidental
> leaks and double-frees).
>
> One of the reasons I advocated for strvec in the first place is so that
> it would have consistent memory management semantics, at the minor cost
> of sometimes duplicating them when we don't need to.
>
> And having a nodup form doesn't even save you from having to call
> strvec_clear(); you still need to do so to avoid leaking the array
> itself. It only helps in the weird parse-options case, where we don't
> handle ownership of the array very well (the strvec owns it, but
> parse-options wants to modify it).

Yes, just like "struct string_list" in the "nodup" mode.

I hear you, but I also think you're implicitly conflating two things
here.

There's the question of whether we should in general optimize for safety
over more optimila memory use. I.e. if we simply have every strvec,
string_list etc. own its memory fully we don't need to think as much
about allocation or ownership.

I think we should do that in general, but we also have cases where we'd
like to not do that, e.g. where we're adding thousands of strings to a
string_list, which are all borrewed from elsewhere, except for a few
we'd like to xstrdup().

Such API use *is* tricky, but I think formalizing it as the
"string_list" does is making it better, not worse. In particular...:

>> This does make the API slightly more dangerous to use, as it's no
>> longer guaranteed that it owns all the members it points to. But as
>> the "struct string_list" has shown this isn't an issue in practice,
>> and e.g. SANITIZE=address et al are good about finding double-frees,
>> or frees of fixed strings.
>
> I would disagree that this hasn't been an issue in practice. A few
> recent examples:
>
>   - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config,
>     2022-11-17)
>   - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
>     strings, 2022-09-08)
>   - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01)

...it's funny that those are the examples I would have dug up to argue
that this is a good idea, and to add some:

	- 4a0479086a9 (commit-graph: fix memory leak in misused
          string_list API, 2022-03-04)
	- b202e51b154 (grep: fix a "path_list" memory leak, 2021-10-22)

I.e. above you note "in both directions [...] leaks [...] and double
frees", but these (and the ones I added) are all in the second category.

Which is why I don't think it's an issue in practice. The leaks have
been a non-issue, and to the extent that we care the SANITIZE=leak
testing is closing that gap.

The dangerous issue is the double-free, but (and over the years I have
looked at pretty much every caller) I can't imagine a string_list
use-case where we:

 a) Actually still want to keep that memory optimization, i.e. it
    wouldn't be better by just saying "screw it, let's dup it".
 b) Given "a", we'd be better off with some bespoke custom pattern over
    the facility to do this with the "string_list".

So I really think we're in agreement about 99% of this, I just don't see
how *if* we want to do this why we're better of re-inventing this
particular wheel for every caller.
Jeff King Jan. 7, 2023, 1:21 p.m. UTC | #4
On Mon, Dec 19, 2022 at 10:20:00AM +0100, Ævar Arnfjörð Bjarmason wrote:

> I hear you, but I also think you're implicitly conflating two things
> here.
> 
> There's the question of whether we should in general optimize for safety
> over more optimila memory use. I.e. if we simply have every strvec,
> string_list etc. own its memory fully we don't need to think as much
> about allocation or ownership.
> 
> I think we should do that in general, but we also have cases where we'd
> like to not do that, e.g. where we're adding thousands of strings to a
> string_list, which are all borrewed from elsewhere, except for a few
> we'd like to xstrdup().
> 
> Such API use *is* tricky, but I think formalizing it as the
> "string_list" does is making it better, not worse. In particular...:

I think the two things are related, though, because "safety" is "people
not mis-using the API". Once you give the API the option to do the
trickier thing, people will do it, and they will do it badly. That has
been my experience with the string-list API (especially that people try
to do clever things with transferring ownership by using a
non-duplicating list and then setting strdup_strings midway through the
function).

> > I would disagree that this hasn't been an issue in practice. A few
> > recent examples:
> >
> >   - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config,
> >     2022-11-17)
> >   - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
> >     strings, 2022-09-08)
> >   - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01)
> 
> ...it's funny that those are the examples I would have dug up to argue
> that this is a good idea, and to add some:
> 
> 	- 4a0479086a9 (commit-graph: fix memory leak in misused
>           string_list API, 2022-03-04)
> 	- b202e51b154 (grep: fix a "path_list" memory leak, 2021-10-22)
> 
> I.e. above you note "in both directions [...] leaks [...] and double
> frees", but these (and the ones I added) are all in the second category.

I almost didn't give a list, because it's hard to dig into all of the
details. For example the second one in my list, which I worked on, did
fix things by consistently setting strdup_strings, and it was "just" a
leak fix. But it took me a full day to untangle the mess of that code,
and there were intermediate states were we did access dangling pointers
before I got to a working order of the series.  And all of it was
complicated by the fact that string_list is configurable, so different
bits of the code expected different things. Even after that commit,
there's a horrible hack to set the strdup field and hope it's enough.

If that were the only time I'd wrestled with string_list, I'd be less
concerned. But (subjectively) this feels like the latest in a long line
of bugs and irritations.

-Peff