Message ID | RFC-cover-0.5-00000000000-20221215T090226Z-avarab@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | strvec: add a "nodup" mode, fix memory leaks | expand |
Am 15.12.22 um 10:11 schrieb Ævar Arnfjörð Bjarmason: > This is an alternative to René's [1], his already fixes a leak in "git > am", and this could be done later, so I'm submitting it as RFC, but it > could also replace it. > > I think as this series shows extending the "strvec" API to get a > feature that works like the existing "strdup_strings" that the "struct > string_list" has can make memory management much simpler. > > The 4/5 here shows cases where we were leaking because our "v" was > clobbered, but where all the strings we were pushing to the "strvec" > were fixed strings, so we could skip xstrdup()-ing them. > > The 5/5 then shows more complex cases where we have mixed-use, > i.e. most strings are fixed, but some are not. For those we use a > "struct string_list to_free = STRING_LIST_INIT_DUP", which we then > push to the "to_free" list with "string_list_append_nodup()". > > This does make the API slightly more dangerous to use, as it's no > longer guaranteed that it owns all the members it points to. But as > the "struct string_list" has shown this isn't an issue in practice, > and e.g. SANITIZE=address et al are good about finding double-frees, > or frees of fixed strings. > > A branch & CI for this are found at [2]. > > 1. https://lore.kernel.org/git/baf93e4a-7f05-857c-e551-09675496c03c@web.de/ > 2. https://github.com/avar/git/tree/avar/add-and-use-strvec-init-nodup > > Ævar Arnfjörð Bjarmason (5): > builtin/annotate.c: simplify for strvec API > various: add missing strvec_clear() > strvec API: add a "STRVEC_INIT_NODUP" > strvec API users: fix leaks by using "STRVEC_INIT_NODUP" > strvec API users: fix more leaks by using "STRVEC_INIT_NODUP" > > builtin/am.c | 2 +- > builtin/annotate.c | 17 ++++++++--------- > builtin/describe.c | 28 +++++++++++++++++++++------- > builtin/stash.c | 8 ++++++-- > builtin/upload-archive.c | 16 ++++++++++++---- > strvec.c | 20 ++++++++++++++++++-- > strvec.h | 30 +++++++++++++++++++++++++++++- > t/t0023-crlf-am.sh | 1 + > t/t4152-am-subjects.sh | 2 ++ > t/t4254-am-corrupt.sh | 2 ++ > t/t4256-am-format-flowed.sh | 1 + > t/t4257-am-interactive.sh | 2 ++ > t/t5003-archive-zip.sh | 1 + > t/t5403-post-checkout-hook.sh | 1 + > 14 files changed, 105 insertions(+), 26 deletions(-) > This complicates strvec to gain functionality that we basically already have with REALLOC_ARRAY. It allows for nice conversions in some cases (patch 4), but places that require some kind of double accounting become quite nasty (patch 5). I'd prefer to keep strvec simple. Avoiding allocations isn't necessary because the number of options to parse is low -- we just need to keep track of and release them at the end. The problem here is that parse_options() takes a string list in the form of an int paired with a const char ** and modifies this list in place. That's fine if we don't care about the memory ownership of the list and its elements, e.g. because we get it from the OS (main() style) or accept leakage. It's not compatible with the list being a strvec that we don't want to leak, though. What we need is something that translates from strvec to the argc/argv convention. If we allow leaks this is trivial: Just use the .v member directly. If we don't then it is still simple: Make a shallow copy of the .v member, release it after use. Demo patch below. builtin/am.c | 5 ++++- builtin/annotate.c | 9 ++++++++- builtin/describe.c | 8 +++++++- builtin/stash.c | 10 +++++++++- builtin/upload-archive.c | 11 +++++++++-- strvec.c | 9 +++++++++ strvec.h | 7 +++++++ 7 files changed, 53 insertions(+), 6 deletions(-) diff --git a/builtin/am.c b/builtin/am.c index 30c9b3a9cd..a80d40f4be 100644 --- a/builtin/am.c +++ b/builtin/am.c @@ -1476,14 +1476,16 @@ static int run_apply(const struct am_state *state, const char *index_file) int res, opts_left; int force_apply = 0; int options = 0; + const char **apply_argv; if (init_apply_state(&apply_state, the_repository, NULL)) BUG("init_apply_state() failed"); strvec_push(&apply_opts, "apply"); strvec_pushv(&apply_opts, state->git_apply_opts.v); + apply_argv = strvec_clone_nodup(&apply_opts); - opts_left = apply_parse_options(apply_opts.nr, apply_opts.v, + opts_left = apply_parse_options(apply_opts.nr, apply_argv, &apply_state, &force_apply, &options, NULL); @@ -1513,6 +1515,7 @@ static int run_apply(const struct am_state *state, const char *index_file) strvec_clear(&apply_paths); strvec_clear(&apply_opts); clear_apply_state(&apply_state); + free(apply_argv); if (res) return res; diff --git a/builtin/annotate.c b/builtin/annotate.c index 58ff977a23..7e75f0f48d 100644 --- a/builtin/annotate.c +++ b/builtin/annotate.c @@ -10,6 +10,8 @@ int cmd_annotate(int argc, const char **argv, const char *prefix) { struct strvec args = STRVEC_INIT; + const char **blame_argv; + int ret; int i; strvec_pushl(&args, "annotate", "-c", NULL); @@ -17,6 +19,11 @@ int cmd_annotate(int argc, const char **argv, const char *prefix) for (i = 1; i < argc; i++) { strvec_push(&args, argv[i]); } + blame_argv = strvec_clone_nodup(&args); - return cmd_blame(args.nr, args.v, prefix); + ret = cmd_blame(args.nr, blame_argv, prefix); + + strvec_clear(&args); + free(blame_argv); + return ret; } diff --git a/builtin/describe.c b/builtin/describe.c index eea1e330c0..2ef2fbc0d4 100644 --- a/builtin/describe.c +++ b/builtin/describe.c @@ -598,6 +598,8 @@ int cmd_describe(int argc, const char **argv, const char *prefix) if (contains) { struct string_list_item *item; struct strvec args; + const char **name_rev_argv; + int ret; strvec_init(&args); strvec_pushl(&args, "name-rev", @@ -616,7 +618,11 @@ int cmd_describe(int argc, const char **argv, const char *prefix) strvec_pushv(&args, argv); else strvec_push(&args, "HEAD"); - return cmd_name_rev(args.nr, args.v, prefix); + name_rev_argv = strvec_clone_nodup(&args); + ret = cmd_name_rev(args.nr, name_rev_argv, prefix); + strvec_clear(&args); + free(name_rev_argv); + return ret; } hashmap_init(&names, commit_name_neq, NULL, 0); diff --git a/builtin/stash.c b/builtin/stash.c index bb0fd86143..864b7c573a 100644 --- a/builtin/stash.c +++ b/builtin/stash.c @@ -1820,6 +1820,8 @@ static int save_stash(int argc, const char **argv, const char *prefix) int cmd_stash(int argc, const char **argv, const char *prefix) { + int ret; + const char **push_stash_argv; pid_t pid = getpid(); const char *index_file; struct strvec args = STRVEC_INIT; @@ -1861,5 +1863,11 @@ int cmd_stash(int argc, const char **argv, const char *prefix) /* Assume 'stash push' */ strvec_push(&args, "push"); strvec_pushv(&args, argv); - return !!push_stash(args.nr, args.v, prefix, 1); + push_stash_argv = strvec_clone_nodup(&args); + + ret = !!push_stash(args.nr, push_stash_argv, prefix, 1); + + strvec_clear(&args); + free(push_stash_argv); + return ret; } diff --git a/builtin/upload-archive.c b/builtin/upload-archive.c index 945ee2b412..36ed85e10a 100644 --- a/builtin/upload-archive.c +++ b/builtin/upload-archive.c @@ -19,6 +19,8 @@ static const char deadchild[] = int cmd_upload_archive_writer(int argc, const char **argv, const char *prefix) { + int ret; + const char **upload_archive_argv; struct strvec sent_argv = STRVEC_INIT; const char *arg_cmd = "argument "; @@ -43,10 +45,15 @@ int cmd_upload_archive_writer(int argc, const char **argv, const char *prefix) die("'argument' token or flush expected"); strvec_push(&sent_argv, buf + strlen(arg_cmd)); } + upload_archive_argv = strvec_clone_nodup(&sent_argv); /* parse all options sent by the client */ - return write_archive(sent_argv.nr, sent_argv.v, prefix, - the_repository, NULL, 1); + ret = write_archive(sent_argv.nr, upload_archive_argv, prefix, + the_repository, NULL, 1); + + strvec_clear(&sent_argv); + free(upload_archive_argv); + return ret; } __attribute__((format (printf, 1, 2))) diff --git a/strvec.c b/strvec.c index 61a76ce6cb..5c41c8c506 100644 --- a/strvec.c +++ b/strvec.c @@ -106,3 +106,12 @@ const char **strvec_detach(struct strvec *array) return ret; } } + +const char **strvec_clone_nodup(const struct strvec *sv) +{ + const char **ret; + + CALLOC_ARRAY(ret, sv->nr + 1); + COPY_ARRAY(ret, sv->v, sv->nr); + return ret; +} diff --git a/strvec.h b/strvec.h index 9f55c8766b..6234634b00 100644 --- a/strvec.h +++ b/strvec.h @@ -88,4 +88,11 @@ void strvec_clear(struct strvec *); */ const char **strvec_detach(struct strvec *); +/** + * Create a copy of the array that references the original strings. + * The caller is responsible for freeing the memory of that copy, + * but must not free the individual strings. + */ +const char **strvec_clone_nodup(const struct strvec *); + #endif /* STRVEC_H */
On Thu, Dec 15, 2022 at 10:11:06AM +0100, Ævar Arnfjörð Bjarmason wrote: > This is an alternative to René's [1], his already fixes a leak in "git > am", and this could be done later, so I'm submitting it as RFC, but it > could also replace it. > > I think as this series shows extending the "strvec" API to get a > feature that works like the existing "strdup_strings" that the "struct > string_list" has can make memory management much simpler. I know this is kind of a surface level review, but...please don't do this. We have chased so many bugs over the years due to string-list's "maybe this is allocated and maybe not", in both directions (accidental leaks and double-frees). One of the reasons I advocated for strvec in the first place is so that it would have consistent memory management semantics, at the minor cost of sometimes duplicating them when we don't need to. And having a nodup form doesn't even save you from having to call strvec_clear(); you still need to do so to avoid leaking the array itself. It only helps in the weird parse-options case, where we don't handle ownership of the array very well (the strvec owns it, but parse-options wants to modify it). > This does make the API slightly more dangerous to use, as it's no > longer guaranteed that it owns all the members it points to. But as > the "struct string_list" has shown this isn't an issue in practice, > and e.g. SANITIZE=address et al are good about finding double-frees, > or frees of fixed strings. I would disagree that this hasn't been an issue in practice. A few recent examples: - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config, 2022-11-17) - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec strings, 2022-09-08) - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01) Now you could argue that those leaks might still exist if we only had a duplicating version of string-list (after all, the problem in a leak is an extra duplication). But IMHO it is the ambiguity and the games we play with setting/unsetting the strdup_strings field that lead to these errors. And yes, leak-checking and sanitizers can sometimes find these bugs. But that implies triggering the bug in the test suite. And it implies extra time to track and fix them. An interface which is harder to get wrong in the first place is preferable. -Peff
On Sat, Dec 17 2022, Jeff King wrote: > On Thu, Dec 15, 2022 at 10:11:06AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> This is an alternative to René's [1], his already fixes a leak in "git >> am", and this could be done later, so I'm submitting it as RFC, but it >> could also replace it. >> >> I think as this series shows extending the "strvec" API to get a >> feature that works like the existing "strdup_strings" that the "struct >> string_list" has can make memory management much simpler. > > I know this is kind of a surface level review, but...please don't do > this. We have chased so many bugs over the years due to string-list's > "maybe this is allocated and maybe not", in both directions (accidental > leaks and double-frees). > > One of the reasons I advocated for strvec in the first place is so that > it would have consistent memory management semantics, at the minor cost > of sometimes duplicating them when we don't need to. > > And having a nodup form doesn't even save you from having to call > strvec_clear(); you still need to do so to avoid leaking the array > itself. It only helps in the weird parse-options case, where we don't > handle ownership of the array very well (the strvec owns it, but > parse-options wants to modify it). Yes, just like "struct string_list" in the "nodup" mode. I hear you, but I also think you're implicitly conflating two things here. There's the question of whether we should in general optimize for safety over more optimila memory use. I.e. if we simply have every strvec, string_list etc. own its memory fully we don't need to think as much about allocation or ownership. I think we should do that in general, but we also have cases where we'd like to not do that, e.g. where we're adding thousands of strings to a string_list, which are all borrewed from elsewhere, except for a few we'd like to xstrdup(). Such API use *is* tricky, but I think formalizing it as the "string_list" does is making it better, not worse. In particular...: >> This does make the API slightly more dangerous to use, as it's no >> longer guaranteed that it owns all the members it points to. But as >> the "struct string_list" has shown this isn't an issue in practice, >> and e.g. SANITIZE=address et al are good about finding double-frees, >> or frees of fixed strings. > > I would disagree that this hasn't been an issue in practice. A few > recent examples: > > - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config, > 2022-11-17) > - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec > strings, 2022-09-08) > - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01) ...it's funny that those are the examples I would have dug up to argue that this is a good idea, and to add some: - 4a0479086a9 (commit-graph: fix memory leak in misused string_list API, 2022-03-04) - b202e51b154 (grep: fix a "path_list" memory leak, 2021-10-22) I.e. above you note "in both directions [...] leaks [...] and double frees", but these (and the ones I added) are all in the second category. Which is why I don't think it's an issue in practice. The leaks have been a non-issue, and to the extent that we care the SANITIZE=leak testing is closing that gap. The dangerous issue is the double-free, but (and over the years I have looked at pretty much every caller) I can't imagine a string_list use-case where we: a) Actually still want to keep that memory optimization, i.e. it wouldn't be better by just saying "screw it, let's dup it". b) Given "a", we'd be better off with some bespoke custom pattern over the facility to do this with the "string_list". So I really think we're in agreement about 99% of this, I just don't see how *if* we want to do this why we're better of re-inventing this particular wheel for every caller.
On Mon, Dec 19, 2022 at 10:20:00AM +0100, Ævar Arnfjörð Bjarmason wrote: > I hear you, but I also think you're implicitly conflating two things > here. > > There's the question of whether we should in general optimize for safety > over more optimila memory use. I.e. if we simply have every strvec, > string_list etc. own its memory fully we don't need to think as much > about allocation or ownership. > > I think we should do that in general, but we also have cases where we'd > like to not do that, e.g. where we're adding thousands of strings to a > string_list, which are all borrewed from elsewhere, except for a few > we'd like to xstrdup(). > > Such API use *is* tricky, but I think formalizing it as the > "string_list" does is making it better, not worse. In particular...: I think the two things are related, though, because "safety" is "people not mis-using the API". Once you give the API the option to do the trickier thing, people will do it, and they will do it badly. That has been my experience with the string-list API (especially that people try to do clever things with transferring ownership by using a non-duplicating list and then setting strdup_strings midway through the function). > > I would disagree that this hasn't been an issue in practice. A few > > recent examples: > > > > - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config, > > 2022-11-17) > > - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec > > strings, 2022-09-08) > > - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01) > > ...it's funny that those are the examples I would have dug up to argue > that this is a good idea, and to add some: > > - 4a0479086a9 (commit-graph: fix memory leak in misused > string_list API, 2022-03-04) > - b202e51b154 (grep: fix a "path_list" memory leak, 2021-10-22) > > I.e. above you note "in both directions [...] leaks [...] and double > frees", but these (and the ones I added) are all in the second category. I almost didn't give a list, because it's hard to dig into all of the details. For example the second one in my list, which I worked on, did fix things by consistently setting strdup_strings, and it was "just" a leak fix. But it took me a full day to untangle the mess of that code, and there were intermediate states were we did access dangling pointers before I got to a working order of the series. And all of it was complicated by the fact that string_list is configurable, so different bits of the code expected different things. Even after that commit, there's a horrible hack to set the strdup field and hope it's enough. If that were the only time I'd wrestled with string_list, I'd be less concerned. But (subjectively) this feels like the latest in a long line of bugs and irritations. -Peff