Message ID | b9310e9941e91336258edd97a913e5908180720e.1597561152.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Clean up some memory leaks in and around dir.c | expand |
On Sun, Aug 16, 2020 at 06:59:11AM +0000, Elijah Newren via GitGitGadget wrote: > Digging further, I found that despite the pretty clear documentation > near the top of dir.h that folks were supposed to call clear_directory() > when the user no longer needed the dir_struct, there were four callers > that didn't bother doing that at all. However, two of them clearly > thought about leaks since they had an UNLEAK(dir) directive, which to me > suggests that the method to free the data was too unclear. I suspect > the non-obviousness of the API and its holes led folks to avoid it, > which then snowballed into further problems with the entries[], > ignored[], parent_hashmap, and recursive_hashmap problems. The UNLEAK() ones are sort-of my fault, and are a combination of: - The commit adding them says "in some cases (e.g., dir_struct) we don't even have a function which knows how to free all of the struct members". I'm not sure if why I didn't fix clear_directory() as you did. I may not have known about it, or I may have been lazy. Or more likely I was simply holding the UNLEAK() hammer, so it looked like a nail. ;) - My focus was on making leak-checker output cleaner. And it wasn't clear for cases where we're about to exit the program whether we should be actually freeing (which has small but non-zero cost) or just annotating (which is zero-cost, but clearly has confused some people about how UNLEAK() was meant to be used). I think I'm leaning these days towards "free if it is easy to do so". So this definitely seems like a good direction to me. > Rename clear_directory() to dir_free() to be more in line with other > data structures in git, and introduce a dir_init() to handle the > suggested memsetting of dir_struct to all zeroes. I hope that a name > like "dir_free()" is more clear, and that the presence of dir_init() > will provide a hint to those looking at the code that there may be a > corresponding dir_free() that they need to call. I think having a pair of matched calls is good. I _thought_ we had established a pattern that we should prefer "clear" to "free" for cases where the struct itself its not freed. But grepping around, I see there are a few exceptions (hashmap_free() is the big one, and then oidmap_free() which is built around it seems to have inherited it). The rest seem to follow that pattern, though: attr_check_free, cache_tree_free, and submodule_cache_free all actually free the pointer passed in. And "git grep '_clear(' *.h" shows lots of clear-but-don't-free examples. Would dir_clear() make the pairing more obvious, but keep the clear name? (I also wouldn't be opposed to changing hashmap and oidmap to use the name "clear", but that's obviously a separate patch). > builtin/add.c | 4 ++-- > builtin/check-ignore.c | 4 ++-- > builtin/clean.c | 8 ++++---- > builtin/grep.c | 3 ++- > builtin/ls-files.c | 4 ++-- > builtin/stash.c | 4 ++-- > dir.c | 7 ++++++- > dir.h | 19 ++++++++++--------- > merge.c | 3 ++- > wt-status.c | 4 ++-- > 10 files changed, 34 insertions(+), 26 deletions(-) That patch itself looks good except for two minor points: > /* Frees memory within dir which was allocated. Does not free dir itself. */ > -void clear_directory(struct dir_struct *dir) > +void dir_free(struct dir_struct *dir) > { > int i, j; > struct exclude_list_group *group; As I mentioned in my previous email, I think it would be nice if this called dir_init() at the end, so that the struct is always in a consistent state. > diff --git a/dir.h b/dir.h > index 7d76d0644f..7c55c1a460 100644 > --- a/dir.h > +++ b/dir.h > [...] > - * - Use `dir.entries[]`. > + * - Use `dir.entries[]` and `dir.ignored[]`. > * > * - Call `clear_directory()` when the contained elements are no longer in use. > * This last line should become dir_free() / dir_clear(). -Peff
On Sun, Aug 16, 2020 at 2:11 AM Jeff King <peff@peff.net> wrote: > > On Sun, Aug 16, 2020 at 06:59:11AM +0000, Elijah Newren via GitGitGadget wrote: > > > Digging further, I found that despite the pretty clear documentation > > near the top of dir.h that folks were supposed to call clear_directory() > > when the user no longer needed the dir_struct, there were four callers > > that didn't bother doing that at all. However, two of them clearly > > thought about leaks since they had an UNLEAK(dir) directive, which to me > > suggests that the method to free the data was too unclear. I suspect > > the non-obviousness of the API and its holes led folks to avoid it, > > which then snowballed into further problems with the entries[], > > ignored[], parent_hashmap, and recursive_hashmap problems. > > The UNLEAK() ones are sort-of my fault, and are a combination of: > > - The commit adding them says "in some cases (e.g., dir_struct) we > don't even have a function which knows how to free all of the struct > members". I'm not sure if why I didn't fix clear_directory() as you > did. I may not have known about it, or I may have been lazy. Or more > likely I was simply holding the UNLEAK() hammer, so it looked like a > nail. ;) > > - My focus was on making leak-checker output cleaner. And it wasn't > clear for cases where we're about to exit the program whether we > should be actually freeing (which has small but non-zero cost) or > just annotating (which is zero-cost, but clearly has confused some > people about how UNLEAK() was meant to be used). I think I'm leaning > these days towards "free if it is easy to do so". > > So this definitely seems like a good direction to me. > > > Rename clear_directory() to dir_free() to be more in line with other > > data structures in git, and introduce a dir_init() to handle the > > suggested memsetting of dir_struct to all zeroes. I hope that a name > > like "dir_free()" is more clear, and that the presence of dir_init() > > will provide a hint to those looking at the code that there may be a > > corresponding dir_free() that they need to call. > > I think having a pair of matched calls is good. I _thought_ we had > established a pattern that we should prefer "clear" to "free" for cases > where the struct itself its not freed. But grepping around, I see there > are a few exceptions (hashmap_free() is the big one, and then > oidmap_free() which is built around it seems to have inherited it). > > The rest seem to follow that pattern, though: attr_check_free, > cache_tree_free, and submodule_cache_free all actually free the pointer > passed in. And "git grep '_clear(' *.h" shows lots of > clear-but-don't-free examples. > > Would dir_clear() make the pairing more obvious, but keep the clear > name? Sure, that works for me. The case where having an actual _free() makes sense to me -- possibly in addition to a _clear() -- is when some memory has to be allocated before first use, and thus a foo_clear() would leave that memory allocated. Then you really do need a foo_free() for use when the data structure won't be used again. > (I also wouldn't be opposed to changing hashmap and oidmap to use the > name "clear", but that's obviously a separate patch). hashmap is one of the cases that needs to have a free construct, because the table in which to stuff the entries has to be allocated and thus a hashmap_clear() would have to leave the table allocated if it wants to be ready for re-use. If someone really is done with a hashmap, then to avoid leaking, both the entries and the table need to be deallocated. I keep getting confused by the hashmap API, and what pieces it frees -- it looks like my earlier comments today were wrong and hashmap_free_entries() does free the table. So...perhaps I should create a patch to make that clearer, and also submit the patch I've had for a while to introduce a hashmap_clear() function (which is similar to hashmap_free_entries, in that it frees the entries and zeros out most of the map, but it leaves the table allocated and ready for use). I really wish hashmap_free() did what hashmap_free_entries() did. So annoying and counter-intuitive... > > builtin/add.c | 4 ++-- > > builtin/check-ignore.c | 4 ++-- > > builtin/clean.c | 8 ++++---- > > builtin/grep.c | 3 ++- > > builtin/ls-files.c | 4 ++-- > > builtin/stash.c | 4 ++-- > > dir.c | 7 ++++++- > > dir.h | 19 ++++++++++--------- > > merge.c | 3 ++- > > wt-status.c | 4 ++-- > > 10 files changed, 34 insertions(+), 26 deletions(-) > > That patch itself looks good except for two minor points: > > > /* Frees memory within dir which was allocated. Does not free dir itself. */ > > -void clear_directory(struct dir_struct *dir) > > +void dir_free(struct dir_struct *dir) > > { > > int i, j; > > struct exclude_list_group *group; > > As I mentioned in my previous email, I think it would be nice if this > called dir_init() at the end, so that the struct is always in a > consistent state. > > > diff --git a/dir.h b/dir.h > > index 7d76d0644f..7c55c1a460 100644 > > --- a/dir.h > > +++ b/dir.h > > [...] > > - * - Use `dir.entries[]`. > > + * - Use `dir.entries[]` and `dir.ignored[]`. > > * > > * - Call `clear_directory()` when the contained elements are no longer in use. > > * > > This last line should become dir_free() / dir_clear(). I'll fix these up. Elijah
On Mon, Aug 17, 2020 at 10:19:03AM -0700, Elijah Newren wrote: > > (I also wouldn't be opposed to changing hashmap and oidmap to use the > > name "clear", but that's obviously a separate patch). > > hashmap is one of the cases that needs to have a free construct, > because the table in which to stuff the entries has to be allocated > and thus a hashmap_clear() would have to leave the table allocated if > it wants to be ready for re-use. If someone really is done with a > hashmap, then to avoid leaking, both the entries and the table need to > be deallocated. Hmm, you're right. oidmap() will lazy-initialize the table, but hashmap will not. You _do_ have to initialize a hashmap because somebody has to set the comparison function. But that could be fixed, and would make it more like the rest of our code. I.e., you should be able to do: struct hashmap foo = HASHMAP_INIT(my_cmpfn); if (bar) return; /* no leak, because we never put anything in the map! */ ... add some stuff to the map ... hashmap_clear(&foo); ... now it's empty and nothing allocated; we could return here without a leak or we could add more stuff to it ... I don't even think it would be that big a change. Just translate a NULL table to "not found" on the read side, and lazily call alloc_table() on the write side. And have hashmap_free() not very the _whole_ struct, but leave the cmpfn in place. > I keep getting confused by the hashmap API, and what pieces it frees > -- it looks like my earlier comments today were wrong and > hashmap_free_entries() does free the table. So...perhaps I should > create a patch to make that clearer, and also submit the patch I've > had for a while to introduce a hashmap_clear() function (which is > similar to hashmap_free_entries, in that it frees the entries and > zeros out most of the map, but it leaves the table allocated and ready > for use). > > I really wish hashmap_free() did what hashmap_free_entries() did. So > annoying and counter-intuitive... I left some comments in my other reply, but in case you do pursue this: the obvious thing to have is a free_entries boolean parameter to the function, so that each caller is clear about what they want. And we used to have that. But it's awkward because "free the entries" isn't a boolean anymore; it's "free the entries you can find by moving backwards to this offset from the hashmap_entry pointer". So callers who don't want to free them have to pass some sentinel value there. And that's how we ended up with two separate wrapper functions. I think your main complaint is just about the naming though. If we had: /* drop all entries, freeing any hashmap-specific memory */ hashmap_clear(); /* ditto, but also free the entries themselves */ hashmap_clear_and_free_entires(); that would be a bit more obvious (though I imagine it would still be easy to forget that "clear" doesn't drop the entries). Another approach would be to have a flag in the map for "do I own the entry memory". Most callers are happy to hand off ownership of the entries when they're added. And it may even be that this would open up the possibility of more convenience functions on the adding/allocation side. I didn't think it through carefully, though. -Peff
diff --git a/builtin/add.c b/builtin/add.c index ab39a60a0d..46c9a0f552 100644 --- a/builtin/add.c +++ b/builtin/add.c @@ -534,11 +534,11 @@ int cmd_add(int argc, const char **argv, const char *prefix) die_in_unpopulated_submodule(&the_index, prefix); die_path_inside_submodule(&the_index, &pathspec); + dir_init(&dir); if (add_new_files) { int baselen; /* Set up the default git porcelain excludes */ - memset(&dir, 0, sizeof(dir)); if (!ignored_too) { dir.flags |= DIR_COLLECT_IGNORED; setup_standard_excludes(&dir); @@ -611,7 +611,7 @@ int cmd_add(int argc, const char **argv, const char *prefix) COMMIT_LOCK | SKIP_IF_UNCHANGED)) die(_("Unable to write new index file")); + dir_free(&dir); UNLEAK(pathspec); - UNLEAK(dir); return exit_status; } diff --git a/builtin/check-ignore.c b/builtin/check-ignore.c index ea5d0ae3a6..29b464a1fa 100644 --- a/builtin/check-ignore.c +++ b/builtin/check-ignore.c @@ -180,7 +180,7 @@ int cmd_check_ignore(int argc, const char **argv, const char *prefix) if (!no_index && read_cache() < 0) die(_("index file corrupt")); - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); setup_standard_excludes(&dir); if (stdin_paths) { @@ -190,7 +190,7 @@ int cmd_check_ignore(int argc, const char **argv, const char *prefix) maybe_flush_or_die(stdout, "ignore to stdout"); } - clear_directory(&dir); + dir_free(&dir); return !num_ignored; } diff --git a/builtin/clean.c b/builtin/clean.c index 4ffe00dd7f..a4cf58af74 100644 --- a/builtin/clean.c +++ b/builtin/clean.c @@ -667,7 +667,7 @@ static int filter_by_patterns_cmd(void) if (!confirm.len) break; - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); pl = add_pattern_list(&dir, EXC_CMDL, "manual exclude"); ignore_list = strbuf_split_max(&confirm, ' ', 0); @@ -698,7 +698,7 @@ static int filter_by_patterns_cmd(void) } strbuf_list_free(ignore_list); - clear_directory(&dir); + dir_free(&dir); } strbuf_release(&confirm); @@ -923,7 +923,7 @@ int cmd_clean(int argc, const char **argv, const char *prefix) argc = parse_options(argc, argv, prefix, options, builtin_clean_usage, 0); - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); if (!interactive && !dry_run && !force) { if (config_set) die(_("clean.requireForce set to true and neither -i, -n, nor -f given; " @@ -1021,7 +1021,7 @@ int cmd_clean(int argc, const char **argv, const char *prefix) string_list_append(&del_list, rel); } - clear_directory(&dir); + dir_free(&dir); if (interactive && del_list.nr > 0) interactive_main_loop(); diff --git a/builtin/grep.c b/builtin/grep.c index cee9db3477..fef2cc05ca 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -693,7 +693,7 @@ static int grep_directory(struct grep_opt *opt, const struct pathspec *pathspec, struct dir_struct dir; int i, hit = 0; - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); if (!use_index) dir.flags |= DIR_NO_GITLINKS; if (exc_std) @@ -705,6 +705,7 @@ static int grep_directory(struct grep_opt *opt, const struct pathspec *pathspec, if (hit && opt->status_only) break; } + dir_free(&dir); return hit; } diff --git a/builtin/ls-files.c b/builtin/ls-files.c index 30a4c10334..740feff53e 100644 --- a/builtin/ls-files.c +++ b/builtin/ls-files.c @@ -584,7 +584,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix) if (argc == 2 && !strcmp(argv[1], "-h")) usage_with_options(ls_files_usage, builtin_ls_files_options); - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); prefix = cmd_prefix; if (prefix) prefix_len = strlen(prefix); @@ -688,6 +688,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix) return bad ? 1 : 0; } - UNLEAK(dir); + dir_free(&dir); return 0; } diff --git a/builtin/stash.c b/builtin/stash.c index da48533d49..85b646b7ad 100644 --- a/builtin/stash.c +++ b/builtin/stash.c @@ -864,7 +864,7 @@ static int get_untracked_files(const struct pathspec *ps, int include_untracked, int found = 0; struct dir_struct dir; - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); if (include_untracked != INCLUDE_ALL_FILES) setup_standard_excludes(&dir); @@ -877,7 +877,7 @@ static int get_untracked_files(const struct pathspec *ps, int include_untracked, strbuf_addch(untracked_files, '\0'); } - clear_directory(&dir); + dir_free(&dir); return found; } diff --git a/dir.c b/dir.c index b136c037d9..e3c8e7ffd6 100644 --- a/dir.c +++ b/dir.c @@ -54,6 +54,11 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir, static int resolve_dtype(int dtype, struct index_state *istate, const char *path, int len); +void dir_init(struct dir_struct *dir) +{ + memset(dir, 0, sizeof(*dir)); +} + int count_slashes(const char *s) { int cnt = 0; @@ -3013,7 +3018,7 @@ int remove_path(const char *name) } /* Frees memory within dir which was allocated. Does not free dir itself. */ -void clear_directory(struct dir_struct *dir) +void dir_free(struct dir_struct *dir) { int i, j; struct exclude_list_group *group; diff --git a/dir.h b/dir.h index 7d76d0644f..7c55c1a460 100644 --- a/dir.h +++ b/dir.h @@ -19,22 +19,21 @@ * CE_SKIP_WORKTREE marked. If you want to exclude files, make sure you have * loaded the index first. * - * - Prepare `struct dir_struct dir` and clear it with `memset(&dir, 0, - * sizeof(dir))`. + * - Prepare `struct dir_struct dir` using `dir_init()` function. * * - To add single exclude pattern, call `add_pattern_list()` and then * `add_pattern()`. * * - To add patterns from a file (e.g. `.git/info/exclude`), call - * `add_patterns_from_file()` , and/or set `dir.exclude_per_dir`. A - * short-hand function `setup_standard_excludes()` can be used to set - * up the standard set of exclude settings. + * `add_patterns_from_file()` , and/or set `dir.exclude_per_dir`. * - * - Set options described in the Data Structure section above. + * - A short-hand function `setup_standard_excludes()` can be used to set + * up the standard set of exclude settings, instead of manually calling + * the add_pattern*() family of functions. * - * - Call `read_directory()`. + * - Call `fill_directory()`. * - * - Use `dir.entries[]`. + * - Use `dir.entries[]` and `dir.ignored[]`. * * - Call `clear_directory()` when the contained elements are no longer in use. * @@ -362,6 +361,8 @@ int match_pathspec(const struct index_state *istate, int report_path_error(const char *ps_matched, const struct pathspec *pathspec); int within_depth(const char *name, int namelen, int depth, int max_depth); +void dir_init(struct dir_struct *dir); + int fill_directory(struct dir_struct *dir, struct index_state *istate, const struct pathspec *pathspec); @@ -428,7 +429,7 @@ void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, i void add_pattern(const char *string, const char *base, int baselen, struct pattern_list *pl, int srcpos); void clear_pattern_list(struct pattern_list *pl); -void clear_directory(struct dir_struct *dir); +void dir_free(struct dir_struct *dir); int repo_file_exists(struct repository *repo, const char *path); int file_exists(const char *); diff --git a/merge.c b/merge.c index 753e461659..ee149b87ea 100644 --- a/merge.c +++ b/merge.c @@ -80,8 +80,8 @@ int checkout_fast_forward(struct repository *r, } memset(&opts, 0, sizeof(opts)); + dir_init(&dir); if (overwrite_ignore) { - memset(&dir, 0, sizeof(dir)); dir.flags |= DIR_SHOW_IGNORED; setup_standard_excludes(&dir); opts.dir = &dir; @@ -102,6 +102,7 @@ int checkout_fast_forward(struct repository *r, clear_unpack_trees_porcelain(&opts); return -1; } + dir_free(&dir); clear_unpack_trees_porcelain(&opts); if (write_locked_index(r->index, &lock_file, COMMIT_LOCK)) diff --git a/wt-status.c b/wt-status.c index c00ea3e06a..5624a8ff7f 100644 --- a/wt-status.c +++ b/wt-status.c @@ -703,7 +703,7 @@ static void wt_status_collect_untracked(struct wt_status *s) if (!s->show_untracked_files) return; - memset(&dir, 0, sizeof(dir)); + dir_init(&dir); if (s->show_untracked_files != SHOW_ALL_UNTRACKED_FILES) dir.flags |= DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES; @@ -732,7 +732,7 @@ static void wt_status_collect_untracked(struct wt_status *s) string_list_insert(&s->ignored, ent->name); } - clear_directory(&dir); + dir_free(&dir); if (advice_status_u_option) s->untracked_in_ms = (getnanotime() - t_begin) / 1000000;