diff mbox series

pack-objects: lazily set up "struct rev_info", don't leak

Message ID patch-1.1-193534b0f07-20220325T121715Z-avarab@gmail.com (mailing list archive)
State Superseded
Headers show
Series pack-objects: lazily set up "struct rev_info", don't leak | expand

Commit Message

Ævar Arnfjörð Bjarmason March 25, 2022, 2:25 p.m. UTC
In the preceding [1] (pack-objects: move revs out of
get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
to cmd_pack_objects() so that it unconditionally took place for all
invocations of "git pack-objects".

We'd thus start leaking memory, which is easily reproduced in
e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
information manager from hell, 2005-04-07) to "git pack-objects";

    $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
    [...]
	==19130==ERROR: LeakSanitizer: detected memory leaks

	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)

	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
	Aborted

Narrowly fixing that commit would have been easy, just add call
repo_init_revisions() right before get_object_list(), which is
effectively what was done before that commit.

But an unstated constraint when setting it up early is that it was
needed for the subsequent [2] (pack-objects: parse --filter directly
into revs.filter, 2022-03-22), i.e. we might have a --filter
command-line option, and need to either have the "struct rev_info"
setup when we encounter that option, or later.

Let's just change the control flow so that we'll instead set up the
"struct rev_info" only when we need it. Doing so leads to a bit more
verbosity, but it's a lot clearer what we're doing and why.

We could furthermore combine the two get_object_list() invocations
here by having repo_init_revisions() invoked on &pfd.revs, but I think
clearly separating the two makes the flow clearer. Likewise
redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
"have_revs" early in cmd_pack_objects().

This does add the future constraint to opt_parse_list_objects_filter()
that we'll need to adjust this wrapper code if it looks at any other
value of the "struct option" than the "value" member.

But that regression should be relatively easy to spot. I'm
intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
that various memory sanity checkers would spot that, we just
initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
on e.g. this test if we were to use another member of "opt" in
opt_parse_list_objects_filter()>

    ./t5317-pack-objects-filter-objects.sh -vixd --valgrind-only=3

While we're at it add parentheses around the arguments to the OPT_*
macros in in list-objects-filter-options.h, as we need to change those
lines anyway. It doesn't matter in this case, but is good general
practice.

1. https://lore.kernel.org/git/619b757d98465dbc4995bdc11a5282fbfcbd3daa.1647970119.git.gitgitgadget@gmail.com
2. https://lore.kernel.org/git/97de926904988b89b5663bd4c59c011a1723a8f5.1647970119.git.gitgitgadget@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

This is on top of ds/partial-bundle-more: I thought the fix for this
new leak was involved enough to propose it as a commit-on-top rather
than a fixup for a re-roll, especially since aside from the newly
leaked memory I don't think ds/partial-bundle-more is breaking
anything by doing that.

Except that is, interacting badly with my release_revisions() series
in "seen", which currently causes the "linux-leaks" job to fail there:
https://lore.kernel.org/git/cover-v2-00.27-00000000000-20220323T203149Z-avarab@gmail.com/

This is proper fix for the issue the interaction with my topic
revealed (not caused, we just started testing for this leak there),
i.e. it obsoletes the suggestion of adding an UNLEAK() there.

 builtin/pack-objects.c        | 30 +++++++++++++++++++++++++-----
 list-objects-filter-options.h |  8 +++++---
 2 files changed, 30 insertions(+), 8 deletions(-)

Comments

Derrick Stolee March 25, 2022, 2:57 p.m. UTC | #1
On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
> In the preceding [1] (pack-objects: move revs out of
> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
> to cmd_pack_objects() so that it unconditionally took place for all
> invocations of "git pack-objects".
> 
> We'd thus start leaking memory, which is easily reproduced in
> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
> information manager from hell, 2005-04-07) to "git pack-objects";
> 
>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>     [...]
> 	==19130==ERROR: LeakSanitizer: detected memory leaks
> 
> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
> 
> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
> 	Aborted
> 
> Narrowly fixing that commit would have been easy, just add call
> repo_init_revisions() right before get_object_list(), which is
> effectively what was done before that commit.
> 
> But an unstated constraint when setting it up early is that it was
> needed for the subsequent [2] (pack-objects: parse --filter directly
> into revs.filter, 2022-03-22), i.e. we might have a --filter
> command-line option, and need to either have the "struct rev_info"
> setup when we encounter that option, or later.
> 
> Let's just change the control flow so that we'll instead set up the
> "struct rev_info" only when we need it. Doing so leads to a bit more
> verbosity, but it's a lot clearer what we're doing and why.

This makes sense.

> We could furthermore combine the two get_object_list() invocations
> here by having repo_init_revisions() invoked on &pfd.revs, but I think
> clearly separating the two makes the flow clearer. Likewise
> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
> "have_revs" early in cmd_pack_objects().

I disagree, especially when you later want to make sure we free
the data from revs using your release_revisions().

> This does add the future constraint to opt_parse_list_objects_filter()
> that we'll need to adjust this wrapper code if it looks at any other
> value of the "struct option" than the "value" member.

So we are coupling ourselves to the implementation of this method.

> But that regression should be relatively easy to spot. I'm
> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
> that various memory sanity checkers would spot that, we just
> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
> on e.g. this test if we were to use another member of "opt" in
> opt_parse_list_objects_filter()>

So you are using uninitialized memory as a way to discover any
necessary changes to that coupling. I'm not sure this is super-safe
because we don't necessarily run memory checkers during CI builds.

I'd rather have a consistently initialized chunk of data that would
behave predictably (and hopefully we'd discover it is behaving
incorrectly with that predictable behavior).


> While we're at it add parentheses around the arguments to the OPT_*
> macros in in list-objects-filter-options.h, as we need to change those
> lines anyway. It doesn't matter in this case, but is good general
> practice.
> 
> 1. https://lore.kernel.org/git/619b757d98465dbc4995bdc11a5282fbfcbd3daa.1647970119.git.gitgitgadget@gmail.com
> 2. https://lore.kernel.org/git/97de926904988b89b5663bd4c59c011a1723a8f5.1647970119.git.gitgitgadget@gmail.com/
> 
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
> 
> This is on top of ds/partial-bundle-more: I thought the fix for this
> new leak was involved enough to propose it as a commit-on-top rather
> than a fixup for a re-roll, especially since aside from the newly
> leaked memory I don't think ds/partial-bundle-more is breaking
> anything by doing that.
> 
> Except that is, interacting badly with my release_revisions() series
> in "seen", which currently causes the "linux-leaks" job to fail there:
> https://lore.kernel.org/git/cover-v2-00.27-00000000000-20220323T203149Z-avarab@gmail.com/
> 
> This is proper fix for the issue the interaction with my topic
> revealed (not caused, we just started testing for this leak there),
> i.e. it obsoletes the suggestion of adding an UNLEAK() there.
> 
>  builtin/pack-objects.c        | 30 +++++++++++++++++++++++++-----
>  list-objects-filter-options.h |  8 +++++---
>  2 files changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index d39f668ad56..735080a4a95 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3857,6 +3857,23 @@ static int option_parse_unpack_unreachable(const struct option *opt,
>  	return 0;
>  }
>  
> +struct po_filter_data {
> +	unsigned have_revs:1;
> +	struct rev_info revs;
> +};
> +
> +static int po_filter_cb(const struct option *opt, const char *arg, int unset)
> +{
> +	struct po_filter_data *data = opt->value;
> +	struct option wrap; /* don't initialize! */
> +
> +	repo_init_revisions(the_repository, &data->revs, NULL);
> +	wrap.value = &data->revs.filter;
> +	data->have_revs = 1;
> +
> +	return opt_parse_list_objects_filter(&wrap, arg, unset);
> +}

The coupling here is unfortunate, but unavoidable. The future-proof
way to do it would be to modify opt->value and pass the rest of its
members as-is, but it's marked const so that's not an option.

One way to help make this coupling more obvious would be to move
this method into list-filter-options.c so we can have their
implementations adjacent and even refer to them.

Here is a potential version that looks like that:

--- >8 ---

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index f02d8df142..55c7131814 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -281,6 +281,10 @@ void parse_list_objects_filter(
 		die("%s", errbuf.buf);
 }
 
+/*
+ * If you change this to use any member in 'opt' other than 'value',
+ * then be sure to update opt_parse_list_objects_filter_in_revs().
+ */
 int opt_parse_list_objects_filter(const struct option *opt,
 				  const char *arg, int unset)
 {
@@ -293,6 +297,18 @@ int opt_parse_list_objects_filter(const struct option *opt,
 	return 0;
 }
 
+int opt_parse_list_objects_filter_in_revs(const struct option *opt,
+					  const char *arg, int unset)
+{
+	struct rev_info_maybe_empty *ri = opt->value;
+	struct option wrap = { .value = &ri->revs.filter };
+
+	repo_init_revisions(the_repository, &ri->revs, NULL);
+	ri->has_revs = 1;
+
+	return opt_parse_list_objects_filter(&wrap, arg, unset);
+}
+
 const char *list_objects_filter_spec(struct list_objects_filter_options *filter)
 {
 	if (!filter->filter_spec.nr)
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index 90e4bc9625..cf8b710a76 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -106,6 +106,8 @@ void parse_list_objects_filter(
 
 int opt_parse_list_objects_filter(const struct option *opt,
 				  const char *arg, int unset);
+int opt_parse_list_objects_filter_in_revs(const struct option *opt,
+					  const char *arg, int unset);
 
 #define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
 	OPT_CALLBACK(0, "filter", fo, N_("args"), \
diff --git a/revision.h b/revision.h
index 5bc59c7bfe..95aa3ee891 100644
--- a/revision.h
+++ b/revision.h
@@ -329,6 +329,11 @@ struct rev_info {
 	struct tmp_objdir *remerge_objdir;
 };
 
+struct rev_info_maybe_empty {
+	int has_revs;
+	struct rev_info revs;
+};
+
 int ref_excluded(struct string_list *, const char *path);
 void clear_ref_exclusion(struct string_list **);
 void add_ref_exclusion(struct string_list **, const char *exclude);

--- >8 ---


> +	} else if (pfd.have_revs) {
> +		get_object_list(&pfd.revs, rp.nr, rp.v);
>  	} else {
> +		struct rev_info revs;
> +
> +		repo_init_revisions(the_repository, &revs, NULL);
>  		get_object_list(&revs, rp.nr, rp.v);
>  	}

Here, I think it would be better to have

	else {
		if (!pfd.have_revs) {
			repo_init_revisions(the_repository, &pfd.revs, NULL);
			pfd.have_revs = 1;
		}
		get_object_list(&pfd.revs, rp.nr, rp.v);
	}

and then later you can add

	if (pfd.have_revs)
		release_revisions(&pfd.revs);

to clear the memory in exactly one place.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 25, 2022, 4 p.m. UTC | #2
On Fri, Mar 25 2022, Derrick Stolee wrote:

> On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
>> In the preceding [1] (pack-objects: move revs out of
>> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
>> to cmd_pack_objects() so that it unconditionally took place for all
>> invocations of "git pack-objects".
>> 
>> We'd thus start leaking memory, which is easily reproduced in
>> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
>> information manager from hell, 2005-04-07) to "git pack-objects";
>> 
>>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>>     [...]
>> 	==19130==ERROR: LeakSanitizer: detected memory leaks
>> 
>> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
>> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
>> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
>> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
>> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
>> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
>> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
>> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
>> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
>> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
>> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
>> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
>> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
>> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
>> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
>> 
>> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
>> 	Aborted
>> 
>> Narrowly fixing that commit would have been easy, just add call
>> repo_init_revisions() right before get_object_list(), which is
>> effectively what was done before that commit.
>> 
>> But an unstated constraint when setting it up early is that it was
>> needed for the subsequent [2] (pack-objects: parse --filter directly
>> into revs.filter, 2022-03-22), i.e. we might have a --filter
>> command-line option, and need to either have the "struct rev_info"
>> setup when we encounter that option, or later.
>> 
>> Let's just change the control flow so that we'll instead set up the
>> "struct rev_info" only when we need it. Doing so leads to a bit more
>> verbosity, but it's a lot clearer what we're doing and why.
>
> This makes sense.
>
>> We could furthermore combine the two get_object_list() invocations
>> here by having repo_init_revisions() invoked on &pfd.revs, but I think
>> clearly separating the two makes the flow clearer. Likewise
>> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
>> "have_revs" early in cmd_pack_objects().
>
> I disagree, especially when you later want to make sure we free
> the data from revs using your release_revisions().

Because we'll need two release_revisions() instead of one?

>> This does add the future constraint to opt_parse_list_objects_filter()
>> that we'll need to adjust this wrapper code if it looks at any other
>> value of the "struct option" than the "value" member.
>
> So we are coupling ourselves to the implementation of this method.

Sure, but aren't all OPT_* where the macro is providing some custom
callback coupling themselves to how it's implemented?

>> But that regression should be relatively easy to spot. I'm
>> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
>> that various memory sanity checkers would spot that, we just
>> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
>> on e.g. this test if we were to use another member of "opt" in
>> opt_parse_list_objects_filter()>
>
> So you are using uninitialized memory as a way to discover any
> necessary changes to that coupling. I'm not sure this is super-safe
> because we don't necessarily run memory checkers during CI builds.
>
> I'd rather have a consistently initialized chunk of data that would
> behave predictably (and hopefully we'd discover it is behaving
> incorrectly with that predictable behavior).

I'd like to keep this as-is, i.e. if you zero it out it might also
behave unpredictably or even in undefined ways (e.g. a NULL ptr
dereference). Likewise in my version, but it has the benefit of being
caught by some tooling we have.

>> +struct po_filter_data {
>> +	unsigned have_revs:1;
>> +	struct rev_info revs;
>> +};
>> +
>> +static int po_filter_cb(const struct option *opt, const char *arg, int unset)
>> +{
>> +	struct po_filter_data *data = opt->value;
>> +	struct option wrap; /* don't initialize! */
>> +
>> +	repo_init_revisions(the_repository, &data->revs, NULL);
>> +	wrap.value = &data->revs.filter;
>> +	data->have_revs = 1;
>> +
>> +	return opt_parse_list_objects_filter(&wrap, arg, unset);
>> +}
>
> The coupling here is unfortunate, but unavoidable. The future-proof
> way to do it would be to modify opt->value and pass the rest of its
> members as-is, but it's marked const so that's not an option.

We could always cast it..., but that would probably be more nasty.

> One way to help make this coupling more obvious would be to move
> this method into list-filter-options.c so we can have their
> implementations adjacent and even refer to them.
>
> Here is a potential version that looks like that:
>
> --- >8 ---
>
> diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> index f02d8df142..55c7131814 100644
> --- a/list-objects-filter-options.c
> +++ b/list-objects-filter-options.c
> @@ -281,6 +281,10 @@ void parse_list_objects_filter(
>  		die("%s", errbuf.buf);
>  }
>  
> +/*
> + * If you change this to use any member in 'opt' other than 'value',
> + * then be sure to update opt_parse_list_objects_filter_in_revs().
> + */
>  int opt_parse_list_objects_filter(const struct option *opt,
>  				  const char *arg, int unset)
>  {
> @@ -293,6 +297,18 @@ int opt_parse_list_objects_filter(const struct option *opt,
>  	return 0;
>  }
>  
> +int opt_parse_list_objects_filter_in_revs(const struct option *opt,
> +					  const char *arg, int unset)
> +{
> +	struct rev_info_maybe_empty *ri = opt->value;
> +	struct option wrap = { .value = &ri->revs.filter };
> +
> +	repo_init_revisions(the_repository, &ri->revs, NULL);
> +	ri->has_revs = 1;
> +
> +	return opt_parse_list_objects_filter(&wrap, arg, unset);
> +}
> +
>  const char *list_objects_filter_spec(struct list_objects_filter_options *filter)
>  {
>  	if (!filter->filter_spec.nr)
> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> index 90e4bc9625..cf8b710a76 100644
> --- a/list-objects-filter-options.h
> +++ b/list-objects-filter-options.h
> @@ -106,6 +106,8 @@ void parse_list_objects_filter(
>  
>  int opt_parse_list_objects_filter(const struct option *opt,
>  				  const char *arg, int unset);
> +int opt_parse_list_objects_filter_in_revs(const struct option *opt,
> +					  const char *arg, int unset);
>  
>  #define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
>  	OPT_CALLBACK(0, "filter", fo, N_("args"), \
> diff --git a/revision.h b/revision.h
> index 5bc59c7bfe..95aa3ee891 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -329,6 +329,11 @@ struct rev_info {
>  	struct tmp_objdir *remerge_objdir;
>  };
>  
> +struct rev_info_maybe_empty {
> +	int has_revs;
> +	struct rev_info revs;
> +};
> +
>  int ref_excluded(struct string_list *, const char *path);
>  void clear_ref_exclusion(struct string_list **);
>  void add_ref_exclusion(struct string_list **, const char *exclude);
>
> --- >8 ---
>
>
>> +	} else if (pfd.have_revs) {
>> +		get_object_list(&pfd.revs, rp.nr, rp.v);
>>  	} else {
>> +		struct rev_info revs;
>> +
>> +		repo_init_revisions(the_repository, &revs, NULL);
>>  		get_object_list(&revs, rp.nr, rp.v);
>>  	}
>
> Here, I think it would be better to have
>
> 	else {
> 		if (!pfd.have_revs) {
> 			repo_init_revisions(the_repository, &pfd.revs, NULL);
> 			pfd.have_revs = 1;
> 		}
> 		get_object_list(&pfd.revs, rp.nr, rp.v);
> 	}

Conceptually I think the saving of that one line isn't worth it to the
reader.

Then you'd need to read up and see exactly how pfd.revs gets mutated,
and its callback etc., only to see we're just doing this to save
ourselves a variable deceleration and a call to release_revisions().

> and then later you can add
>
> 	if (pfd.have_revs)
> 		release_revisions(&pfd.revs);
>
> to clear the memory in exactly one place.

Yeah, it would work. I'd just prefer control flow that's trivial to
reason about over saving a couple of lines here.
Derrick Stolee March 25, 2022, 4:41 p.m. UTC | #3
On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 25 2022, Derrick Stolee wrote:
> 
>> On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
>>> In the preceding [1] (pack-objects: move revs out of
>>> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
>>> to cmd_pack_objects() so that it unconditionally took place for all
>>> invocations of "git pack-objects".
>>>
>>> We'd thus start leaking memory, which is easily reproduced in
>>> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
>>> information manager from hell, 2005-04-07) to "git pack-objects";
>>>
>>>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>>>     [...]
>>> 	==19130==ERROR: LeakSanitizer: detected memory leaks
>>>
>>> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
>>> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
>>> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
>>> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
>>> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
>>> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
>>> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
>>> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
>>> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
>>> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
>>> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
>>> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
>>> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
>>> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
>>> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
>>>
>>> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
>>> 	Aborted
>>>
>>> Narrowly fixing that commit would have been easy, just add call
>>> repo_init_revisions() right before get_object_list(), which is
>>> effectively what was done before that commit.
>>>
>>> But an unstated constraint when setting it up early is that it was
>>> needed for the subsequent [2] (pack-objects: parse --filter directly
>>> into revs.filter, 2022-03-22), i.e. we might have a --filter
>>> command-line option, and need to either have the "struct rev_info"
>>> setup when we encounter that option, or later.
>>>
>>> Let's just change the control flow so that we'll instead set up the
>>> "struct rev_info" only when we need it. Doing so leads to a bit more
>>> verbosity, but it's a lot clearer what we're doing and why.
>>
>> This makes sense.
>>
>>> We could furthermore combine the two get_object_list() invocations
>>> here by having repo_init_revisions() invoked on &pfd.revs, but I think
>>> clearly separating the two makes the flow clearer. Likewise
>>> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
>>> "have_revs" early in cmd_pack_objects().
>>
>> I disagree, especially when you later want to make sure we free
>> the data from revs using your release_revisions().
> 
> Because we'll need two release_revisions() instead of one?

And it would be easy to miss one of the two if you are only testing
certain paths with leak-check on.

>>> This does add the future constraint to opt_parse_list_objects_filter()
>>> that we'll need to adjust this wrapper code if it looks at any other
>>> value of the "struct option" than the "value" member.
>>
>> So we are coupling ourselves to the implementation of this method.
> 
> Sure, but aren't all OPT_* where the macro is providing some custom
> callback coupling themselves to how it's implemented?

No, I mean that the callback function you are making here is coupling
itself to the existing callback function in a novel way.

I tried to find another example of a nested callback, but I didn't
even find another instance of creating a new 'struct option'.

>>> But that regression should be relatively easy to spot. I'm
>>> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
>>> that various memory sanity checkers would spot that, we just
>>> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
>>> on e.g. this test if we were to use another member of "opt" in
>>> opt_parse_list_objects_filter()>
>>
>> So you are using uninitialized memory as a way to discover any
>> necessary changes to that coupling. I'm not sure this is super-safe
>> because we don't necessarily run memory checkers during CI builds.
>>
>> I'd rather have a consistently initialized chunk of data that would
>> behave predictably (and hopefully we'd discover it is behaving
>> incorrectly with that predictable behavior).
> 
> I'd like to keep this as-is, i.e. if you zero it out it might also
> behave unpredictably or even in undefined ways (e.g. a NULL ptr
> dereference). Likewise in my version, but it has the benefit of being
> caught by some tooling we have.

I guess by "predictable" I mean "well-defined" or "deterministic".

I don't want the build options to change how this works (for instance,
debug builds sometimes initialize data to zero).

>> +struct rev_info_maybe_empty {
>> +	int has_revs;
>> +	struct rev_info revs;
>> +};

Thinking about this a second time, perhaps it would be best to add
an "unsigned initialized:1;" to struct rev_info so we can look at
such a struct and know whether or not repo_init_revisions() has
been run or not. Avoids the custom struct and unifies a few things.

In particular, release_revisions() could choose to do nothing if
revs->initialized is false.

Further, a second repo_init_revisions() could do nothing if
revs->initialized is true. This allows us to safely "re-init"
without our own "if (has_revs)" checks...
>> 	else {
>> 		if (!pfd.have_revs) {
>> 			repo_init_revisions(the_repository, &pfd.revs, NULL);
>> 			pfd.have_revs = 1;
>> 		}
>> 		get_object_list(&pfd.revs, rp.nr, rp.v);
>> 	}

...making this just

	else {
		repo_init_revisions(the_repository, &revs, NULL);
		get_object_list(&revs. rp.nr, rp.v);
	}

> Conceptually I think the saving of that one line isn't worth it to the
> reader.
> 
> Then you'd need to read up and see exactly how pfd.revs gets mutated,
> and its callback etc., only to see we're just doing this to save
> ourselves a variable deceleration and a call to release_revisions().
> 
>> and then later you can add
>>
>> 	if (pfd.have_revs)
>> 		release_revisions(&pfd.revs);

And this would just be

	release_revisions(&revs);

>> to clear the memory in exactly one place.
> 
> Yeah, it would work. I'd just prefer control flow that's trivial to
> reason about over saving a couple of lines here.

I think having multiple revision things that can live at different
levels (one embedded in a custom struct, one not) is not trivial to
reason about. If we change the "is this initialized" indicator to be
within 'struct rev_info', then this gets simpler.

It seems to me that it is easier to track a single struct and release
the memory in one place based on its lifespan.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 25, 2022, 5:34 p.m. UTC | #4
On Fri, Mar 25 2022, Derrick Stolee wrote:

> On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Fri, Mar 25 2022, Derrick Stolee wrote:
>> [...]
>>>> We could furthermore combine the two get_object_list() invocations
>>>> here by having repo_init_revisions() invoked on &pfd.revs, but I think
>>>> clearly separating the two makes the flow clearer. Likewise
>>>> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
>>>> "have_revs" early in cmd_pack_objects().
>>>
>>> I disagree, especially when you later want to make sure we free
>>> the data from revs using your release_revisions().
>> 
>> Because we'll need two release_revisions() instead of one?
>
> And it would be easy to miss one of the two if you are only testing
> certain paths with leak-check on.
>
>>>> This does add the future constraint to opt_parse_list_objects_filter()
>>>> that we'll need to adjust this wrapper code if it looks at any other
>>>> value of the "struct option" than the "value" member.
>>>
>>> So we are coupling ourselves to the implementation of this method.
>> 
>> Sure, but aren't all OPT_* where the macro is providing some custom
>> callback coupling themselves to how it's implemented?
>
> No, I mean that the callback function you are making here is coupling
> itself to the existing callback function in a novel way.

Yeah, I did start by just copy/pasting its contents, or I could have
both of them call a semi-"private" wrapper, would you prefer one of
those instead?

> I tried to find another example of a nested callback, but I didn't
> even find another instance of creating a new 'struct option'.

I haven't seen a nested one, FWIW OPT_ALIAS() creates a new "struct
option", but in a completely unrelated way, what I'm doing here is
probably novel within the codebase...

>>>> But that regression should be relatively easy to spot. I'm
>>>> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
>>>> that various memory sanity checkers would spot that, we just
>>>> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
>>>> on e.g. this test if we were to use another member of "opt" in
>>>> opt_parse_list_objects_filter()>
>>>
>>> So you are using uninitialized memory as a way to discover any
>>> necessary changes to that coupling. I'm not sure this is super-safe
>>> because we don't necessarily run memory checkers during CI builds.
>>>
>>> I'd rather have a consistently initialized chunk of data that would
>>> behave predictably (and hopefully we'd discover it is behaving
>>> incorrectly with that predictable behavior).
>> 
>> I'd like to keep this as-is, i.e. if you zero it out it might also
>> behave unpredictably or even in undefined ways (e.g. a NULL ptr
>> dereference). Likewise in my version, but it has the benefit of being
>> caught by some tooling we have.
>
> I guess by "predictable" I mean "well-defined" or "deterministic".
>
> I don't want the build options to change how this works (for instance,
> debug builds sometimes initialize data to zero).

Yeah I agree that does suck.

Honestly I don't care that much in this case, I just wanted to resolve
the topic conflict.

In the general case though I think it's much better to have a flaky
failure caught on some platforms, via valgrind or whatever because we
used uninitialized memory than to have a silent failure (e.g. because
it's always an int=0, but we expected something else...)

>>> +struct rev_info_maybe_empty {
>>> +	int has_revs;
>>> +	struct rev_info revs;
>>> +};
>
> Thinking about this a second time, perhaps it would be best to add
> an "unsigned initialized:1;" to struct rev_info so we can look at
> such a struct and know whether or not repo_init_revisions() has
> been run or not. Avoids the custom struct and unifies a few things.
>
> In particular, release_revisions() could choose to do nothing if
> revs->initialized is false.

This plan won't work because that behavior is both undefined per the
standard, and something that's wildly undefined in practice.

I.e. we initialize it on the stack, so it'll point to uninitialized
memory, sometimes that bit will be 0, sometimes 1...

If you mean just initialize it to { 0 } or whatever that would work,
yes, but if we're going to refactor all the callers to do that we might
as well refactor the few missing bits that would be needed to initialize
it statically, and drop the dynamic by default initialization...

> Further, a second repo_init_revisions() could do nothing if
> revs->initialized is true. This allows us to safely "re-init"
> without our own "if (has_revs)" checks...

Yeah if you had a previous repo_init_revisions() you could rely on that.

>>> 	else {
>>> 		if (!pfd.have_revs) {
>>> 			repo_init_revisions(the_repository, &pfd.revs, NULL);
>>> 			pfd.have_revs = 1;
>>> 		}
>>> 		get_object_list(&pfd.revs, rp.nr, rp.v);
>>> 	}
>
> ...making this just
>
> 	else {
> 		repo_init_revisions(the_repository, &revs, NULL);
> 		get_object_list(&revs. rp.nr, rp.v);
> 	}
>
>> Conceptually I think the saving of that one line isn't worth it to the
>> reader.
>> 
>> Then you'd need to read up and see exactly how pfd.revs gets mutated,
>> and its callback etc., only to see we're just doing this to save
>> ourselves a variable deceleration and a call to release_revisions().
>> 
>>> and then later you can add
>>>
>>> 	if (pfd.have_revs)
>>> 		release_revisions(&pfd.revs);
>
> And this would just be
>
> 	release_revisions(&revs);
>
>>> to clear the memory in exactly one place.
>> 
>> Yeah, it would work. I'd just prefer control flow that's trivial to
>> reason about over saving a couple of lines here.
>
> I think having multiple revision things that can live at different
> levels (one embedded in a custom struct, one not) is not trivial to
> reason about. If we change the "is this initialized" indicator to be
> within 'struct rev_info', then this gets simpler.

I meant: if you want to read that code through without considering the
"filter" case it's obvious that you can skip the whole "pfd.revs" part
in that case, and that it's only something to do with "filter".

> It seems to me that it is easier to track a single struct and release
> the memory in one place based on its lifespan.

I submitted a patch on top because it looked like your topic was going
to be merged to "next" soon, and I really didn't care much in this case.

But FWIW I think a much more obvious thing to do overall would be to
skip the whole "filter bust me in rev_info" refactoring part of your
series and just add a trivial list_objects_filter_copy_attach() method,
or do it inline with memcpy/memset.

I.e. to not touch the "filter" etc. callback stuff at all, still pass it
to get_object_list(). Can't 2/5 and 3/5 in your series be replaced by
this simpler and smaller change?:

	diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
	index 829ca359cf9..4ce76a378c8 100644
	--- a/builtin/pack-objects.c
	+++ b/builtin/pack-objects.c
	@@ -237,8 +237,6 @@ static unsigned long cache_max_small_delta_size = 1000;
	 
	 static unsigned long window_memory_limit = 0;
	 
	-static struct list_objects_filter_options filter_options;
	-
	 static struct string_list uri_protocols = STRING_LIST_INIT_NODUP;
	 
	 enum missing_action {
	@@ -3714,7 +3712,8 @@ static void mark_bitmap_preferred_tips(void)
	 	}
	 }
	 
	-static void get_object_list(int ac, const char **av)
	+static void get_object_list(int ac, const char **av,
	+			    struct list_objects_filter_options *filter)
	 {
	 	struct rev_info revs;
	 	struct setup_revision_opt s_r_opt = {
	@@ -3727,7 +3726,9 @@ static void get_object_list(int ac, const char **av)
	 	repo_init_revisions(the_repository, &revs, NULL);
	 	save_commit_buffer = 0;
	 	setup_revisions(ac, av, &revs, &s_r_opt);
	-	list_objects_filter_copy(&revs.filter, &filter_options);
	+	/* attach our CLI --filter to rev_info's filter */
	+	memcpy(&revs.filter, filter, sizeof(*filter));
	+	memset(filter, 0, sizeof(*filter));
	 
	 	/* make sure shallows are read */
	 	is_repository_shallow(the_repository);
	@@ -3872,6 +3873,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
	 	int rev_list_index = 0;
	 	int stdin_packs = 0;
	 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
	+	struct list_objects_filter_options filter_options = { 0 };
	 	struct option pack_objects_options[] = {
	 		OPT_SET_INT('q', "quiet", &progress,
	 			    N_("do not show progress meter"), 0),
	@@ -4154,7 +4156,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
	 	} else if (!use_internal_rev_list) {
	 		read_object_list_from_stdin();
	 	} else {
	-		get_object_list(rp.nr, rp.v);
	+		get_object_list(rp.nr, rp.v, &filter_options);
	 	}
	 	cleanup_preferred_base();
	 	if (include_tag && nr_result)

And even most of that could be omitted by not removing the global
"static struct" since pack-objects is a one-off anyway ... :)
Junio C Hamano March 25, 2022, 6:53 p.m. UTC | #5
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> In the preceding [1] (pack-objects: move revs out of
> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
> to cmd_pack_objects() so that it unconditionally took place for all
> invocations of "git pack-objects".
>
> We'd thus start leaking memory, which is easily reproduced in
> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
> information manager from hell, 2005-04-07) to "git pack-objects";
> ...
> Narrowly fixing that commit would have been easy, just add call
> repo_init_revisions() right before get_object_list(), which is
> effectively what was done before that commit.
>
> But an unstated constraint when setting it up early is that it was
> needed for the subsequent [2] (pack-objects: parse --filter directly
> into revs.filter, 2022-03-22), i.e. we might have a --filter
> command-line option, and need to either have the "struct rev_info"
> setup when we encounter that option, or later.
>
> Let's just change the control flow so that we'll instead set up the
> "struct rev_info" only when we need it. Doing so leads to a bit more
> verbosity, but it's a lot clearer what we're doing and why.

Is this about "we take it as given that the use of rev_info leaks
until we fix revisions API, so let's keep its use limited to avoid
unnecessary leaks"?

If so, it sort-of makes sense, but smells like a roundabout way to
address the issue.  An obvious alternative is to wait until both the
topic and the "plug revision API" topic graduate and then add a
"release" call to release the resource in the same sope as the
unconditional call to init_revisions at the end.  I do not quite get
what on-demand lazy set-up buys us.  What we need to lazily set-up,
when we do lazily set-up, needs to be released either way, no?
Derrick Stolee March 25, 2022, 7:08 p.m. UTC | #6
On 3/25/2022 1:34 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 25 2022, Derrick Stolee wrote:
> 
>> On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> +struct rev_info_maybe_empty {
>>>> +	int has_revs;
>>>> +	struct rev_info revs;
>>>> +};
>>
>> Thinking about this a second time, perhaps it would be best to add
>> an "unsigned initialized:1;" to struct rev_info so we can look at
>> such a struct and know whether or not repo_init_revisions() has
>> been run or not. Avoids the custom struct and unifies a few things.
>>
>> In particular, release_revisions() could choose to do nothing if
>> revs->initialized is false.
> 
> This plan won't work because that behavior is both undefined per the
> standard, and something that's wildly undefined in practice.
> 
> I.e. we initialize it on the stack, so it'll point to uninitialized
> memory, sometimes that bit will be 0, sometimes 1...
> 
> If you mean just initialize it to { 0 } or whatever that would work,
> yes, but if we're going to refactor all the callers to do that we might
> as well refactor the few missing bits that would be needed to initialize
> it statically, and drop the dynamic by default initialization...

Yes, I was assuming that we initialize all structs to all-zero,
but the existing failure to do this will cause such a change too
large for this issue.

> But FWIW I think a much more obvious thing to do overall would be to
> skip the whole "filter bust me in rev_info" refactoring part of your
> series and just add a trivial list_objects_filter_copy_attach() method,
> or do it inline with memcpy/memset.
> 
> I.e. to not touch the "filter" etc. callback stuff at all, still pass it
> to get_object_list(). Can't 2/5 and 3/5 in your series be replaced by
> this simpler and smaller change?:

> 	-	list_objects_filter_copy(&revs.filter, &filter_options);
> 	+	/* attach our CLI --filter to rev_info's filter */
> 	+	memcpy(&revs.filter, filter, sizeof(*filter));
> 	+	memset(filter, 0, sizeof(*filter));

Here, you are removing a deep copy with a shallow copy. After this,
freeing the arrays within revs.filter would cause a double-free when
freeing the arrays in the original filter_options.

If you went this way, then you could do a s/&filter_options/filter/
in the existing line.

> 	 	/* make sure shallows are read */
> 	 	is_repository_shallow(the_repository);
> 	@@ -3872,6 +3873,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> 	 	int rev_list_index = 0;
> 	 	int stdin_packs = 0;
> 	 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
> 	+	struct list_objects_filter_options filter_options = { 0 };
> 	 	struct option pack_objects_options[] = {
> 	 		OPT_SET_INT('q', "quiet", &progress,
> 	 			    N_("do not show progress meter"), 0),
> 	@@ -4154,7 +4156,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> 	 	} else if (!use_internal_rev_list) {
> 	 		read_object_list_from_stdin();
> 	 	} else {
> 	-		get_object_list(rp.nr, rp.v);
> 	+		get_object_list(rp.nr, rp.v, &filter_options);
> 	 	}
> 	 	cleanup_preferred_base();
> 	 	if (include_tag && nr_result)
> 
> And even most of that could be omitted by not removing the global
> "static struct" since pack-objects is a one-off anyway ... :)

Even if you fix the deep/shallow copy above, you still need to
clean up the filter in two places.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 26, 2022, 12:52 a.m. UTC | #7
On Fri, Mar 25 2022, Derrick Stolee wrote:

> On 3/25/2022 1:34 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Fri, Mar 25 2022, Derrick Stolee wrote:
>> 
>>> On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>> +struct rev_info_maybe_empty {
>>>>> +	int has_revs;
>>>>> +	struct rev_info revs;
>>>>> +};
>>>
>>> Thinking about this a second time, perhaps it would be best to add
>>> an "unsigned initialized:1;" to struct rev_info so we can look at
>>> such a struct and know whether or not repo_init_revisions() has
>>> been run or not. Avoids the custom struct and unifies a few things.
>>>
>>> In particular, release_revisions() could choose to do nothing if
>>> revs->initialized is false.
>> 
>> This plan won't work because that behavior is both undefined per the
>> standard, and something that's wildly undefined in practice.
>> 
>> I.e. we initialize it on the stack, so it'll point to uninitialized
>> memory, sometimes that bit will be 0, sometimes 1...
>> 
>> If you mean just initialize it to { 0 } or whatever that would work,
>> yes, but if we're going to refactor all the callers to do that we might
>> as well refactor the few missing bits that would be needed to initialize
>> it statically, and drop the dynamic by default initialization...
>
> Yes, I was assuming that we initialize all structs to all-zero,
> but the existing failure to do this will cause such a change too
> large for this issue.

I don't see how that wouldn't be a regression on the upthread patch in
the sense that yes, we could of course initialize it, but the whole
point of not doing so was to have our tooling detect if the downstream
code assumed it could start using a struct member we hadn't filled in.

By initializing it we'll never know.

But yes, if you consider that a non-goal then init to "{ 0 }" makes the
most sense.

>> But FWIW I think a much more obvious thing to do overall would be to
>> skip the whole "filter bust me in rev_info" refactoring part of your
>> series and just add a trivial list_objects_filter_copy_attach() method,
>> or do it inline with memcpy/memset.
>> 
>> I.e. to not touch the "filter" etc. callback stuff at all, still pass it
>> to get_object_list(). Can't 2/5 and 3/5 in your series be replaced by
>> this simpler and smaller change?:
>
>> 	-	list_objects_filter_copy(&revs.filter, &filter_options);
>> 	+	/* attach our CLI --filter to rev_info's filter */
>> 	+	memcpy(&revs.filter, filter, sizeof(*filter));
>> 	+	memset(filter, 0, sizeof(*filter));
>
> Here, you are removing a deep copy with a shallow copy. After this,
> freeing the arrays within revs.filter would cause a double-free when
> freeing the arrays in the original filter_options.

Yes, and that's what we want, right? I.e. we don't want a copy, but to
use the &filter for parse_options(), then once that's populated we
shallow-copy that to "struct rev_info"'s "filter", and forget about our
own copy (i.e. the memset there is redundant, but just a "let's not use
this again) marker.

Of course this will leak now, but once merged with my
release_revisions() patch will work, and we'll free what we allocated
(once!).

> If you went this way, then you could do a s/&filter_options/filter/
> in the existing line.
>
>> 	 	/* make sure shallows are read */
>> 	 	is_repository_shallow(the_repository);
>> 	@@ -3872,6 +3873,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>> 	 	int rev_list_index = 0;
>> 	 	int stdin_packs = 0;
>> 	 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>> 	+	struct list_objects_filter_options filter_options = { 0 };
>> 	 	struct option pack_objects_options[] = {
>> 	 		OPT_SET_INT('q', "quiet", &progress,
>> 	 			    N_("do not show progress meter"), 0),
>> 	@@ -4154,7 +4156,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>> 	 	} else if (!use_internal_rev_list) {
>> 	 		read_object_list_from_stdin();
>> 	 	} else {
>> 	-		get_object_list(rp.nr, rp.v);
>> 	+		get_object_list(rp.nr, rp.v, &filter_options);
>> 	 	}
>> 	 	cleanup_preferred_base();
>> 	 	if (include_tag && nr_result)
>> 
>> And even most of that could be omitted by not removing the global
>> "static struct" since pack-objects is a one-off anyway ... :)
>
> Even if you fix the deep/shallow copy above, you still need to
> clean up the filter in two places.

If you "fix" the shallow copying you need to free it twice, but if you
don't you free it once.

I.e. this is conceptually the same as strbuf_detach() + strbuf_attach().

But maybe I'm missing something...

(If I am it's rather worrying that it passed all our tests, both in your
series + merged with the release_revisions() series).
Ævar Arnfjörð Bjarmason March 26, 2022, 1:09 a.m. UTC | #8
On Fri, Mar 25 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> In the preceding [1] (pack-objects: move revs out of
>> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
>> to cmd_pack_objects() so that it unconditionally took place for all
>> invocations of "git pack-objects".
>>
>> We'd thus start leaking memory, which is easily reproduced in
>> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
>> information manager from hell, 2005-04-07) to "git pack-objects";
>> ...
>> Narrowly fixing that commit would have been easy, just add call
>> repo_init_revisions() right before get_object_list(), which is
>> effectively what was done before that commit.
>>
>> But an unstated constraint when setting it up early is that it was
>> needed for the subsequent [2] (pack-objects: parse --filter directly
>> into revs.filter, 2022-03-22), i.e. we might have a --filter
>> command-line option, and need to either have the "struct rev_info"
>> setup when we encounter that option, or later.
>>
>> Let's just change the control flow so that we'll instead set up the
>> "struct rev_info" only when we need it. Doing so leads to a bit more
>> verbosity, but it's a lot clearer what we're doing and why.
>
> Is this about "we take it as given that the use of rev_info leaks
> until we fix revisions API, so let's keep its use limited to avoid
> unnecessary leaks"?

Not exactly,. When you use the revisions API to do "filter" stuff in
this codepath it leaks both before & after Derrick's patches, so nothing
has changed in that case, but...

> If so, it sort-of makes sense, but smells like a roundabout way to
> address the issue.  An obvious alternative is to wait until both the
> topic and the "plug revision API" topic graduate and then add a
> "release" call to release the resource in the same sope as the
> unconditional call to init_revisions at the end.  I do not quite get
> what on-demand lazy set-up buys us.  What we need to lazily set-up,
> when we do lazily set-up, needs to be released either way, no?

...We were doing lazy setup of "struct rev_info" before the parent
series, and as a result it introduces a new memory leak. We do a
malloc() for some diff.c code that revisions.c uses unconditionally,
which then don't use at all in some common cases.

The patch I've submitted on top just restores the previous state of the
initialization being lazy, but in a way that has to be adapted for other
code changes the series made.
Derrick Stolee March 28, 2022, 2:04 p.m. UTC | #9
On 3/25/2022 8:52 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 25 2022, Derrick Stolee wrote:
> 
>> On 3/25/2022 1:34 PM, Ævar Arnfjörð Bjarmason wrote:>>> 	-	list_objects_filter_copy(&revs.filter, &filter_options);
>>> 	+	/* attach our CLI --filter to rev_info's filter */
>>> 	+	memcpy(&revs.filter, filter, sizeof(*filter));
>>> 	+	memset(filter, 0, sizeof(*filter));
>>
>> Here, you are removing a deep copy with a shallow copy. After this,
>> freeing the arrays within revs.filter would cause a double-free when
>> freeing the arrays in the original filter_options.
> 
> Yes, and that's what we want, right? I.e. we don't want a copy, but to
> use the &filter for parse_options(), then once that's populated we
> shallow-copy that to "struct rev_info"'s "filter", and forget about our
> own copy (i.e. the memset there is redundant, but just a "let's not use
> this again) marker.
> 
> Of course this will leak now, but once merged with my
> release_revisions() patch will work, and we'll free what we allocated
> (once!).


>> Even if you fix the deep/shallow copy above, you still need to
>> clean up the filter in two places.
> 
> If you "fix" the shallow copying you need to free it twice, but if you
> don't you free it once.
> 
> I.e. this is conceptually the same as strbuf_detach() + strbuf_attach().
> 
> But maybe I'm missing something...
> 
> (If I am it's rather worrying that it passed all our tests, both in your
> series + merged with the release_revisions() series).

My problem is that you need to know that the filter data was
"detached" in a different scope than it is defined.

 * filter_options is defined in cmd_pack_objects()
 * The detach and reattach to revs is in get_object_list()

Your model requires internal information from get_object_list()
to know that you shouldn't release filter_options within
cmd_pack_objects(), which I think is a code smell. Better to
have something allocated in cmd_pack_objects() be freed in that
same method so it is visually detectable that we are freeing
correctly.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d39f668ad56..735080a4a95 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3857,6 +3857,23 @@  static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+struct po_filter_data {
+	unsigned have_revs:1;
+	struct rev_info revs;
+};
+
+static int po_filter_cb(const struct option *opt, const char *arg, int unset)
+{
+	struct po_filter_data *data = opt->value;
+	struct option wrap; /* don't initialize! */
+
+	repo_init_revisions(the_repository, &data->revs, NULL);
+	wrap.value = &data->revs.filter;
+	data->have_revs = 1;
+
+	return opt_parse_list_objects_filter(&wrap, arg, unset);
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3867,7 +3884,7 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int rev_list_index = 0;
 	int stdin_packs = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
-	struct rev_info revs;
+	struct po_filter_data pfd = { .have_revs = 0 };
 
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -3954,7 +3971,7 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			      &write_bitmap_index,
 			      N_("write a bitmap index if possible"),
 			      WRITE_BITMAP_QUIET, PARSE_OPT_HIDDEN),
-		OPT_PARSE_LIST_OBJECTS_FILTER(&revs.filter),
+		OPT_PARSE_LIST_OBJECTS_FILTER_CB(&pfd, po_filter_cb),
 		OPT_CALLBACK_F(0, "missing", NULL, N_("action"),
 		  N_("handling for missing objects"), PARSE_OPT_NONEG,
 		  option_parse_missing_action),
@@ -3973,8 +3990,6 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
-	repo_init_revisions(the_repository, &revs, NULL);
-
 	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", -1);
 	if (the_repository->gitdir) {
 		prepare_repo_settings(the_repository);
@@ -4076,7 +4091,7 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (revs.filter.choice) {
+	if (pfd.have_revs && pfd.revs.filter.choice) {
 		if (!pack_to_stdout)
 			die(_("cannot use --filter without --stdout"));
 		if (stdin_packs)
@@ -4152,7 +4167,12 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			add_unreachable_loose_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
+	} else if (pfd.have_revs) {
+		get_object_list(&pfd.revs, rp.nr, rp.v);
 	} else {
+		struct rev_info revs;
+
+		repo_init_revisions(the_repository, &revs, NULL);
 		get_object_list(&revs, rp.nr, rp.v);
 	}
 	cleanup_preferred_base();
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index 90e4bc96252..910f4a46e91 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -107,10 +107,12 @@  void parse_list_objects_filter(
 int opt_parse_list_objects_filter(const struct option *opt,
 				  const char *arg, int unset);
 
+#define OPT_PARSE_LIST_OBJECTS_FILTER_CB(fo, cb) \
+	OPT_CALLBACK(0, "filter", (fo), N_("args"), \
+		     N_("object filtering"), (cb))
+
 #define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
-	OPT_CALLBACK(0, "filter", fo, N_("args"), \
-	  N_("object filtering"), \
-	  opt_parse_list_objects_filter)
+	OPT_PARSE_LIST_OBJECTS_FILTER_CB((fo), opt_parse_list_objects_filter)
 
 /*
  * Translates abbreviated numbers in the filter's filter_spec into their