diff mbox series

[1/3] line-log: free diff queue when processing non-merge commits

Message ID 20221102220142.574890-2-szeder.dev@gmail.com (mailing list archive)
State Accepted
Commit 04ae00062d2ec301828e8df9931817be4b2f8653
Headers show
Series line-log: plug some memory leaks | expand

Commit Message

SZEDER Gábor Nov. 2, 2022, 10:01 p.m. UTC
When processing a non-merge commit, the line-level log first asks the
tree-diff machinery whether any of the files in the given line ranges
were modified between the current commit and its parent, and if some
of them were, then it loads the contents of those files from both
commits to see whether their line ranges were modified and/or need to
be adjusted.  Alas, it doesn't free() the diff queue holding the
results of that query and the contents of those files once its done.
This can add up to a substantial amount of leaked memory, especially
when the file in question is big and is frequently modified: a user
reported "Out of memory, malloc failed" errors with a 2MB text file
that was modified ~2800 times [1] (I estimate the leak would use up
almost 11GB memory in that case).

Free that diff queue to plug this memory leak.  However, instead of
simply open-coding the necessary three lines, add them as a helper
function to the diff API, because it will be useful elsewhere as well.

[1] https://public-inbox.org/git/CAFOPqVXz2XwzX8vGU7wLuqb2ZuwTuOFAzBLRM_QPk+NJa=eC-g@mail.gmail.com/

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 diff.c     | 7 +++++++
 diffcore.h | 1 +
 line-log.c | 1 +
 3 files changed, 9 insertions(+)

Comments

Taylor Blau Nov. 3, 2022, 12:20 a.m. UTC | #1
On Wed, Nov 02, 2022 at 11:01:40PM +0100, SZEDER Gábor wrote:
> When processing a non-merge commit, the line-level log first asks the
> tree-diff machinery whether any of the files in the given line ranges
> were modified between the current commit and its parent, and if some
> of them were, then it loads the contents of those files from both
> commits to see whether their line ranges were modified and/or need to
> be adjusted.  Alas, it doesn't free() the diff queue holding the
> results of that query and the contents of those files once its done.
> This can add up to a substantial amount of leaked memory, especially
> when the file in question is big and is frequently modified: a user
> reported "Out of memory, malloc failed" errors with a 2MB text file
> that was modified ~2800 times [1] (I estimate the leak would use up
> almost 11GB memory in that case).
>
> Free that diff queue to plug this memory leak.  However, instead of
> simply open-coding the necessary three lines, add them as a helper
> function to the diff API, because it will be useful elsewhere as well.

Nicely explained.

> ---
>  diff.c     | 7 +++++++
>  diffcore.h | 1 +
>  line-log.c | 1 +
>  3 files changed, 9 insertions(+)

And all looks reasonable here, good...

> diff --git a/diff.c b/diff.c
> index 35e46dd968..ef94175163 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -5773,6 +5773,13 @@ void diff_free_filepair(struct diff_filepair *p)
>  	free(p);
>  }
>
> +void diff_free_queue(struct diff_queue_struct *q)
> +{
> +	for (int i = 0; i < q->nr; i++)
> +		diff_free_filepair(q->queue[i]);
> +	free(q->queue);
> +}

Though I wonder, should diff_free_queue() be a noop when q is NULL? The
caller in process_ranges_ordinary_commit() doesn't care, of course,
since q is always non-NULL there.

But if we're making it part of the diff API, we should probably err on
the side of flexibility.

Thanks,
Taylor
SZEDER Gábor Nov. 7, 2022, 3:11 p.m. UTC | #2
On Wed, Nov 02, 2022 at 08:20:21PM -0400, Taylor Blau wrote:
> > +void diff_free_queue(struct diff_queue_struct *q)
> > +{
> > +	for (int i = 0; i < q->nr; i++)
> > +		diff_free_filepair(q->queue[i]);
> > +	free(q->queue);
> > +}
> 
> Though I wonder, should diff_free_queue() be a noop when q is NULL? The
> caller in process_ranges_ordinary_commit() doesn't care, of course,
> since q is always non-NULL there.
> 
> But if we're making it part of the diff API, we should probably err on
> the side of flexibility.

On one hand, strbuf_reset(), string_list_clear(), or strvec_clear()
would all segfault on a NULL strbuf, string_list, or strvec pointer.

On the other hand, given the usage patterns of the diff API, and that
it mostly only works on the dreaded global 'diff_queued_diff'
instance, I don't think there is any flexibility to be gained with
this; indeed it is already more flexible than many diff API functions
as it works on the diff queue given as parameter instead of that
global instance.
Ævar Arnfjörð Bjarmason Nov. 7, 2022, 3:29 p.m. UTC | #3
On Mon, Nov 07 2022, SZEDER Gábor wrote:

> On Wed, Nov 02, 2022 at 08:20:21PM -0400, Taylor Blau wrote:
>> > +void diff_free_queue(struct diff_queue_struct *q)
>> > +{
>> > +	for (int i = 0; i < q->nr; i++)
>> > +		diff_free_filepair(q->queue[i]);
>> > +	free(q->queue);
>> > +}
>> 
>> Though I wonder, should diff_free_queue() be a noop when q is NULL? The
>> caller in process_ranges_ordinary_commit() doesn't care, of course,
>> since q is always non-NULL there.
>> 
>> But if we're making it part of the diff API, we should probably err on
>> the side of flexibility.
>
> On one hand, strbuf_reset(), string_list_clear(), or strvec_clear()
> would all segfault on a NULL strbuf, string_list, or strvec pointer.

But the reason we do that is because those APIs will always ensure that
the struct is never in an inconsistent state, as opposed to the
destructor you're adding here.

I.e. if you were to work with the queue after this diff_free_queue()
call in process_ranges_ordinary_commit() you'd segfault, not so with
those other APIs.

> On the other hand, given the usage patterns of the diff API, and that
> it mostly only works on the dreaded global 'diff_queued_diff'
> instance, I don't think there is any flexibility to be gained with
> this; indeed it is already more flexible than many diff API functions
> as it works on the diff queue given as parameter instead of that
> global instance.

I pointed how this could be nicer if you made it work like those other
APIs in
https://lore.kernel.org/git/221103.864jvg2yit.gmgdl@evledraar.gmail.com/;
I.e. we could do away with DIFF_QUEUE_CLEAR() after calling this
"free()".

But in lieu of such a larger change, just adding a call to
"DIFF_QUEUE_CLEAR()" in this new free() function seems like it could
make thing safer at very little cost.

We're also far from consistent about this, but I wish it worked like
that and were called:

	diff_queue_struct_{release,clear}()

I.e. the usual naming is:

	<struct name>_{release,clear}()

In cases where we don't free() the pointer itself, but assume that we're
working on a struct on the stack, whereas *_free() functions will free
the malloc'd pointer itself, as well as anything it contains.
SZEDER Gábor Nov. 7, 2022, 3:57 p.m. UTC | #4
On Mon, Nov 07, 2022 at 04:29:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Nov 07 2022, SZEDER Gábor wrote:
> 
> > On Wed, Nov 02, 2022 at 08:20:21PM -0400, Taylor Blau wrote:
> >> > +void diff_free_queue(struct diff_queue_struct *q)
> >> > +{
> >> > +	for (int i = 0; i < q->nr; i++)
> >> > +		diff_free_filepair(q->queue[i]);
> >> > +	free(q->queue);
> >> > +}
> >> 
> >> Though I wonder, should diff_free_queue() be a noop when q is NULL? The
> >> caller in process_ranges_ordinary_commit() doesn't care, of course,
> >> since q is always non-NULL there.
> >> 
> >> But if we're making it part of the diff API, we should probably err on
> >> the side of flexibility.
> >
> > On one hand, strbuf_reset(), string_list_clear(), or strvec_clear()
> > would all segfault on a NULL strbuf, string_list, or strvec pointer.
> 
> But the reason we do that is because those APIs will always ensure that
> the struct is never in an inconsistent state, as opposed to the
> destructor you're adding here.

Taylor's suggestion quoted above is not about the internal state of
the diff queue, but about a NULL pointer passed to diff_free_queue().

> I.e. if you were to work with the queue after this diff_free_queue()
> call in process_ranges_ordinary_commit() you'd segfault, not so with
> those other APIs.
> 
> > On the other hand, given the usage patterns of the diff API, and that
> > it mostly only works on the dreaded global 'diff_queued_diff'
> > instance, I don't think there is any flexibility to be gained with
> > this; indeed it is already more flexible than many diff API functions
> > as it works on the diff queue given as parameter instead of that
> > global instance.
> 
> I pointed how this could be nicer if you made it work like those other
> APIs in
> https://lore.kernel.org/git/221103.864jvg2yit.gmgdl@evledraar.gmail.com/;
> I.e. we could do away with DIFF_QUEUE_CLEAR() after calling this
> "free()".
> 
> But in lieu of such a larger change, just adding a call to
> "DIFF_QUEUE_CLEAR()" in this new free() function seems like it could
> make thing safer at very little cost.
> 
> We're also far from consistent about this, but I wish it worked like
> that and were called:
> 
> 	diff_queue_struct_{release,clear}()
> 
> I.e. the usual naming is:
> 
> 	<struct name>_{release,clear}()
> 
> In cases where we don't free() the pointer itself, but assume that we're
> working on a struct on the stack, whereas *_free() functions will free
> the malloc'd pointer itself, as well as anything it contains.
>
Taylor Blau Nov. 8, 2022, 2:14 a.m. UTC | #5
On Mon, Nov 07, 2022 at 04:57:21PM +0100, SZEDER Gábor wrote:
> On Mon, Nov 07, 2022 at 04:29:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Mon, Nov 07 2022, SZEDER Gábor wrote:
> >
> > > On Wed, Nov 02, 2022 at 08:20:21PM -0400, Taylor Blau wrote:
> > >> > +void diff_free_queue(struct diff_queue_struct *q)
> > >> > +{
> > >> > +	for (int i = 0; i < q->nr; i++)
> > >> > +		diff_free_filepair(q->queue[i]);
> > >> > +	free(q->queue);
> > >> > +}
> > >>
> > >> Though I wonder, should diff_free_queue() be a noop when q is NULL? The
> > >> caller in process_ranges_ordinary_commit() doesn't care, of course,
> > >> since q is always non-NULL there.
> > >>
> > >> But if we're making it part of the diff API, we should probably err on
> > >> the side of flexibility.
> > >
> > > On one hand, strbuf_reset(), string_list_clear(), or strvec_clear()
> > > would all segfault on a NULL strbuf, string_list, or strvec pointer.
> >
> > But the reason we do that is because those APIs will always ensure that
> > the struct is never in an inconsistent state, as opposed to the
> > destructor you're adding here.
>
> Taylor's suggestion quoted above is not about the internal state of
> the diff queue, but about a NULL pointer passed to diff_free_queue().

I think your perspective that strbuf_reset(), string_list_clear(), etc.
all segfault on a NULL argument is extremely reasonable. Let's start
merging this down.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index 35e46dd968..ef94175163 100644
--- a/diff.c
+++ b/diff.c
@@ -5773,6 +5773,13 @@  void diff_free_filepair(struct diff_filepair *p)
 	free(p);
 }
 
+void diff_free_queue(struct diff_queue_struct *q)
+{
+	for (int i = 0; i < q->nr; i++)
+		diff_free_filepair(q->queue[i]);
+	free(q->queue);
+}
+
 const char *diff_aligned_abbrev(const struct object_id *oid, int len)
 {
 	int abblen;
diff --git a/diffcore.h b/diffcore.h
index badc2261c2..9b588a1ee1 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -162,6 +162,7 @@  struct diff_filepair *diff_queue(struct diff_queue_struct *,
 				 struct diff_filespec *,
 				 struct diff_filespec *);
 void diff_q(struct diff_queue_struct *, struct diff_filepair *);
+void diff_free_queue(struct diff_queue_struct *q);
 
 /* dir_rename_relevance: the reason we want rename information for a dir */
 enum dir_rename_relevance {
diff --git a/line-log.c b/line-log.c
index 51d93310a4..7a74daf2e8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1195,6 +1195,7 @@  static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *c
 	if (parent)
 		add_line_range(rev, parent, parent_range);
 	free_line_log_data(parent_range);
+	diff_free_queue(&queue);
 	return changed;
 }