diff mbox series

commit: avoid parent list buildup in clear_commit_marks_many()

Message ID 16a7b572-0a3d-4707-9034-0dac69ea99ac@web.de (mailing list archive)
State New
Headers show
Series commit: avoid parent list buildup in clear_commit_marks_many() | expand

Commit Message

René Scharfe Feb. 9, 2025, 10:41 a.m. UTC
clear_commit_marks_1() clears the marks of the first parent and its
first parent and so on, and saves the higher numbered parents in a list
for later.  There is no benefit in keeping that list growing with each
handled commit.  Clear it after each run to reduce peak memory usage.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 commit.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--
2.48.1

Comments

Patrick Steinhardt Feb. 12, 2025, 7:05 a.m. UTC | #1
On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote:
> clear_commit_marks_1() clears the marks of the first parent and its
> first parent and so on, and saves the higher numbered parents in a list
> for later.  There is no benefit in keeping that list growing with each
> handled commit.  Clear it after each run to reduce peak memory usage.

Okay. So the issue here is that `clear_commit_marks_1()` only processes
the first-parent chain of commits, and any other commits will be added
to the `struct commit_list` backlog. Consequently, merge-heavy histories
are very likely to build up quite a backlog of non-first-parent commits.

> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
>  commit.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/commit.c b/commit.c
> index 540660359d..6efdb03997 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist,
> 
>  void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark)
>  {
> -	struct commit_list *list = NULL;
> -
>  	for (size_t i = 0; i < nr; i++) {
> +		struct commit_list *list = NULL;
> +
>  		clear_commit_marks_1(&list, *commit, mark);
> +		while (list)
> +			clear_commit_marks_1(&list, pop_commit(&list), mark);
>  		commit++;
>  	}
> -	while (list)
> -		clear_commit_marks_1(&list, pop_commit(&list), mark);
>  }

And this is fixed by immediately processing all commits that we
currently have in the list. As `clear_commit_marks_1()` only processes
immediate children of the handed-in commit we know that it will have
processed the first parent, and `list` will contain remaining commits,
if any.

We also end up adding grandchildren to the list, so this change
essentially converts the algorithm from depth-first to breadth-first. I
bet we can construct cases where this will perform _worse_ than the
current algorithm, e.g. when you have branch thickets where every commit
is a merge: But I assume that for the most common cases this might be an
improvement indeed.

The question to me is: does this actually matter in the real world? It
would be nice to maybe get some numbers that demonstrate the improvement
in a repository.

Patrick
René Scharfe Feb. 13, 2025, 9:38 p.m. UTC | #2
Am 12.02.25 um 08:05 schrieb Patrick Steinhardt:
> On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote:
>> clear_commit_marks_1() clears the marks of the first parent and its
>> first parent and so on, and saves the higher numbered parents in a list
>> for later.  There is no benefit in keeping that list growing with each
>> handled commit.  Clear it after each run to reduce peak memory usage.
>
> Okay. So the issue here is that `clear_commit_marks_1()` only processes
> the first-parent chain of commits, and any other commits will be added
> to the `struct commit_list` backlog. Consequently, merge-heavy histories
> are very likely to build up quite a backlog of non-first-parent commits.
>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>> ---
>>  commit.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/commit.c b/commit.c
>> index 540660359d..6efdb03997 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist,
>>
>>  void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark)
>>  {
>> -	struct commit_list *list = NULL;
>> -
>>  	for (size_t i = 0; i < nr; i++) {
>> +		struct commit_list *list = NULL;
>> +
>>  		clear_commit_marks_1(&list, *commit, mark);
>> +		while (list)
>> +			clear_commit_marks_1(&list, pop_commit(&list), mark);
>>  		commit++;
>>  	}
>> -	while (list)
>> -		clear_commit_marks_1(&list, pop_commit(&list), mark);
>>  }
>
> And this is fixed by immediately processing all commits that we
> currently have in the list. As `clear_commit_marks_1()` only processes
> immediate children of the handed-in commit we know that it will have
> processed the first parent, and `list` will contain remaining commits,
> if any.

clear_commit_marks_1() processes the whole ancestral chain of first
parents down to the root or first clean ancestor.

> We also end up adding grandchildren to the list, so this change
> essentially converts the algorithm from depth-first to breadth-first.

It's still depth-first, but all second and higher numbered parents added
to the list are cleaned before starting to clean the next commit from
the input list.  So we go from "clean straight down for each input
commit first and clean their side branches later" to "clean all
ancestors for each input commit".

> I bet we can construct cases where this will perform _worse_ than the
> current algorithm, e.g. when you have branch thickets where every commit
> is a merge: But I assume that for the most common cases this might be an
> improvement indeed.

I won't bet, but I'd like to see such a case.  Can't imagine one.

> The question to me is: does this actually matter in the real world? It
> would be nice to maybe get some numbers that demonstrate the improvement
> in a repository.

Well, the maximum list length for clear_commit_marks_many() calls with
nr > 1 in the test suite goes from 12 in t6600 to 4 with the patch.  Not
that exciting.  The question to me is: Why pile up parents in the list
when we can clean them earlier with no downside?  Or is there any?

René
Patrick Steinhardt Feb. 17, 2025, 6:33 a.m. UTC | #3
On Thu, Feb 13, 2025 at 10:38:51PM +0100, René Scharfe wrote:
> Am 12.02.25 um 08:05 schrieb Patrick Steinhardt:
> > And this is fixed by immediately processing all commits that we
> > currently have in the list. As `clear_commit_marks_1()` only processes
> > immediate children of the handed-in commit we know that it will have
> > processed the first parent, and `list` will contain remaining commits,
> > if any.
> 
> clear_commit_marks_1() processes the whole ancestral chain of first
> parents down to the root or first clean ancestor.
> 
> > We also end up adding grandchildren to the list, so this change
> > essentially converts the algorithm from depth-first to breadth-first.
> 
> It's still depth-first, but all second and higher numbered parents added
> to the list are cleaned before starting to clean the next commit from
> the input list.  So we go from "clean straight down for each input
> commit first and clean their side branches later" to "clean all
> ancestors for each input commit".

Ah, fair.

> > I bet we can construct cases where this will perform _worse_ than the
> > current algorithm, e.g. when you have branch thickets where every commit
> > is a merge: But I assume that for the most common cases this might be an
> > improvement indeed.
> 
> I won't bet, but I'd like to see such a case.  Can't imagine one.
> 
> > The question to me is: does this actually matter in the real world? It
> > would be nice to maybe get some numbers that demonstrate the improvement
> > in a repository.
> 
> Well, the maximum list length for clear_commit_marks_many() calls with
> nr > 1 in the test suite goes from 12 in t6600 to 4 with the patch.  Not
> that exciting.  The question to me is: Why pile up parents in the list
> when we can clean them earlier with no downside?  Or is there any?

If it really is without downsides then yes, it's a nice improvement. I
was mostly wondering whether you're fixing something where the old way
of doing things performs _significantly_ worse and where the change
could lead to a user-visible improvement. Like, requiring us to store
orders of magnitudes less commits at the same time.

Patrick
diff mbox series

Patch

diff --git a/commit.c b/commit.c
index 540660359d..6efdb03997 100644
--- a/commit.c
+++ b/commit.c
@@ -780,14 +780,14 @@  static void clear_commit_marks_1(struct commit_list **plist,

 void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark)
 {
-	struct commit_list *list = NULL;
-
 	for (size_t i = 0; i < nr; i++) {
+		struct commit_list *list = NULL;
+
 		clear_commit_marks_1(&list, *commit, mark);
+		while (list)
+			clear_commit_marks_1(&list, pop_commit(&list), mark);
 		commit++;
 	}
-	while (list)
-		clear_commit_marks_1(&list, pop_commit(&list), mark);
 }

 void clear_commit_marks(struct commit *commit, unsigned int mark)