Message ID | 16a7b572-0a3d-4707-9034-0dac69ea99ac@web.de (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | commit: avoid parent list buildup in clear_commit_marks_many() | expand |
On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote: > clear_commit_marks_1() clears the marks of the first parent and its > first parent and so on, and saves the higher numbered parents in a list > for later. There is no benefit in keeping that list growing with each > handled commit. Clear it after each run to reduce peak memory usage. Okay. So the issue here is that `clear_commit_marks_1()` only processes the first-parent chain of commits, and any other commits will be added to the `struct commit_list` backlog. Consequently, merge-heavy histories are very likely to build up quite a backlog of non-first-parent commits. > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > commit.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/commit.c b/commit.c > index 540660359d..6efdb03997 100644 > --- a/commit.c > +++ b/commit.c > @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist, > > void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark) > { > - struct commit_list *list = NULL; > - > for (size_t i = 0; i < nr; i++) { > + struct commit_list *list = NULL; > + > clear_commit_marks_1(&list, *commit, mark); > + while (list) > + clear_commit_marks_1(&list, pop_commit(&list), mark); > commit++; > } > - while (list) > - clear_commit_marks_1(&list, pop_commit(&list), mark); > } And this is fixed by immediately processing all commits that we currently have in the list. As `clear_commit_marks_1()` only processes immediate children of the handed-in commit we know that it will have processed the first parent, and `list` will contain remaining commits, if any. We also end up adding grandchildren to the list, so this change essentially converts the algorithm from depth-first to breadth-first. I bet we can construct cases where this will perform _worse_ than the current algorithm, e.g. when you have branch thickets where every commit is a merge: But I assume that for the most common cases this might be an improvement indeed. The question to me is: does this actually matter in the real world? It would be nice to maybe get some numbers that demonstrate the improvement in a repository. Patrick
Am 12.02.25 um 08:05 schrieb Patrick Steinhardt: > On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote: >> clear_commit_marks_1() clears the marks of the first parent and its >> first parent and so on, and saves the higher numbered parents in a list >> for later. There is no benefit in keeping that list growing with each >> handled commit. Clear it after each run to reduce peak memory usage. > > Okay. So the issue here is that `clear_commit_marks_1()` only processes > the first-parent chain of commits, and any other commits will be added > to the `struct commit_list` backlog. Consequently, merge-heavy histories > are very likely to build up quite a backlog of non-first-parent commits. > >> Signed-off-by: René Scharfe <l.s.r@web.de> >> --- >> commit.c | 8 ++++---- >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/commit.c b/commit.c >> index 540660359d..6efdb03997 100644 >> --- a/commit.c >> +++ b/commit.c >> @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist, >> >> void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark) >> { >> - struct commit_list *list = NULL; >> - >> for (size_t i = 0; i < nr; i++) { >> + struct commit_list *list = NULL; >> + >> clear_commit_marks_1(&list, *commit, mark); >> + while (list) >> + clear_commit_marks_1(&list, pop_commit(&list), mark); >> commit++; >> } >> - while (list) >> - clear_commit_marks_1(&list, pop_commit(&list), mark); >> } > > And this is fixed by immediately processing all commits that we > currently have in the list. As `clear_commit_marks_1()` only processes > immediate children of the handed-in commit we know that it will have > processed the first parent, and `list` will contain remaining commits, > if any. clear_commit_marks_1() processes the whole ancestral chain of first parents down to the root or first clean ancestor. > We also end up adding grandchildren to the list, so this change > essentially converts the algorithm from depth-first to breadth-first. It's still depth-first, but all second and higher numbered parents added to the list are cleaned before starting to clean the next commit from the input list. So we go from "clean straight down for each input commit first and clean their side branches later" to "clean all ancestors for each input commit". > I bet we can construct cases where this will perform _worse_ than the > current algorithm, e.g. when you have branch thickets where every commit > is a merge: But I assume that for the most common cases this might be an > improvement indeed. I won't bet, but I'd like to see such a case. Can't imagine one. > The question to me is: does this actually matter in the real world? It > would be nice to maybe get some numbers that demonstrate the improvement > in a repository. Well, the maximum list length for clear_commit_marks_many() calls with nr > 1 in the test suite goes from 12 in t6600 to 4 with the patch. Not that exciting. The question to me is: Why pile up parents in the list when we can clean them earlier with no downside? Or is there any? René
On Thu, Feb 13, 2025 at 10:38:51PM +0100, René Scharfe wrote: > Am 12.02.25 um 08:05 schrieb Patrick Steinhardt: > > And this is fixed by immediately processing all commits that we > > currently have in the list. As `clear_commit_marks_1()` only processes > > immediate children of the handed-in commit we know that it will have > > processed the first parent, and `list` will contain remaining commits, > > if any. > > clear_commit_marks_1() processes the whole ancestral chain of first > parents down to the root or first clean ancestor. > > > We also end up adding grandchildren to the list, so this change > > essentially converts the algorithm from depth-first to breadth-first. > > It's still depth-first, but all second and higher numbered parents added > to the list are cleaned before starting to clean the next commit from > the input list. So we go from "clean straight down for each input > commit first and clean their side branches later" to "clean all > ancestors for each input commit". Ah, fair. > > I bet we can construct cases where this will perform _worse_ than the > > current algorithm, e.g. when you have branch thickets where every commit > > is a merge: But I assume that for the most common cases this might be an > > improvement indeed. > > I won't bet, but I'd like to see such a case. Can't imagine one. > > > The question to me is: does this actually matter in the real world? It > > would be nice to maybe get some numbers that demonstrate the improvement > > in a repository. > > Well, the maximum list length for clear_commit_marks_many() calls with > nr > 1 in the test suite goes from 12 in t6600 to 4 with the patch. Not > that exciting. The question to me is: Why pile up parents in the list > when we can clean them earlier with no downside? Or is there any? If it really is without downsides then yes, it's a nice improvement. I was mostly wondering whether you're fixing something where the old way of doing things performs _significantly_ worse and where the change could lead to a user-visible improvement. Like, requiring us to store orders of magnitudes less commits at the same time. Patrick
diff --git a/commit.c b/commit.c index 540660359d..6efdb03997 100644 --- a/commit.c +++ b/commit.c @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist, void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark) { - struct commit_list *list = NULL; - for (size_t i = 0; i < nr; i++) { + struct commit_list *list = NULL; + clear_commit_marks_1(&list, *commit, mark); + while (list) + clear_commit_marks_1(&list, pop_commit(&list), mark); commit++; } - while (list) - clear_commit_marks_1(&list, pop_commit(&list), mark); } void clear_commit_marks(struct commit *commit, unsigned int mark)
clear_commit_marks_1() clears the marks of the first parent and its first parent and so on, and saves the higher numbered parents in a list for later. There is no benefit in keeping that list growing with each handled commit. Clear it after each run to reduce peak memory usage. Signed-off-by: René Scharfe <l.s.r@web.de> --- commit.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) -- 2.48.1