diff mbox series

[v2,12/13] name-rev: eliminate recursion in name_rev()

Message ID 20191112103821.30265-13-szeder.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series name-rev: eliminate recursion | expand

Commit Message

SZEDER Gábor Nov. 12, 2019, 10:38 a.m. UTC
The name_rev() function calls itself recursively for each interesting
parent of the commit it got as parameter, and, consequently, it can
segfault when processing a deep history if it exhausts the available
stack space.  E.g. running 'git name-rev --all' and 'git name-rev
HEAD~100000' in the gcc, gecko-dev, llvm, and WebKit repositories
results in segfaults on my machine ('ulimit -s' reports 8192kB of
stack size limit), and nowadays the former segfaults in the Linux repo
as well (it reached the necessasry depth sometime between v5.3-rc4 and
-rc5).

Eliminate the recursion by inserting the interesting parents into a
LIFO 'prio_queue' [1] and iterating until the queue becomes empty.

Note that the parent commits must be added in reverse order to the
LIFO 'prio_queue', so their relative order is preserved during
processing, i.e. the first parent should come out first from the
queue, because otherwise performance greatly suffers on mergy
histories [2].

The stacksize-limited test 'name-rev works in a deep repo' in
't6120-describe.sh' demonstrated this issue and expected failure.  Now
the recursion is gone, so flip it to expect success.  Also gone are
the dmesg entries logging the segfault of that segfaulting 'git
name-rev' process on every execution of the test suite.

Note that this slightly changes the order of lines in the output of
'git name-rev --all', usually swapping two lines every 35 lines in
git.git or every 150 lines in linux.git.  This shouldn't matter in
practice, because the output has always been unordered anyway.

This patch is best viewed with '--ignore-all-space'.

[1] Early versions of this patch used a 'commit_list', resulting in
    ~15% performance penalty for 'git name-rev --all' in 'linux.git',
    presumably because of the memory allocation and release for each
    insertion and removal. Using a LIFO 'prio_queue' has basically no
    effect on performance.

[2] We prefer shorter names, i.e. 'v0.1~234' is preferred over
    'v0.1^2~5', meaning that usually following the first parent of a
    merge results in the best name for its ancestors.  So when later
    we follow the remaining parent(s) of a merge, and reach an already
    named commit, then we usually find that we can't give that commit
    a better name, and thus we don't have to visit any of its
    ancestors again.

    OTOH, if we were to follow the Nth parent of the merge first, then
    the name of all its ancestors would include a corresponding '^N'.
    Those are not the best names for those commits, so when later we
    reach an already named commit following the first parent of that
    merge, then we would have to update the name of that commit and
    the names of all of its ancestors as well.  Consequently, we would
    have to visit many commits several times, resulting in a
    significant slowdown.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 builtin/name-rev.c  | 99 +++++++++++++++++++++++++++++----------------
 t/t6120-describe.sh |  2 +-
 2 files changed, 65 insertions(+), 36 deletions(-)

Comments

Jonathan Tan Nov. 27, 2019, 5:57 p.m. UTC | #1
> Note that this slightly changes the order of lines in the output of
> 'git name-rev --all', usually swapping two lines every 35 lines in
> git.git or every 150 lines in linux.git.  This shouldn't matter in
> practice, because the output has always been unordered anyway.

I didn't verify that the changing of order is fine, but other than that,
this patch looks great.

> This patch is best viewed with '--ignore-all-space'.

Thanks for the tip! I ended up unindenting the loop to see the changes
better, but I should have done this instead.

> -static void name_rev(struct commit *commit,
> +static void name_rev(struct commit *start_commit,
>  		const char *tip_name, timestamp_t taggerdate,
>  		int from_tag)

There are many changes from tip_name to name->tip_name in this function
that mean that tip_name is no longer used within this function. Should
this cleanup have been done in one of the earlier patches?

Apart from that, overall, this patch looks like a straightforward good
change. When we have a parent, instead of immediately calling name_rev()
recursively, we first add it to an array, and then (in reverse order)
add it to a priority queue which is actually used as a LIFO stack.
SZEDER Gábor Dec. 9, 2019, 12:22 p.m. UTC | #2
On Wed, Nov 27, 2019 at 09:57:51AM -0800, Jonathan Tan wrote:
> > Note that this slightly changes the order of lines in the output of
> > 'git name-rev --all', usually swapping two lines every 35 lines in
> > git.git or every 150 lines in linux.git.  This shouldn't matter in
> > practice, because the output has always been unordered anyway.
> 
> I didn't verify that the changing of order is fine, but other than that,
> this patch looks great.

FWIW, the sorted output is the same (well, it would clearly be a bug
if it wasn't):

  $ .../v2.24.0/bin/git name-rev --all |sort >orig.sorted
  $ git name-rev --all |sort >new.sorted
  $ diff -u orig.sorted new.sorted 
  $ echo $?
  0

I still don't understand where that slight change in the order of
lines comes from, though I didn't really tried to understand it, to be
honest...

> > This patch is best viewed with '--ignore-all-space'.
> 
> Thanks for the tip! I ended up unindenting the loop to see the changes
> better, but I should have done this instead.

At one point I was considering an additional noop preparatory commit
that would have added one more indentation level to name_rev()'s for
loop.  That patch would have an empty diff with '--ignore-all-spaces',
of course, and the diff of the patch eliminating the recursion would
look sensible by default.
Would that have been worth it?  Dunno.

> > -static void name_rev(struct commit *commit,
> > +static void name_rev(struct commit *start_commit,
> >  		const char *tip_name, timestamp_t taggerdate,
> >  		int from_tag)
> 
> There are many changes from tip_name to name->tip_name in this function
> that mean that tip_name is no longer used within this function. Should
> this cleanup have been done in one of the earlier patches?

I've added a new patch to do that.

> Apart from that, overall, this patch looks like a straightforward good
> change. When we have a parent, instead of immediately calling name_rev()
> recursively, we first add it to an array, and then (in reverse order)
> add it to a priority queue which is actually used as a LIFO stack.

Yeah, 'commit->parents' is a single linked list, so we can't iterate
over it backwards, hence that interim array.
diff mbox series

Patch

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index fc61d6fa71..a3b796eac4 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -6,6 +6,7 @@ 
 #include "tag.h"
 #include "refs.h"
 #include "parse-options.h"
+#include "prio-queue.h"
 #include "sha1-lookup.h"
 #include "commit-slab.h"
 
@@ -104,49 +105,77 @@  static struct rev_name *create_or_update_name(struct commit *commit,
 		return NULL;
 }
 
-static void name_rev(struct commit *commit,
+static void name_rev(struct commit *start_commit,
 		const char *tip_name, timestamp_t taggerdate,
 		int from_tag)
 {
-	struct rev_name *name = get_commit_rev_name(commit);
-	struct commit_list *parents;
-	int parent_number = 1;
-
-	for (parents = commit->parents;
-			parents;
-			parents = parents->next, parent_number++) {
-		struct commit *parent = parents->item;
-		const char *new_name;
-		int generation, distance;
-
-		parse_commit(parent);
-		if (parent->date < cutoff)
-			continue;
+	struct prio_queue queue;
+	struct commit *commit;
+	struct commit **parents_to_queue = NULL;
+	size_t parents_to_queue_nr, parents_to_queue_alloc = 0;
+
+	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
+	prio_queue_put(&queue, start_commit);
+
+	while ((commit = prio_queue_get(&queue))) {
+		struct rev_name *name = get_commit_rev_name(commit);
+		struct commit_list *parents;
+		int parent_number = 1;
+
+		parents_to_queue_nr = 0;
+
+		for (parents = commit->parents;
+				parents;
+				parents = parents->next, parent_number++) {
+			struct commit *parent = parents->item;
+			const char *new_name;
+			int generation, distance;
+
+			parse_commit(parent);
+			if (parent->date < cutoff)
+				continue;
 
-		if (parent_number > 1) {
-			size_t len;
+			if (parent_number > 1) {
+				size_t len;
+
+				strip_suffix(name->tip_name, "^0", &len);
+				if (name->generation > 0)
+					new_name = xstrfmt("%.*s~%d^%d",
+							   (int)len,
+							   name->tip_name,
+							   name->generation,
+							   parent_number);
+				else
+					new_name = xstrfmt("%.*s^%d", (int)len,
+							   name->tip_name,
+							   parent_number);
+				generation = 0;
+				distance = name->distance + MERGE_TRAVERSAL_WEIGHT;
+			} else {
+				new_name = name->tip_name;
+				generation = name->generation + 1;
+				distance = name->distance + 1;
+			}
 
-			strip_suffix(tip_name, "^0", &len);
-			if (name->generation > 0)
-				new_name = xstrfmt("%.*s~%d^%d", (int)len, tip_name,
-						   name->generation,
-						   parent_number);
-			else
-				new_name = xstrfmt("%.*s^%d", (int)len, tip_name,
-						   parent_number);
-			generation = 0;
-			distance = name->distance + MERGE_TRAVERSAL_WEIGHT;
-		} else {
-			new_name = tip_name;
-			generation = name->generation + 1;
-			distance = name->distance + 1;
+			if (create_or_update_name(parent, new_name, taggerdate,
+						  generation, distance,
+						  from_tag)) {
+				ALLOC_GROW(parents_to_queue,
+					   parents_to_queue_nr + 1,
+					   parents_to_queue_alloc);
+				parents_to_queue[parents_to_queue_nr] = parent;
+				parents_to_queue_nr++;
+			}
 		}
 
-		if (create_or_update_name(parent, new_name, taggerdate,
-					  generation, distance,
-					  from_tag))
-			name_rev(parent, new_name, taggerdate, from_tag);
+		/* The first parent must come out first from the prio_queue */
+		while (parents_to_queue_nr)
+			prio_queue_put(&queue,
+				       parents_to_queue[--parents_to_queue_nr]);
 	}
+
+	clear_prio_queue(&queue);
+	free(parents_to_queue);
 }
 
 static int subpath_matches(const char *path, const char *filter)
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 0d119e9652..09c50f3f04 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -381,7 +381,7 @@  test_expect_success 'describe tag object' '
 	test_i18ngrep "fatal: test-blob-1 is neither a commit nor blob" actual
 '
 
-test_expect_failure ULIMIT_STACK_SIZE 'name-rev works in a deep repo' '
+test_expect_success ULIMIT_STACK_SIZE 'name-rev works in a deep repo' '
 	i=1 &&
 	while test $i -lt 8000
 	do