mbox series

[v4,0/9] ref-filter: ahead/behind counting, faster --merged option

Message ID pull.1489.v4.git.1679311615.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series ref-filter: ahead/behind counting, faster --merged option | expand

Message

Jean-Noël Avila via GitGitGadget March 20, 2023, 11:26 a.m. UTC
At $DAYJOB, we have used a custom 'ahead-behind' builtin in our fork of Git
for lots of reasons. The main goal of the builtin is to compare multiple
references against a common base reference. The comparison is number of
commits that are in each side of the symmtric difference of their reachable
sets. A commit C is "ahead" of a commit B by the number of commits in B..C
(reachable from C but not reachable from B). Similarly, the commit C is
"behind" the commit B by the number of commits in C..B (reachable from B but
not reachable from C).

These numbers can be computed by 'git rev-list --count B..C' and 'git
rev-list --count C..B', but there are common needs that benefit from having
the checks being done in the same process:

 1. Our "branches" page lists ahead/behind counts for each listed branch as
    compared to the repo's default branch. This can be done with a single
    'git ahead-behind' process.
 2. When a branch is updated, a background job checks if any pull requests
    that target that branch should be closed because their branches were
    merged implicitly by that update. These queries can be batched into 'git
    ahead-behind' calls.

In that second example, we don't need the full ahead/behind counts (although
it is sufficient to look for branches that are "zero commits ahead", meaning
they are reachable from the base), and instead reachability is the critical
piece.

This series contributes the custom algorithms we used for our 'git
ahead-behind' builtin, but as extensions to 'git for-each-ref':

 * Add a new "%(ahead-behind:<base>)" format token to for-each-ref which
   allows outputting the ahead/behind values in the format string for a
   matching ref.
 * Add a new algorithm that speeds up the 'git for-each-ref --merged='
   option. This also applies to the 'git branch --merged=' option.

The idea to use 'git for-each-ref' instead of creating a new builtin is from
Junio, and simplifies this series significantly compared to v1. I was
initially concerned about the overhead of 'git for-each-ref' and its
generality and sorting, but I was not able to measure any important
difference between this implementation and our internal 'git ahead-behind'
implementation. In particular, when a pattern is given to 'git for-each-ref'
that looks like an exact ref, it navigates directly to the ref instead of
scanning all references for matches.

However, for our specific uses, we like to batch a list of exact references
that could be very long. We introduce a new --stdin option here.

To keep things close to the v1 outline, I replaced the existing patches with
closely-related ones, when possible.

Patch 1 adds the --stdin option to 'git for-each-ref'. (This is similar to
the boilerplate patch from v1.)

Patch 2 adds a test to explicitly check that 'git for-each-ref' will still
succeed when all input refs are missing. (This is similar to the
--ignore-missing patch from v1.)

Patches 3-6 introduce a new method: ensure_generations_valid(). Patch 3
refactors the existing compute_topological_levels() to make it more generic
while Patch 4 ports compute_generation_numbers() onto that generic base.
Patch 5 updates the definition of commit_graph_generation() slightly, making
way for patch 6 to implement the in-memory computation of generation
numbers. With an existing commit-graph file, the commits that are not
present in the file are considered as having generation number "infinity".
This is useful for most of our reachability queries to this point, since
those commits are "above" the ones tracked by the commit-graph. When these
commits are low in number, then there is very little performance cost and
zero correctness cost. (These patches match v1 exactly.)

However, we will see that the ahead/behind computation requires accurate
generation numbers to avoid overcounting. Thus, ensure_generations_valid()
isa way to specify a list of commits that need generation numbers computed
before continuing. It's a no-op if all of those commits are in the
commit-graph file. It's expensive if the commit-graph doesn't exist.
However, '%(ahead-behind:)' computations are likely to be slow no matter
what without a commit-graph, so assuming an existing commit-graph file is
reasonable. If we find sufficient desire to have an implementation that does
not have this requirement, we could create a second implementation and
toggle to it when generation_numbers_enabled() returns false.

Patch 7 implements the ahead-behind algorithm, but it is not connected to a
builtin. It's a long commit message, so hopefully it explains the algorithm
sufficiently. (The difference from v1 is that it no longer integrates with a
builtin and there are no new tests. It also uses 'unsigned int' and is
correctly co-authored by Taylor.)

Patch 8 integrates the ahead-behind algorithm with the ref-filter code,
including parsing the "ahead-behind" token. This finally adds tests that
check both ahead_behind() and ensure_generations_valid() via
t6600-test-reach.sh. (This patch is essentially completely new in v2.)

Patch 9 implements the tips_reachable_from_base() method, and uses it within
the ref-filter code to speed up 'git for-each-ref --merged' and 'git branch
--merged'. (The interface is slightly different than v1, due to the needs of
the new caller.)


Updates in v4
=============

 * A string leak in Patch 1 is remedied.
 * v3's patch 3 is split into v4's patches 3 and 4. Patch 3 now only
   refactors compute_topological_levels(), leaving Patch 4 to only port
   compute_generation_numbers() to the new
   compute_reachable_generation_numbers().
 * Other edits in Patch 3:
   * compute_reachable_generation_numbers() drops the "_1" in its name.
   * Redundant progress counters are removed.
   * The progress struct is passed down into 'struct
     compute_generation_info' and not just the 'struct
     write_commit_graph_context'
 * Patch 6 adds context around why we care about in-memory generation
   number.
 * Patch 7 has several changes:
   * Includes extra context around the input data to ahead_behind() as well
     as some rewrite of the algorithm description.
   * uses DIV_ROUND_UP() to initialize the bitmap width.
   * renames init_bit_array() to get_bit_array() to make it clear that it
     will not regenerate the bitmap if it already exists.
   * adds a new comment before assigning the STALE bit
 * Patch 8 changes the expected format to output <committish> and checks the
   given base during the token parsing instead of delaying.


Updates in v3
=============

 * The APIs are modified to take a 'struct repository *' and use them
   appropriately.
 * The --stdin option in 'git for-each-ref' now uses strvec instead of an
   ad-hoc array.

Thanks, -Stolee

Derrick Stolee (8):
  for-each-ref: add --stdin option
  for-each-ref: explicitly test no matches
  commit-graph: refactor compute_topological_levels()
  commit-graph: simplify compute_generation_numbers()
  commit-graph: return generation from memory
  commit-reach: implement ahead_behind() logic
  for-each-ref: add ahead-behind format atom
  commit-reach: add tips_reachable_from_bases()

Taylor Blau (1):
  commit-graph: introduce `ensure_generations_valid()`

 Documentation/git-for-each-ref.txt |  12 +-
 builtin/branch.c                   |   1 +
 builtin/for-each-ref.c             |  26 +++-
 builtin/tag.c                      |   1 +
 commit-graph.c                     | 207 +++++++++++++++++----------
 commit-graph.h                     |   8 ++
 commit-reach.c                     | 216 +++++++++++++++++++++++++++++
 commit-reach.h                     |  40 ++++++
 ref-filter.c                       |  93 ++++++++++---
 ref-filter.h                       |  26 +++-
 t/perf/p1500-graph-walks.sh        |  50 +++++++
 t/t3203-branch-output.sh           |  14 ++
 t/t5318-commit-graph.sh            |   2 +-
 t/t6300-for-each-ref.sh            |  50 +++++++
 t/t6301-for-each-ref-errors.sh     |  14 ++
 t/t6600-test-reach.sh              | 169 ++++++++++++++++++++++
 t/t7004-tag.sh                     |  28 ++++
 17 files changed, 866 insertions(+), 91 deletions(-)
 create mode 100755 t/perf/p1500-graph-walks.sh


base-commit: 725f57037d81e24eacfda6e59a19c60c0b4c8062
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1489%2Fderrickstolee%2Fstolee%2Fupstream-ahead-behind-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1489/derrickstolee/stolee/upstream-ahead-behind-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1489

Range-diff vs v3:

  1:  f9e80e233f1 !  1:  27d94077aa9 for-each-ref: add --stdin option
     @@ builtin/for-each-ref.c: int cmd_for_each_ref(int argc, const char **argv, const
      +		while (strbuf_getline(&line, stdin) != EOF)
      +			strvec_push(&vec, line.buf);
      +
     ++		strbuf_release(&line);
     ++
      +		/* vec.v is NULL-terminated, just like 'argv'. */
      +		filter.name_patterns = vec.v;
      +	} else {
  2:  f56d6a64d24 =  2:  1e3d499431a for-each-ref: explicitly test no matches
  3:  3b15e9df770 !  3:  79a57f30a85 commit-graph: combine generation computations
     @@ Metadata
      Author: Derrick Stolee <derrickstolee@github.com>
      
       ## Commit message ##
     -    commit-graph: combine generation computations
     +    commit-graph: refactor compute_topological_levels()
      
          This patch extracts the common code used to compute topological levels
          and corrected committer dates into a common routine,
     -    compute_reachable_generation_numbers_1().
     +    compute_reachable_generation_numbers(). For ease of reading, it only
     +    modifies compute_topological_levels() to use this new routine, leaving
     +    compute_generation_numbers() to be modified in the next change.
      
          This new routine dispatches to call the necessary functions to get and
          set the generation number for a given commit through a vtable (the
     @@ Commit message
          Computing the generation number itself is done in
          compute_generation_from_max(), which dispatches its implementation based
          on the generation version requested, or issuing a BUG() for unrecognized
     -    generation versions.
     +    generation versions. This does not use a vtable because the logic
     +    depends only on the generation number version, not where the data is
     +    being loaded from or being stored to. This is a subtle point that will
     +    make more sense in a future change that modifies the in-memory
     +    generation values instead of just preparing values for writing to a
     +    commit-graph file.
      
     -    This patch cleans up the two places that currently compute topological
     -    levels and corrected commit dates by reducing the amount of duplicated
     -    code. It also makes it possible to introduce a function which
     -    dynamically computes those values for commits that aren't stored in a
     -    commit-graph, which will be required for the forthcoming ahead-behind
     -    rewrite.
     +    This change looks like it adds a lot of new code. However, two upcoming
     +    changes will be quite small due to the work being done in this change.
      
     +    Co-authored-by: Taylor Blau <me@ttaylorr.com>
          Signed-off-by: Taylor Blau <me@ttaylorr.com>
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
     @@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *c
      +	}
      +}
      +
     -+static void compute_reachable_generation_numbers_1(
     ++static void compute_reachable_generation_numbers(
      +			struct compute_generation_info *info,
      +			int generation_version)
       {
     @@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *c
      -	for (i = 0; i < ctx->commits.nr; i++) {
      -		struct commit *c = ctx->commits.list[i];
      -		uint32_t level;
     +-
     +-		repo_parse_commit(ctx->r, c);
     +-		level = *topo_level_slab_at(ctx->topo_levels, c);
      +	for (i = 0; i < info->commits->nr; i++) {
      +		struct commit *c = info->commits->list[i];
      +		timestamp_t gen;
      +		repo_parse_commit(info->r, c);
      +		gen = info->get_generation(c, info->data);
     - 
     --		repo_parse_commit(ctx->r, c);
     --		level = *topo_level_slab_at(ctx->topo_levels, c);
      +		display_progress(info->progress, info->progress_cnt + 1);
       
      -		display_progress(ctx->progress, i + 1);
     @@ commit-graph.c: static void compute_topological_levels(struct write_commit_graph
      +{
      +	struct write_commit_graph_context *ctx = data;
      +	*topo_level_slab_at(ctx->topo_levels, c) = (uint32_t)t;
     -+	display_progress(ctx->progress, ctx->progress_cnt + 1);
      +}
      +
      +static void compute_topological_levels(struct write_commit_graph_context *ctx)
      +{
      +	struct compute_generation_info info = {
      +		.r = ctx->r,
     -+		.progress = ctx->progress,
      +		.commits = &ctx->commits,
      +		.get_generation = get_topo_level,
      +		.set_generation = set_topo_level,
     @@ commit-graph.c: static void compute_topological_levels(struct write_commit_graph
      +	};
      +
      +	if (ctx->report_progress)
     -+		ctx->progress = start_delayed_progress(
     ++		info.progress = ctx->progress
     ++			      = start_delayed_progress(
      +					_("Computing commit graph topological levels"),
      +					ctx->commits.nr);
      +
     -+	compute_reachable_generation_numbers_1(&info, 1);
     -+
     - 	stop_progress(&ctx->progress);
     - }
     - 
     -+static timestamp_t get_generation_from_graph_data(struct commit *c, void *data)
     -+{
     -+	return commit_graph_data_at(c)->generation;
     -+}
     -+
     -+static void set_generation_v2(struct commit *c, timestamp_t t, void *data)
     -+{
     -+	struct write_commit_graph_context *ctx = data;
     -+	struct commit_graph_data *g = commit_graph_data_at(c);
     -+	g->generation = (uint32_t)t;
     -+	display_progress(ctx->progress, ctx->progress_cnt + 1);
     -+}
     -+
     - static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - {
     - 	int i;
     --	struct commit_list *list = NULL;
     -+	struct compute_generation_info info = {
     -+		.r = ctx->r,
     -+		.progress = ctx->progress,
     -+		.commits = &ctx->commits,
     -+		.get_generation = get_generation_from_graph_data,
     -+		.set_generation = set_generation_v2,
     -+		.data = ctx,
     -+	};
     - 
     - 	if (ctx->report_progress)
     - 		ctx->progress = start_delayed_progress(
     -@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 		}
     - 	}
     - 
     --	for (i = 0; i < ctx->commits.nr; i++) {
     --		struct commit *c = ctx->commits.list[i];
     --		timestamp_t corrected_commit_date;
     --
     --		repo_parse_commit(ctx->r, c);
     --		corrected_commit_date = commit_graph_data_at(c)->generation;
     --
     --		display_progress(ctx->progress, i + 1);
     --		if (corrected_commit_date != GENERATION_NUMBER_ZERO)
     --			continue;
     --
     --		commit_list_insert(c, &list);
     --		while (list) {
     --			struct commit *current = list->item;
     --			struct commit_list *parent;
     --			int all_parents_computed = 1;
     --			timestamp_t max_corrected_commit_date = 0;
     --
     --			for (parent = current->parents; parent; parent = parent->next) {
     --				repo_parse_commit(ctx->r, parent->item);
     --				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
     --
     --				if (corrected_commit_date == GENERATION_NUMBER_ZERO) {
     --					all_parents_computed = 0;
     --					commit_list_insert(parent->item, &list);
     --					break;
     --				}
     --
     --				if (corrected_commit_date > max_corrected_commit_date)
     --					max_corrected_commit_date = corrected_commit_date;
     --			}
     --
     --			if (all_parents_computed) {
     --				pop_commit(&list);
     --
     --				if (current->date && current->date > max_corrected_commit_date)
     --					max_corrected_commit_date = current->date - 1;
     --				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
     --			}
     --		}
     --	}
     -+	compute_reachable_generation_numbers_1(&info, 2);
     - 
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     - 		struct commit *c = ctx->commits.list[i];
     -@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     - 			ctx->num_generation_data_overflows++;
     - 	}
     ++	compute_reachable_generation_numbers(&info, 1);
      +
       	stop_progress(&ctx->progress);
       }
  -:  ----------- >  4:  3fd6c758129 commit-graph: simplify compute_generation_numbers()
  4:  abd3e7a67be !  5:  fed76f0f08e commit-graph: return generation from memory
     @@ Commit message
          for the given commit but the graph_pos indicated the commit was not in
          the commit-graph file.
      
     +    However, an upcoming change will introduce the ability to set generation
     +    values in-memory without writing the commit-graph file. Thus, we can no
     +    longer trust 'graph_pos' to indicate whether or not the generation
     +    member can be trusted.
     +
          Instead, trust the 'generation' member if the commit has a value in the
          slab _and_ the 'generation' member is non-zero. Otherwise, treat it as
          GENERATION_NUMBER_INFINITY.
  5:  e197bddcace !  6:  17a1fc9b15e commit-graph: introduce `ensure_generations_valid()`
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
      +		.set_generation = set_generation_in_graph_data,
      +	};
      +
     -+	compute_reachable_generation_numbers_1(&info, generation_version);
     ++	compute_reachable_generation_numbers(&info, generation_version);
      +}
      +
       static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
  6:  0fb3913810b !  7:  5d937184a0e commit-reach: implement ahead_behind() logic
     @@ Commit message
      
          The interface for ahead_behind() uses two arrays. The first array of
          commits contains the list of all starting points for the walk. This
     -    includes all tip commits _and_ base commits. The second array, using the
     -    new ahead_behind_count struct, indicates which commits from that initial
     -    array form the base/tip pair for the ahead/behind count it will store.
     +    includes all tip commits _and_ base commits. The second array specifies
     +    base/tip pairs by pointing to commits within the first array, by index.
     +    The second array also stores the resulting ahead/behind counts for each
     +    of these pairs.
      
          This implementation of ahead_behind() allows multiple bases, if desired.
          Even with multiple bases, there is only one commit walk used for
     @@ Commit message
      
          Now, let's discuss the ahead/behind counting algorithm.
      
     -    Each commit in the input commit list is associated with a bit position
     -    indicating "the ith commit can reach this commit". Each of these commits
     -    is associated with a bitmap with its position flipped on and then
     -    placed in a queue for walking commit history. We walk commits by popping
     -    the commit with maximum generation number out of the queue, guaranteeing
     -    that we will never walk a child of that commit in any future steps.
     +    The first array of commits are considered the starting commits. The
     +    index within that array will play a critical role.
     +
     +    We create a new commit slab that maps commits to a bitmap. For a given
     +    commit (anywhere in the history), its bitmap stores information relative
     +    to which of the input commits can reach that commit. The ith bit will be
     +    on if the ith commit from the starting list can reach that commit. It is
     +    important to notice that these bitmaps are not the typical "reachability
     +    bitmaps" that are stored in .bitmap files. Instead of signalling which
     +    objects are reachable from the current commit, they instead signal
     +    "which starting commits can reach me?" It is also important to know that
     +    the bitmap is not necessarily "complete" until we walk that commit. We
     +    will perform a commit walk by generation number in such a way that we
     +    can guarantee the bitmap is correct when we visit that commit.
     +
     +    At the beginning of the ahead_behind() method, we initialize the bitmaps
     +    for each of the starting commits. By enabling the ith bit for the ith
     +    starting commit, we signal "the ith commit can reach itself."
     +
     +    We walk commits by popping the commit with maximum generation number out
     +    of the queue, guaranteeing that we will never walk a child of that
     +    commit in any future steps.
      
          As we walk, we load the bitmap for the current commit and perform two
          main steps. The _second_ step examines each parent of the current commit
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
      +	c->object.flags |= PARENT2;
      +}
      +
     -+static struct bitmap *init_bit_array(struct commit *c, int width)
     ++static struct bitmap *get_bit_array(struct commit *c, int width)
      +{
      +	struct bitmap **bitmap = bit_arrays_at(&bit_arrays, c);
      +	if (!*bitmap)
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
      +		  struct ahead_behind_count *counts, size_t counts_nr)
      +{
      +	struct prio_queue queue = { .compare = compare_commits_by_gen_then_commit_date };
     -+	size_t width = (commits_nr + BITS_IN_EWORD - 1) / BITS_IN_EWORD;
     ++	size_t width = DIV_ROUND_UP(commits_nr, BITS_IN_EWORD);
      +
      +	if (!commits_nr || !counts_nr)
      +		return;
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
      +
      +	for (size_t i = 0; i < commits_nr; i++) {
      +		struct commit *c = commits[i];
     -+		struct bitmap *bitmap = init_bit_array(c, width);
     ++		struct bitmap *bitmap = get_bit_array(c, width);
      +
      +		bitmap_set(bitmap, i);
      +		insert_no_dup(&queue, c);
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
      +	while (queue_has_nonstale(&queue)) {
      +		struct commit *c = prio_queue_get(&queue);
      +		struct commit_list *p;
     -+		struct bitmap *bitmap_c = init_bit_array(c, width);
     ++		struct bitmap *bitmap_c = get_bit_array(c, width);
      +
      +		for (size_t i = 0; i < counts_nr; i++) {
      +			int reach_from_tip = !!bitmap_get(bitmap_c, counts[i].tip_index);
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
      +
      +			repo_parse_commit(r, p->item);
      +
     -+			bitmap_p = init_bit_array(p->item, width);
     ++			bitmap_p = get_bit_array(p->item, width);
      +			bitmap_or(bitmap_p, bitmap_c);
      +
     ++			/*
     ++			 * If this parent is reachable from every starting
     ++			 * commit, then none of its ancestors can contribute
     ++			 * to the ahead/behind count. Mark it as STALE, so
     ++			 * we can stop the walk when every commit in the
     ++			 * queue is STALE.
     ++			 */
      +			if (bitmap_popcount(bitmap_p) == commits_nr)
      +				p->item->object.flags |= STALE;
      +
  7:  59cf6759e60 !  8:  b8523e2be0b for-each-ref: add ahead-behind format atom
     @@ ref-filter.c: static int rest_atom_parser(struct ref_format *format, struct used
      +static int ahead_behind_atom_parser(struct ref_format *format, struct used_atom *atom,
      +				    const char *arg, struct strbuf *err)
      +{
     ++	struct string_list_item *item;
     ++
      +	if (!arg)
     -+		return strbuf_addf_ret(err, -1, _("expected format: %%(ahead-behind:<ref>)"));
     ++		return strbuf_addf_ret(err, -1, _("expected format: %%(ahead-behind:<committish>)"));
     ++
     ++	item = string_list_append(&format->bases, arg);
     ++	item->util = lookup_commit_reference_by_name(arg);
     ++	if (!item->util)
     ++		die("failed to find '%s'", arg);
      +
     -+	string_list_append(&format->bases, arg);
      +	return 0;
      +}
      +
     @@ ref-filter.c: static void reach_filter(struct ref_array *array,
      +		return;
      +
      +	ALLOC_ARRAY(commits, commits_nr);
     -+	for (size_t i = 0; i < format->bases.nr; i++) {
     -+		const char *name = format->bases.items[i].string;
     -+		commits[i] = lookup_commit_reference_by_name(name);
     -+		if (!commits[i])
     -+			die("failed to find '%s'", name);
     -+	}
     ++	for (size_t i = 0; i < format->bases.nr; i++)
     ++		commits[i] = format->bases.items[i].util;
      +
      +	ALLOC_ARRAY(array->counts, st_mult(format->bases.nr, array->nr));
      +
     @@ t/t6301-for-each-ref-errors.sh: test_expect_success 'Missing objects are reporte
      +test_expect_success 'ahead-behind requires an argument' '
      +	test_must_fail git for-each-ref \
      +		--format="%(ahead-behind)" 2>err &&
     -+	echo "fatal: expected format: %(ahead-behind:<ref>)" >expect &&
     ++	echo "fatal: expected format: %(ahead-behind:<committish>)" >expect &&
      +	test_cmp expect err
      +'
      +
  8:  7476a39331e =  9:  87fe9676aec commit-reach: add tips_reachable_from_bases()