mbox series

[v4,00/14] more miscellaneous Bloom filter improvements

Message ID cover.1599172907.git.me@ttaylorr.com (mailing list archive)
Headers show
Series more miscellaneous Bloom filter improvements | expand

Message

Taylor Blau Sept. 3, 2020, 10:45 p.m. UTC
Hi,

Here is another reroll of my series to introduce a '--max-new-filters'
option to the 'git commit-graph write' sub-command to limit the number
of changed-path Bloom filters a single process is willing to compute
from scratch.

As a reminder (since it's been a while since v3), this is done by adding
a new 'BFXL' chunk, which specifies which Bloom filters (a) have been
computed and (b) were too large to store, and thus are encoded as
zero-length filters.

Things seem to have settled down since the review in v3, so I'm hoping
that this is what will end up being queued.

Thanks again for all of your review, and sorry for the prolonged radio
silence on this topic. I've had my hands full working on something else
that I'm hoping to start testing and sending to the list soon.

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (13):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  commit-graph.c: sort index into commits list
  csum-file.h: introduce 'hashwrite_be64()'
  commit-graph: add large-filters bitmap chunk
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |  13 +
 blame.c                                       |   8 +-
 bloom.c                                       |  43 ++-
 bloom.h                                       |  22 +-
 builtin/commit-graph.c                        |  61 +++-
 commit-graph.c                                | 278 +++++++++++++-----
 commit-graph.h                                |  28 +-
 csum-file.h                                   |   6 +
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 make: *** [Makefile                           |   0
 midx.c                                        |   3 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          | 173 +++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 24 files changed, 553 insertions(+), 143 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt
 create mode 100644 make: *** [Makefile

Range-diff against v3:
[rebased onto master]
  1:  e714e54240 = 232:  97d80f109f commit-graph: introduce 'get_bloom_filter_settings()'
  2:  9fc8b17d6f = 233:  d60698d2f8 t4216: use an '&&'-chain
  3:  8dbe4838b7 = 234:  639a962a49 commit-graph: pass a 'struct repository *' in more places
  4:  f59db1e30d = 235:  ccab59dfe4 t/helper/test-read-graph.c: prepare repo settings
  5:  daae6788c0 = 236:  8aff54d83e commit-graph: respect 'commitGraph.readChangedPaths'
  6:  bf498844ef = 237:  965489d361 commit-graph.c: store maximum changed paths
  7:  eba2794873 = 238:  ba89a0cb83 bloom: split 'get_bloom_filter()' in two
  8:  4f08177dbe = 239:  89bedba089 bloom: use provided 'struct bloom_filter_settings'
  9:  cc1dc8b121 = 240:  427f129656 bloom/diff: properly short-circuit on max_changes
 10:  23fd52c3b8 = 241:  08b5f185f6 commit-graph.c: sort index into commits list
 11:  4800cd373e = 242:  d7cbd4ca1a csum-file.h: introduce 'hashwrite_be64()'
 12:  619e0c619d ! 243:  3063beb588 commit-graph: add large-filters bitmap chunk
    @@ Commit message

         When a commit has more than a certain number of changed paths (commonly
         512), the commit-graph machinery represents it as a zero-length filter.
    -    This is done since having many entries in the Bloom filter has
    -    undesirable effects on the false positivity rate.
    -
         In addition to these too-large filters, the commit-graph machinery also
         represents commits with no filter and commits with no changed paths in
         the same way.
    @@ Commit message
         data in network byte order from the 64-bit words. This means we also
         need to read the array from the commit-graph file by translating each
         word from network byte order using get_be64() when loading the commit
    -    graph. (Note that this *could* be delayed until first-use, but a later
    -    patch will rely on this being initialized early, so we assume the
    -    up-front cost when parsing instead of delaying initialization).
    +    graph. Initialize this bitmap lazily to avoid paying a linear-time cost
    +    upon each commit-graph load even if we do not need the bitmaps
    +    themselves.

         By avoiding the need to move to new versions of the BDAT and BIDX chunk,
         we can give ourselves more time to consider whether or not other
    @@ Commit message
         except the most-significant bit of each offset is interpreted as "this
         filter is too big" iff looking at a BID2 chunk. This avoids having to
         write a bitmap, but forces older clients to rewrite their commit-graphs
    -    (as well as reduces the theoretical largest Bloom filters we couldl
    +    (as well as reduces the theoretical largest Bloom filters we could
         write, and forces us to maintain the code necessary to translate BIDX
         chunks to BID2 ones). Separately from this patch, I implemented this
         alternate approach and did not find it to be advantageous.
    @@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
     +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
     +    * It starts with a 32-bit unsigned integer specifying the maximum number of
     +      changed-paths that can be stored in a single Bloom filter.
    -+    * It then contains a list of 64-bit words (the length of this list is
    -+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
    -+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
    -+      filter with zero entries (either because the commit is empty, or because
    -+      it contains more than 512 changed paths).
    ++    * It then contains a list of 64-bit words in network order (the length of
    ++      this list is determined by the width of the chunk) which is a bitmap. The
    ++      'i'th bit is set exactly when the 'i'th commit in the graph has a
    ++      changed-path Bloom filter with zero entries (either because the commit is
    ++      empty, or because it contains more entries than is allowed per filter by
    ++      the layer that contains it).
     +    * The BFXL chunk is present only when the BIDX and BDAT chunks are
     +      also present.
     +
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
     +			if (graph->chunk_bloom_large_filters)
     +				chunk_repeated = 1;
     +			else if (r->settings.commit_graph_read_changed_paths) {
    -+				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
     +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
    ++				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
     +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
    -+				if (alloc) {
    -+					size_t j;
    -+					graph->bloom_large = bitmap_word_alloc(alloc);
    -+
    -+					for (j = 0; j < graph->bloom_large->word_alloc; j++)
    -+						graph->bloom_large->words[j] = get_be64(
    -+							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
    -+				}
     +			}
     +			break;
      		}
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      		graph->chunk_bloom_indexes = NULL;
      		graph->chunk_bloom_data = NULL;
     +		graph->chunk_bloom_large_filters = NULL;
    ++		graph->bloom_large_alloc = 0;
      		FREE_AND_NULL(graph->bloom_filter_settings);
    -+		bitmap_free(graph->bloom_large);
      	}

    - 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
     @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
      	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
      }

    -+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    -+					   const struct commit *c)
    ++int get_bloom_filter_large_in_graph(struct commit_graph *g,
    ++				    const struct commit *c)
     +{
    -+	uint32_t graph_pos = commit_graph_position(c);
    -+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
    ++	uint32_t graph_pos;
    ++	if (!find_commit_in_graph(c, g, &graph_pos))
     +		return 0;
     +
     +	while (g && graph_pos < g->num_commits_in_base)
     +		g = g->base_graph;
     +
    -+	if (!(g && g->bloom_large))
    ++	if (!g)
    ++		return 0;
    ++
    ++	if (!g->bloom_large && g->bloom_large_alloc) {
    ++		size_t i;
    ++		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
    ++
    ++		for (i = 0; i < g->bloom_large->word_alloc; i++)
    ++			g->bloom_large->words[i] = get_be64(
    ++				g->chunk_bloom_large_filters + i * sizeof(eword_t));
    ++	}
    ++
    ++	if (!g->bloom_large)
     +		return 0;
     +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
     +}
    @@ commit-graph.h

      #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
      #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
    +@@ commit-graph.h: void load_commit_graph_info(struct repository *r, struct commit *item);
    + struct tree *get_commit_tree_in_graph(struct repository *r,
    + 				      const struct commit *c);
    +
    ++int get_bloom_filter_large_in_graph(struct commit_graph *g,
    ++				    const struct commit *c);
    ++
    + struct commit_graph {
    + 	const unsigned char *data;
    + 	size_t data_len;
     @@ commit-graph.h: struct commit_graph {
      	const unsigned char *chunk_base_graphs;
      	const unsigned char *chunk_bloom_indexes;
    @@ commit-graph.h: struct commit_graph {
     +	const unsigned char *chunk_bloom_large_filters;
     +
     +	struct bitmap *bloom_large;
    ++	size_t bloom_large_alloc;

      	struct bloom_filter_settings *bloom_filter_settings;
      };
    +@@ commit-graph.h: struct commit_graph *read_commit_graph_one(struct repository *r,
    + struct commit_graph *parse_commit_graph(struct repository *r,
    + 					void *graph_map, size_t graph_size);
    +
    ++void prepare_commit_graph_bloom_large(struct commit_graph *g);
    ++
    + /*
    +  * Return 1 if and only if the repository has a commit-graph
    +  * file and generation numbers are computed in that file.

      ## t/t4216-log-bloom.sh ##
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
    - 	git commit-graph write --reachable --changed-paths
    + 	EOF
      '
      graph_read_expect () {
     -	NUM_CHUNKS=5
     +	NUM_CHUNKS=6
      	cat >expect <<- EOF
    - 	header: 43475048 1 1 $NUM_CHUNKS 0
    + 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
      	num_commits: $1
     @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report changes over limit' '
      		done
 13:  b2e33ecba8 ! 244:  ee0bc109f3 commit-graph: rename 'split_commit_graph_opts'
    @@ Commit message
         commit-graph API which have nothing to do with splitting.

         Rename the 'split_commit_graph_opts' structure to the more-generic
    -    'commit_graph_opts' to encompass both.
    +    'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
    +    member to instead be 'split_flags' to clarify that it only has to do
    +    with the behavior implied by '--split'.

    -    Suggsted-by: Derrick Stolee <dstolee@microsoft.com>
    +    Suggested-by: Derrick Stolee <dstolee@microsoft.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## builtin/commit-graph.c ##
    @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
      			N_("enable computation for changed paths")),
      		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
     -		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
    -+		OPT_CALLBACK_F(0, "split", &write_opts.flags, NULL,
    ++		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
      			N_("allow writing an incremental commit-graph file"),
      			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
      			write_option_parse_split),
    @@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *c
     -	enum commit_graph_split_flags flags = ctx->split_opts ?
     -		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
     +	enum commit_graph_split_flags flags = ctx->opts ?
    -+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
    ++		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

      	if (ctx->report_progress)
      		ctx->progress = start_delayed_progress(
    @@ commit-graph.c: static uint32_t count_distinct_commits(struct write_commit_graph
     -	enum commit_graph_split_flags flags = ctx->split_opts ?
     -		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
     +	enum commit_graph_split_flags flags = ctx->opts ?
    -+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
    ++		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

      	ctx->num_extra_edges = 0;
      	if (ctx->report_progress)
    @@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph
     +			size_mult = ctx->opts->size_multiple;

     -		flags = ctx->split_opts->flags;
    -+		flags = ctx->opts->flags;
    ++		flags = ctx->opts->split_flags;
      	}

      	g = ctx->r->objects->commit_graph;
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     -		if (ctx->split_opts)
     -			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
     +		if (ctx->opts)
    -+			replace = ctx->opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
    ++			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
      	}

      	ctx->approx_nr_objects = approximate_object_count();
    @@ commit-graph.h: enum commit_graph_split_flags {
      	int size_multiple;
      	int max_commits;
      	timestamp_t expire_time;
    +-	enum commit_graph_split_flags flags;
    ++	enum commit_graph_split_flags split_flags;
    + };
    +
    + /*
     @@ commit-graph.h: struct split_commit_graph_opts {
       */
      int write_commit_graph_reachable(struct object_directory *odb,
    @@ commit-graph.h: struct split_commit_graph_opts {

      #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)

    +
    + ## make: *** [Makefile (new) ##
 14:  09f6871f66 ! 245:  cd0a9da639 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ Commit message
         builtin/commit-graph.c: introduce '--max-new-filters=<n>'

         Introduce a command-line flag and configuration variable to fill in the
    -    'max_new_filters' variable introduced by the previous patch.
    +    'max_new_filters' variable introduced two patches ago.

         The command-line option '--max-new-filters' takes precedence over
         'commitGraph.maxNewFilters', which is the default value.
    @@ Documentation/git-commit-graph.txt: this option is given, future commit-graph wr
      +
     +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
     +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    -+enforced. Overrides the `commitGraph.maxNewFilters` configuration.
    ++enforced. Commits whose filters are not calculated are stored as a
    ++length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
    ++Overrides the `commitGraph.maxNewFilters` configuration.
     ++
      With the `--split[=<strategy>]` option, write the commit-graph as a
      chain of multiple commit-graph files stored in
      `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the

      ## bloom.c ##
    -@@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    - 	else
    - 		start_index = 0;
    +@@ bloom.c: struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

    -+	if ((start_index == end_index) &&
    -+	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
    -+		/*
    -+		 * If the filter is zero-length, either (1) the filter has no
    -+		 * changes, (2) the filter has too many changes, or (3) it
    -+		 * wasn't computed (eg., due to '--max-new-filters').
    -+		 *
    -+		 * If either (1) or (2) is the case, the 'large' bit will be set
    -+		 * for this Bloom filter. If it is unset, then it wasn't
    -+		 * computed. In that case, return nothing, since we don't have
    -+		 * that filter in the graph.
    -+		 */
    -+		return 0;
    -+	}
    + 	if (!filter->data) {
    + 		load_commit_graph_info(r, c);
    +-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
    +-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
    +-				return filter;
    ++		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
    ++			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
    + 	}
    +
    +-	if (filter->data)
    ++	if (filter->data && filter->len)
    + 		return filter;
    + 	if (!compute_if_not_present)
    + 		return NULL;
    +
    ++	if (filter && !filter->len &&
    ++	    get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
    ++					    settings->max_changed_paths))
    ++		return filter;
    ++
     +
    - 	filter->len = end_index - start_index;
    - 	filter->data = (unsigned char *)(g->chunk_bloom_data +
    - 					sizeof(unsigned char) * start_index +
    + 	repo_diff_setup(r, &diffopt);
    + 	diffopt.flags.recursive = 1;
    + 	diffopt.detect_rename = 0;

      ## builtin/commit-graph.c ##
     @@ builtin/commit-graph.c: static char const * const builtin_commit_graph_usage[] = {
    @@ commit-graph.c
     @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
      }

    - static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    --					   const struct commit *c)
    -+					   const struct commit *c,
    -+					   uint32_t max_changed_paths)
    + int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +-				    const struct commit *c)
    ++				    const struct commit *c,
    ++				    uint32_t max_changed_paths)
      {
    - 	uint32_t graph_pos = commit_graph_position(c);
    - 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
    -@@ commit-graph.c: static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    + 	uint32_t graph_pos;
    + 	if (!find_commit_in_graph(c, g, &graph_pos))
    +@@ commit-graph.c: int get_bloom_filter_large_in_graph(struct commit_graph *g,

    - 	if (!(g && g->bloom_large))
    + 	if (!g->bloom_large)
      		return 0;
     +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
     +		/*
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
      		&ctx->commits);

    -+	max_new_filters = ctx->opts->max_new_filters >= 0 ?
    ++	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
     +		ctx->opts->max_new_filters : ctx->commits.nr;
     +
      	for (i = 0; i < ctx->commits.nr; i++) {
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      	}

      ## commit-graph.h ##
    +@@ commit-graph.h: struct tree *get_commit_tree_in_graph(struct repository *r,
    + 				      const struct commit *c);
    +
    + int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +-				    const struct commit *c);
    ++				    const struct commit *c,
    ++				    uint32_t max_changed_paths);
    +
    + struct commit_graph {
    + 	const unsigned char *data;
     @@ commit-graph.h: struct commit_graph_opts {
      	int max_commits;
      	timestamp_t expire_time;
    - 	enum commit_graph_split_flags flags;
    + 	enum commit_graph_split_flags split_flags;
     +	int max_new_filters;
      };

    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation does not recompute t
     +			2 0 1
     +	)
     +'
    ++
    ++test_expect_success 'Bloom generation backfills empty commits' '
    ++	git init empty &&
    ++	test_when_finished "rm -fr empty" &&
    ++	(
    ++		cd empty &&
    ++		for i in $(test_seq 1 6)
    ++		do
    ++			git commit --allow-empty -m "$i"
    ++		done &&
    ++
    ++		# Generate Bloom filters for empty commits 1-6, two at a time.
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			0 2 2 &&
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			2 2 2 &&
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			4 2 2 &&
    ++
    ++		# Finally, make sure that once all commits have filters, that
    ++		# none are subsequently recomputed.
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			6 0 0
    ++	)
    ++'
     +
      test_done
--
2.27.0.2918.gc99a27ff8f

Comments

Derrick Stolee Sept. 4, 2020, 2:39 p.m. UTC | #1
On 9/3/2020 6:45 PM, Taylor Blau wrote:
> Things seem to have settled down since the review in v3, so I'm hoping
> that this is what will end up being queued.
I haven't paid close attention to this topic, but I took a close
look at v4 and found nothing to improve. I'm happy with this
version.

Thanks,
-Stolee