diff mbox series

[2/4] commit-graph: write a Bloom filter containing changed paths for each commit

Message ID 20181009193445.21908-3-szeder.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series Bloom filter experiment | expand

Commit Message

SZEDER Gábor Oct. 9, 2018, 7:34 p.m. UTC
You can create a Bloom filter containing changed paths for each commit
in the history by running:

  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write

where the value of $GIT_USE_POC_BLOOM_FILTER must specify the number
of bits used in the Bloom filter's bitmap.

Writing the Bloom filter is tied into the 'git commit-graph' command,
mainly because that's where it might end up anyway, if it turns out to
be useful, but for now it's written to a different file
('object/info/bloom').  No incremental updates yet, the Bloom filter
is regenerated from scratch each time.

There is one single, big Bloom filter for the whole history (mainly
because that was the simplest way to get this PoC experiment up and
running).  The Bloom filter stores tuples of (path, parent-oid,
commit-oid) using the hash function:

  XOR(SHA1(path), XOR(parent-oid, commit-oid))

The resulting 20 bytes are turned into 5 unsigned 32 bit ints, which
then specify the positions of the bits to set or check in the Bloom
filter's bitmap (modulo the bitmap's size).

The parent oid is taken into account, because during revision walking
the diff is checked in rev_compare_tree(), which compares one commit
to _one_ of its parents, and in case of merge commits there are
multiple rev_compare_tree() calls with the same commit but with
different parent parameters.

Combining hashes with XOR is, in general, frowned upon, because of its
intrinsic properties:

  XOR(A, A) = 0
  XOR(A, B) = XOR(B, A)

In this case it should be fine, because all of XOR's operands are
cryptographic hashes, so we can safely assume that they'll never be
the same.

Add each leading directory of the changed file, i.e. for
'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so the Bloom
filter could be used to speed up commands like 'git log dir/subdir',
too.

Creating the Bloom filter is sloooow.  Running it on git.git takes
about 23s on my hardware, while

  git log --format='%H%n%P' --name-only --all >/dev/null

gathers all the information necessary for that in about 5.3s.

About 30% of the runtime is wasted by naively hashing and rehashing
the same paths over and over again.  A hash function faster than SHA1
could help with that; I just haven't yet bothered with spicing up
memhash() and friends to produce 5 ints, and neither wanted to
introduce another hash function with wideer output just yet.  Or
perhaps our hashmap mapping paths of files to their SHAs and the SHAs
of their leading directories...

That's not the only factor though.  After ripping out all the loops
from add_changes_to_bloom_filter() there are no repeated SHA1(path)
calculations and no writes to the Bloom filter at all, i.e. all what
remains is revision walking and diffing, yet it still takes about 16s,
i.e. aroung 3 times more than the above mentioned 'git log' command.
I guess some other fields in 'struct rev_info' or 'struct
diff_options' need to be set, but both of those are huge, and I
haven't yet spotted which ones.
---
 commit-graph.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

Comments

Jeff King Oct. 9, 2018, 9:06 p.m. UTC | #1
On Tue, Oct 09, 2018 at 09:34:43PM +0200, SZEDER Gábor wrote:

> Creating the Bloom filter is sloooow.  Running it on git.git takes
> about 23s on my hardware, while
> 
>   git log --format='%H%n%P' --name-only --all >/dev/null
> 
> gathers all the information necessary for that in about 5.3s.

That command won't open the trees for merges at all. But your
implementation here looks like it does a diff against each parent of a
merge. Adding "-m" would be a more accurate comparison, I think.

Though I find that puzzling, because "-m --name-only" seems to take
about 20x longer, not 3x. So perhaps I'm missing something.

-Peff
SZEDER Gábor Oct. 9, 2018, 9:37 p.m. UTC | #2
On Tue, Oct 09, 2018 at 05:06:20PM -0400, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:34:43PM +0200, SZEDER Gábor wrote:
> 
> > Creating the Bloom filter is sloooow.  Running it on git.git takes
> > about 23s on my hardware, while
> > 
> >   git log --format='%H%n%P' --name-only --all >/dev/null
> > 
> > gathers all the information necessary for that in about 5.3s.
> 
> That command won't open the trees for merges at all. But your
> implementation here looks like it does a diff against each parent of a
> merge.

Yeah, it does so, because that is what try_to_simplify_commit() /
rev_compare_tree() will do while traversing the history.

> Adding "-m" would be a more accurate comparison, I think.
> 
> Though I find that puzzling, because "-m --name-only" seems to take
> about 20x longer, not 3x. So perhaps I'm missing something.

Ugh, indeed.
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index a1454c52a6..f415d3b41f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -14,6 +14,9 @@ 
 #include "object-store.h"
 #include "alloc.h"
 #include "progress.h"
+#include "bloom-filter.h"
+#include "diff.h"
+#include "diffcore.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -709,6 +712,117 @@  static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
+static void add_changes_to_bloom_filter(struct bloom_filter *bf,
+					struct commit *parent,
+					struct commit *commit,
+					struct diff_options *diffopt)
+{
+	unsigned char p_c_hash[GIT_MAX_RAWSZ];
+	int i;
+
+	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
+
+	diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
+	diffcore_std(diffopt);
+
+	for (i = 0; i < diff_queued_diff.nr; i++) {
+		const char *path = diff_queued_diff.queue[i]->two->path;
+		const char *p = path;
+
+		/*
+		 * Add each leading directory of the changed file, i.e. for
+		 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+		 * the Bloom filter could be used to speed up commands like
+		 * 'git log dir/subdir', too.
+		 *
+		 * Note that directories are added without the trailing '/'.
+		 */
+		do {
+			git_hash_ctx ctx;
+			unsigned char name_hash[GIT_MAX_RAWSZ];
+			unsigned char hash[GIT_MAX_RAWSZ];
+
+			p = strchrnul(p + 1, '/');
+
+			/*
+			 * Beware all the wasted CPU cycles!
+			 *
+			 * Most paths change (a lot) more than once in the
+			 * history of a repository, so this hashes the same
+			 * paths over and over again, accounting for almost
+			 * 40% of the runtime.
+			 */
+			the_hash_algo->init_fn(&ctx);
+			the_hash_algo->update_fn(&ctx, path, p - path);
+			the_hash_algo->final_fn(name_hash, &ctx);
+
+			hashxor(name_hash, p_c_hash, hash);
+			bloom_filter_add_hash(bf, hash);
+		} while (*p);
+
+		diff_free_filepair(diff_queued_diff.queue[i]);
+	}
+
+	free(diff_queued_diff.queue);
+	DIFF_QUEUE_CLEAR(&diff_queued_diff);
+}
+
+static void fill_bloom_filter(struct bloom_filter *bf,
+				    struct progress *progress)
+{
+	struct rev_info revs;
+	const char *revs_argv[] = {NULL, "--all", NULL};
+	struct commit *commit;
+	int i = 0;
+
+	/* We (re-)create the bloom filter from scratch every time for now. */
+	init_revisions(&revs, NULL);
+	revs.diffopt.flags.recursive = 1;
+	setup_revisions(2, revs_argv, &revs, NULL);
+
+	if (prepare_revision_walk(&revs))
+		die("revision walk setup failed while preparing bloom filter");
+
+	while ((commit = get_revision(&revs))) {
+		struct commit_list *parent;
+
+		for (parent = commit->parents; parent; parent = parent->next)
+			add_changes_to_bloom_filter(bf, parent->item, commit,
+						    &revs.diffopt);
+
+		display_progress(progress, ++i);
+	}
+}
+
+static void write_bloom_filter(int report_progress, int commit_nr)
+{
+	struct bloom_filter bf;
+	struct progress *progress = NULL;
+	const char *v = getenv("GIT_USE_POC_BLOOM_FILTER");
+	unsigned int bitsize;
+	char *end;
+
+	if (!v)
+		return;
+
+	bitsize = strtol(v, &end, 10);
+	if (*end)
+		die("GIT_USE_POC_BLOOM_FILTER must specify the number of bits in the bloom filter (multiple of 8, n < 2^32)");
+
+	bloom_filter_init(&bf, bitsize);
+
+	if (report_progress)
+		progress = start_progress(_("Computing bloom filter"),
+					  commit_nr);
+
+	fill_bloom_filter(&bf, progress);
+
+	bloom_filter_write(&bf);
+	bloom_filter_free(&bf);
+
+	stop_progress(&progress);
+}
+
 void write_commit_graph_reachable(const char *obj_dir, int append,
 				  int report_progress)
 {
@@ -916,6 +1030,8 @@  void write_commit_graph(const char *obj_dir,
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
+	write_bloom_filter(report_progress, commits.nr);
+
 	free(graph_name);
 	free(commits.list);
 	free(oids.list);