Message ID: pull.497.v2.git.1580943390.gitgitgadget@gmail.com (mailing list archive)
Series: Changed Paths Bloom Filters
On Wed, Feb 05, 2020 at 10:56:19PM +0000, Garima Singh via GitGitGadget wrote:
> Hey!
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> [2]
>
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>
> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic.
>  3. Interactions with line log git log -L
>
> ----------------------------------------------------------------------------
>
> Updates since the last submission
>
>  * Removed all the RFC callouts, this is a ready for full review version

Don't know when I'll find enough time to properly review the series.
Maybe someday...

>  * Added unit tests for the bloom filter computation layer

This fails on big endian, e.g. in Travis CI's s390x build:

  https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210

(The link highlights the failure, but I'm afraid your browser won't
jump there right away; you'll have to click on the print-test-failures
fold at the bottom, and scroll down a bit...)
On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>> * Added unit tests for the bloom filter computation layer
>
> This fails on big endian, e.g. in Travis CI's s390x build:
>
> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>
> (The link highlights the failure, but I'm afraid your browser won't
> jump there right away; you'll have to click on the print-test-failures
> fold at the bottom, and scroll down a bit...)
>

Thank you so much for running this pipeline and pointing out the error!

We will carefully review our interactions with the binary data and
hopefully solve this in the next version.

Cheers!
Garima Singh
On 2/7/2020 10:09 AM, Garima Singh wrote:
>
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>> * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
>
> Thank you so much for running this pipeline and pointing out the error!
>
> We will carefully review our interactions with the binary data and
> hopefully solve this in the next version.

Szeder,

Thanks so much for running this test. We don't have access to a big endian
machine right now, so could you please apply this patch and re-run your tests?

The issue is described in the message below, and Garima is working to ensure
the handling of the filter data is clarified in the next version.

This is an issue from WAY back in the original prototype, and it highlights
that we've never been writing the data in network byte order. This is
completely my fault.

Thanks,
-Stolee

-->8--

From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 7 Feb 2020 10:24:05 -0500
Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
 paths

The 'data' field of 'struct bloom_filter' can point to a memory location
(when computing one before writing to the commit-graph) or an mmap()'d
file location (when reading from the Bloom data chunk of the commit-graph
file). This means that the memory representation may be backwards on
Little Endian or Big Endian machines.

Always write and read bits from 'filter->data' using network order. This
allows us to avoid loading the data streams from the file into memory
buffers.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c               | 6 ++++--
 t/helper/test-bloom.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index 90d84dc713..aa6896584b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
 	for (i = 0; i < settings->num_hashes; i++) {
 		uint64_t hash_mod = key->hashes[i] % mod;
 		uint64_t block_pos = hash_mod / BITS_PER_WORD;
+		uint64_t bit = get_bitmask(hash_mod);
 
-		filter->data[block_pos] |= get_bitmask(hash_mod);
+		filter->data[block_pos] |= htonll(bit);
 	}
 }
 
@@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
 	for (i = 0; i < settings->num_hashes; i++) {
 		uint64_t hash_mod = key->hashes[i] % mod;
 		uint64_t block_pos = hash_mod / BITS_PER_WORD;
-		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+		uint64_t bit = get_bitmask(hash_mod);
+		if (!(filter->data[block_pos] & htonll(bit)))
 			return 0;
 	}
 
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..09b2bb0a00 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
 	printf("Filter_Length:%d\n", filter->len);
 	printf("Filter_Data:");
 	for (i = 0; i < filter->len; i++){
-		printf("%"PRIx64"|", filter->data[i]);
+		printf("%"PRIx64"|", ntohll(filter->data[i]));
 	}
 	printf("\n");
 }
On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
> On 2/7/2020 10:09 AM, Garima Singh wrote:
> >
> > On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
> >>> * Added unit tests for the bloom filter computation layer
> >>
> >> This fails on big endian, e.g. in Travis CI's s390x build:
> >>
> >> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
> >>
> >> (The link highlights the failure, but I'm afraid your browser won't
> >> jump there right away; you'll have to click on the print-test-failures
> >> fold at the bottom, and scroll down a bit...)
> >>
> >
> > Thank you so much for running this pipeline and pointing out the error!
> >
> > We will carefully review our interactions with the binary data and
> > hopefully solve this in the next version.
>
> Szeder,
>
> Thanks so much for running this test. We don't have access to a big endian
> machine right now, so could you please apply this patch and re-run your tests?

Unfortunately, it still failed:

  https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204

> The issue is described in the message below, and Garima is working to ensure
> the handling of the filter data is clarified in the next version.
>
> This is an issue from WAY back in the original prototype, and it highlights
> that we've never been writing the data in network-byte order. This is completely
> my fault.
>
> Thanks,
> -Stolee
>
> -->8--
>
> From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Fri, 7 Feb 2020 10:24:05 -0500
> Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
>  paths
>
> The 'data' field of 'struct bloom_filter' can point to a memory location
> (when computing one before writing to the commit-graph) or a memmap()'d
> file location (when reading from the Bloom data chunk of the commit-graph
> file). This means that the memory representation may be backwards in
> Little Endian or Big Endian machines.
>
> Always write and read bits from 'filter->data' using network order. This
> allows us to avoid loading the data streams from the file into memory
> buffers.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c               | 6 ++++--
>  t/helper/test-bloom.c | 2 +-
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 90d84dc713..aa6896584b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
>  	for (i = 0; i < settings->num_hashes; i++) {
>  		uint64_t hash_mod = key->hashes[i] % mod;
>  		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +		uint64_t bit = get_bitmask(hash_mod);
>
> -		filter->data[block_pos] |= get_bitmask(hash_mod);
> +		filter->data[block_pos] |= htonll(bit);
>  	}
>  }
>
> @@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
>  	for (i = 0; i < settings->num_hashes; i++) {
>  		uint64_t hash_mod = key->hashes[i] % mod;
>  		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> -		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> +		uint64_t bit = get_bitmask(hash_mod);
> +		if (!(filter->data[block_pos] & htonll(bit)))
>  			return 0;
>  	}
>
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 9b4be97f75..09b2bb0a00 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
>  	printf("Filter_Length:%d\n", filter->len);
>  	printf("Filter_Data:");
>  	for (i = 0; i < filter->len; i++){
> -		printf("%"PRIx64"|", filter->data[i]);
> +		printf("%"PRIx64"|", ntohll(filter->data[i]));
>  	}
>  	printf("\n");
>  }
> --
> 2.25.0.vfs.1.1.1.g9906319d24.dirty
On 2/7/2020 11:15 AM, SZEDER Gábor wrote:
> On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
>> On 2/7/2020 10:09 AM, Garima Singh wrote:
>>>
>>> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>>>> * Added unit tests for the bloom filter computation layer
>>>>
>>>> This fails on big endian, e.g. in Travis CI's s390x build:
>>>>
>>>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>>>
>>>> (The link highlights the failure, but I'm afraid your browser won't
>>>> jump there right away; you'll have to click on the print-test-failures
>>>> fold at the bottom, and scroll down a bit...)
>>>>
>>>
>>> Thank you so much for running this pipeline and pointing out the error!
>>>
>>> We will carefully review our interactions with the binary data and
>>> hopefully solve this in the next version.
>>
>> Szeder,
>>
>> Thanks so much for running this test. We don't have access to a big endian
>> machine right now, so could you please apply this patch and re-run your tests?
>
> Unfortunately, it still failed:
>
> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204

Thanks! Both fail on test 2 of t0095-bloom.sh, which includes this expected
output line:

  Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|

We may not be properly adjusting the output in the test-helper. I still think
the fixup patch I included is a good idea, but Garima continues to dig into
the problem from all angles to understand this failure and the full fix.

-Stolee
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Hey!
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path Bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of
> concept in [2].

Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which
requires adding its implementation, instead of reusing the FNV-1 hash
(Fowler–Noll–Vo hash function) used by Git's hashmap implementation, see
https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109
Beside the fact that everyone is using MurmurHash for Bloom filters ;-)

It turns out that in various benchmarks MurmurHash is faster and also
slightly better as a hash than FNV-1 or FNV-1a.

I wonder then if it would be a good idea (in the future) to make it easy
to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make
it the default for hashing strings.

> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.

As I wrote in reply to the previous version of this series, a good public
repository (and thus one anyone can use) to test the Bloom filter
performance improvements could be the AOSP (Android) base:

  https://android.googlesource.com/platform/frameworks/base/

which is a large repository with long path depths (due to Java file
naming conventions).

> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Very nice! Good work!

What is the cost of this feature, that is, how long it takes to generate
the Bloom filters, and how much larger the commit-graph file gets? It
would be nice to know.

> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk

Shouldn't then the tests that were added in v2 mark the use of Bloom
filters with multi-path revision walking as _not working *yet*_
(test_expect_failure), rather than as expected not to work
(test_expect_success with test_bloom_filters_not_used)?

>  2. Adopting it in git blame logic.
>  3. Interactions with line log git log -L
>
> ----------------------------------------------------------------------------
>
> Updates since the last submission
>
>  * Removed all the RFC callouts, this is a ready for full review version
>  * Added unit tests for the bloom filter computation layer
>  * Added more evolved functional tests for git log
>  * Fixed a lot of the bugs found by the tests
>  * Reacted to other miscellaneous feedback on the RFC series.
>
> Cheers!
> Garima Singh
>
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
>
> Derrick Stolee (2):
>   diff: halt tree-diff early after max_changes
>   commit-graph: examine commits by generation number
>
> Garima Singh (8):
>   commit-graph: use MAX_NUM_CHUNKS
>   bloom: core Bloom filter implementation for changed paths
>   commit-graph: compute Bloom filters for changed paths
>   commit-graph: write Bloom filters to commit graph file
>   commit-graph: reuse existing Bloom filters during write.
>   commit-graph: add --changed-paths option to write subcommand
>   revision.c: use Bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>
> Jeff King (1):
>   commit-graph: examine changed-path objects in pack order

The shortlog summary is a fine tool to show contributors to the patch
series, but it is not as useful to show the patch series as a whole: the
splitting of patches and their ordering.

I will review each of the patches individually, but now I would like to
say a few things about the series as a whole.

- [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS

  A simple and non-controversial patch, an improvement to existing code
  with the goal of helping future development (including further patches).

- [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths

  In my opinion this patch could be split into three individual pieces,
  though one might think it is not worth it.

  a. Add implementation of MurmurHash v3 (64-bit)

     Include tests based on test-tool (creating a file similar to
     t/helper/test-hash.c, or enhancing that file) that the implementation
     is correct, for example that 'The quick brown fox jumps over the lazy
     dog' with a given seed (for example the default seed of 0) hashes to
     the same value as other implementations.

  b. Add implementation of Bloom filter

     Include generic Bloom filter tests, i.e. that it correctly answers
     "yes" and "maybe" (create a filter, save it or print it, then use the
     stored filter), and tests specific to our implementation, namely that
     the size of the filter behaves as it should.

  c. Bloom filter implementation for changed paths

     Here include tests that use 'test-tool bloom get_filter_for_commit',
     that the filter for a commit with no changes and for a commit with
     more than 512 files changed works correctly, that directories are
     added along the paths, etc.

- [PATCH v2 03/11] diff: halt tree-diff early after max_changes

  I think keeping this patch as a separate step makes individual commits
  easier to understand and review.

- [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths

  Here we compute Bloom filters for changed paths for each commit in the
  commit-graph file, without writing them to the file; as a side effect we
  calculate the total Bloom filter data size. This doesn't make much sense
  as a standalone patch, but it is a nice, easy-to-understand incremental
  step in building the feature.

- [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
- [PATCH v2 06/11] commit-graph: examine commits by generation number

  Those two are performance improvements of the previous step. It is good
  to keep them as separate commits; it makes them easier to understand
  (and easier to catch errors via git-bisect, if there were any).

- [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file

  This commit includes the documentation of the two new chunks of the
  commit-graph file format.

  I wonder if the 9th patch in this series, namely

    commit-graph: add --changed-paths option to write subcommand

  should not precede this commit. Otherwise we have this new code but no
  way of testing it. On the other hand it makes it easier to review. On
  the gripping hand, you can't really test that writing works without the
  ability to parse Bloom filter data out of the commit-graph file... which
  is the next commit.

- [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write

  This implements reading Bloom filter data from the commit-graph file.

  Is it a good split? I think it makes it easier to review the single
  patch, but it also makes them less standalone.

- [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand

  One thing we could test here is that we are writing two new chunks to
  the commit-graph file (and perhaps checking that they are correctly
  formatted, and have the correct shape).

- [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based
  revision walks

  This is quite a big and involved patch, which in my opinion could be
  split in two or three parts:

  a. Add a bare bones implementation, like in v2

     This limits the amount of testing we can do; the only thing we can
     really test is that we get the same results with and without Bloom
     filters.

  b.1. Add trace2 Bloom filter statistics
  b.2. Use said trace2 statistics to test the use of Bloom filters

- [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
  test flag

  This one is for (optional) exhaustive testing of the feature.

Feel free to disagree with those ideas.

Best,
On 2/7/2020 10:09 AM, Garima Singh wrote:
>
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>> * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>> https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
>
> Thank you so much for running this pipeline and pointing out the error!
>
> We will carefully review our interactions with the binary data and
> hopefully solve this in the next version.
>
> Cheers!
> Garima Singh
>

Hey!

The patch below carries the fix for the failure on big-endian architectures.
We now treat Bloom filter data as a simple binary stream of 1-byte words
instead of 8-byte words, which avoids the big-endian vs. little-endian
confusion across CPU architectures.

Here is the successful run of SZEDER's Travis CI s390x build:

  https://travis-ci.org/szeder/git/jobs/649044879

I will be squashing this patch into the appropriate commits in the series
in v3, which I will send out after people have had a chance to complete
their review of v2.

A special thanks to SZEDER for helping us test our patches on his CI
pipeline and saving us the overhead of setting up a big-endian machine!

Cheers!
Garima Singh

-->8--

From ee72310dd8c3ad2b810914edb651008f637e7c2a Mon Sep 17 00:00:00 2001
From: Garima Singh <garima.singh@microsoft.com>
Date: Tue, 11 Feb 2020 13:55:03 -0500
Subject: [PATCH] Process bloom filter data as 1 byte words

Process bloom filter data as 1 byte words instead of 8 byte words to
avoid the Big-endian vs Little-endian confusion on different CPU
architectures.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c               |  24 ++++-----
 bloom.h               |   4 +-
 commit-graph.c        |   4 +-
 t/helper/test-bloom.c |   4 +-
 t/t0095-bloom.sh      | 118 +++++++++++++++++++++---------------------
 5 files changed, 77 insertions(+), 77 deletions(-)

diff --git a/bloom.c b/bloom.c
index 90d84dc713..6d5d6bb2ef 100644
--- a/bloom.c
+++ b/bloom.c
@@ -45,12 +45,13 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 	int len4 = len / sizeof(uint32_t);
 
-	const uint32_t *blocks = (const uint32_t*)data;
-
 	uint32_t k;
-	for (i = 0; i < len4; i++)
-	{
-		k = blocks[i];
+	for (i = 0; i < len4; i++) {
+		uint32_t byte1 = (uint32_t)data[4*i];
+		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
 		k *= c1;
 		k = rotate_right(k, r1);
 		k *= c2;
@@ -61,8 +62,7 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 
 	tail = (data + len4 * sizeof(uint32_t));
 
-	switch (len & (sizeof(uint32_t) - 1))
-	{
+	switch (len & (sizeof(uint32_t) - 1)) {
 	case 3:
 		k1 ^= ((uint32_t)tail[2]) << 16;
 		/*-fallthrough*/
@@ -88,9 +88,9 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 	return seed;
 }
 
-static inline uint64_t get_bitmask(uint32_t pos)
+static inline unsigned char get_bitmask(uint32_t pos)
 {
-	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
+	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
 void load_bloom_filters(void)
@@ -152,8 +152,8 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 		start_index = 0;
 
 	filter->len = end_index - start_index;
-	filter->data = (uint64_t *)(g->chunk_bloom_data +
-				    sizeof(uint64_t) * start_index +
+	filter->data = (unsigned char *)(g->chunk_bloom_data +
+					 sizeof(unsigned char) * start_index +
 				    BLOOMDATA_CHUNK_HEADER_SIZE);
 
 	return 1;
@@ -234,7 +234,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	}
 
 	filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-	filter->data = xcalloc(filter->len, sizeof(uint64_t));
+	filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 	hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 		struct bloom_key key;
diff --git a/bloom.h b/bloom.h
index 76f8a9ad0c..9604723ce0 100644
--- a/bloom.h
+++ b/bloom.h
@@ -12,7 +12,7 @@ struct bloom_filter_settings {
 };
 
 #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
-#define BITS_PER_WORD 64
+#define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
 
 /*
@@ -22,7 +22,7 @@ struct bloom_filter_settings {
  * 'data'.
  */
 struct bloom_filter {
-	uint64_t *data;
+	unsigned char *data;
 	int len;
 };
 
diff --git a/commit-graph.c b/commit-graph.c
index c0e9834bf2..f5f9a23c9a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1125,7 +1125,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
 		list++;
 	}
 
@@ -1305,7 +1305,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = sorted_by_pos[i];
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
-		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..8fa2d8fc25 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
 	printf("Filter_Length:%d\n", filter->len);
 	printf("Filter_Data:");
 	for (i = 0; i < filter->len; i++){
-		printf("%"PRIx64"|", filter->data[i]);
+		printf("%02x|", filter->data[i]);
 	}
 	printf("\n");
 }
@@ -57,7 +57,7 @@ int cmd__bloom(int argc, const char **argv)
 		struct bloom_filter filter;
 		int i = 2;
 		filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-		filter.data = xcalloc(filter.len, sizeof(uint64_t));
+		filter.data = xcalloc(filter.len, sizeof(unsigned char));
 
 		if (!argv[2]){
 			die("at least one input string expected");
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 424fe4fc29..58273219ff 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -3,58 +3,11 @@
 test_description='test bloom.c'
 . ./test-lib.sh
 
-test_expect_success 'get bloom filters for commit with no changes' '
-	git init &&
-	git commit --allow-empty -m "c0" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success 'get bloom filter for commit with 10 changes' '
-	rm actual &&
-	rm expect &&
-	mkdir smallDir &&
-	for i in $(test_seq 0 9)
-	do
-		echo $i >smallDir/$i
-	done &&
-	git add smallDir &&
-	git commit -m "commit with 10 changes" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:4
-	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
-	rm actual &&
-	rm expect &&
-	mkdir bigDir &&
-	for i in $(test_seq 0 512)
-	do
-		echo $i >bigDir/$i
-	done &&
-	git add bigDir &&
-	git commit -m "commit with 513 changes" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
-	Filter_Length:1
-	Filter_Data:11000110001110|
+	Filter_Length:2
+	Filter_Data:11|11|
 	EOF
 	test-tool bloom generate_filter "" >actual &&
 	test_cmp expect actual
@@ -63,8 +16,8 @@ test_expect_success 'compute bloom key for empty string' '
 test_expect_success 'compute bloom key for whitespace' '
 	cat >expect <<-\EOF &&
 	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
-	Filter_Length:1
-	Filter_Data:401004080200810|
+	Filter_Length:2
+	Filter_Data:71|8c|
 	EOF
 	test-tool bloom generate_filter " " >actual &&
 	test_cmp expect actual
@@ -73,8 +26,8 @@ test_expect_success 'compute bloom key for whitespace' '
 test_expect_success 'compute bloom key for a root level folder' '
 	cat >expect <<-\EOF &&
 	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
-	Filter_Length:1
-	Filter_Data:aaa800000000|
+	Filter_Length:2
+	Filter_Data:a8|aa|
 	EOF
 	test-tool bloom generate_filter "A" >actual &&
 	test_cmp expect actual
@@ -83,8 +36,8 @@ test_expect_success 'compute bloom key for a root level folder' '
 test_expect_success 'compute bloom key for a root level file' '
 	cat >expect <<-\EOF &&
 	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
-	Filter_Length:1
-	Filter_Data:a8000000000000aa|
+	Filter_Length:2
+	Filter_Data:aa|a8|
 	EOF
 	test-tool bloom generate_filter "file.txt" >actual &&
 	test_cmp expect actual
@@ -93,8 +46,8 @@ test_expect_success 'compute bloom key for a root level file' '
 test_expect_success 'compute bloom key for a deep folder' '
 	cat >expect <<-\EOF &&
 	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
-	Filter_Length:1
-	Filter_Data:1c0000600003000|
+	Filter_Length:2
+	Filter_Data:c6|31|
 	EOF
 	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
 	test_cmp expect actual
@@ -103,11 +56,58 @@ test_expect_success 'compute bloom key for a deep folder' '
 test_expect_success 'compute bloom key for a deep file' '
 	cat >expect <<-\EOF &&
 	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
-	Filter_Length:1
-	Filter_Data:4020100804010080|
+	Filter_Length:2
+	Filter_Data:a9|54|
 	EOF
 	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
 	test_cmp expect actual
 '
 
+test_expect_success 'get bloom filters for commit with no changes' '
+	git init &&
+	git commit --allow-empty -m "c0" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir smallDir &&
+	for i in $(test_seq 0 9)
+	do
+		echo $i >smallDir/$i
+	done &&
+	git add smallDir &&
+	git commit -m "commit with 10 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:25
+	Filter_Data:c2|0b|b8|c0|10|88|f0|1d|c1|0c|01|a4|01|28|81|80|01|30|10|d0|92|be|88|10|8a|
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir bigDir &&
+	for i in $(test_seq 0 512)
+	do
+		echo $i >bigDir/$i
+	done &&
+	git add bigDir &&
+	git commit -m "commit with 513 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
 test_done
On 2/8/2020 6:04 PM, Jakub Narebski wrote: > "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> Hey! >> >> The commit graph feature brought in a lot of performance improvements across >> multiple commands. However, file based history continues to be a performance >> pain point, especially in large repositories. >> >> Adopting changed path Bloom filters has been discussed on the list before, >> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. >> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of >> concept in [2]. > > Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which > requires adding its implementation, instead of reusing FNV-1 hash > (Fowler–Noll–Vo hash function) used by Git hashmap implementation, see > https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109 > Beside the fact that everyone is using MurmurHash for Bloom filters ;-) > > It turns out that in various benchmark MurmurHash is faster and also > slightly better as a hash than FNV-1 or FNV-1b. > > > I wonder then if it would be a good idea (in the future) to make it easy > to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make > it the default for hashing strings. > Making Murmur3 hash the default for hashing strings is definitely outside the scope of this series. Also, if the method signatures for the murmur3 hash matched the existing hash method signatures in hashmap.c, then it would be appropriate to place them adjacently, even if no hashmap consumer uses it for hashmaps. However, we need the option to start at a custom seed to do our double hashing. A change in the future that involves adopting murmur3 in the hashmap code would involve a simple code move before creating the new methods that avoid a custom seed. So for now, it makes sense that these methods leave in bloom.c where they are being used for a very specific purpose. 
>>
>> Performance Gains: We tested the performance of git log -- path on the git
>> repo, the linux repo and some internal large repos, with a variety of paths
>> of varying depths.
>
> As I wrote in reply to previous version of this series, a good public
> repository (and thus usable by anyone) to test the Bloom filter
> performance improvements could be AOSP (Android) base:
>
> https://android.googlesource.com/platform/frameworks/base/
>
> which is a large repository with long path depths (due to Java file
> naming conventions).
>

Thank you! I will incorporate these results into the commit messages as
appropriate in v3.

>>
>> On the git and linux repos: We observed a 2x to 5x speed up.
>>
>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>
> Very nice! Good work!
>
> What is the cost of this feature, that is how long it takes to generate
> Bloom filters, and how much larger the commit-graph file gets? It would be
> nice to know.
>

The cost of writing is much better now with Peff and Dr. Stolee's improvements.
I will include these numbers as well in the commit messages as appropriate
in v3.

>>
>> Future Work (not included in the scope of this series):
>>
>> 1. Supporting multiple path based revision walk
>
> Shouldn't then tests that were added in v2 mark use of Bloom filters
> with multiple paths revision walking as _not working *yet*_
> (test_expect_failure), and not expected to not work (test_expect_success
> with test_bloom_filters_not_used)?
>

My intent is to ensure that Bloom filters are not being used in any of the
unsupported code paths. I don't have a strong preference about the test
semantics as long as I get that coverage :) So I will look into switching it
to test_expect_failure as you have suggested.
>> Derrick Stolee (2):
>>   diff: halt tree-diff early after max_changes
>>   commit-graph: examine commits by generation number
>>
>> Garima Singh (8):
>>   commit-graph: use MAX_NUM_CHUNKS
>>   bloom: core Bloom filter implementation for changed paths
>>   commit-graph: compute Bloom filters for changed paths
>>   commit-graph: write Bloom filters to commit graph file
>>   commit-graph: reuse existing Bloom filters during write.
>>   commit-graph: add --changed-paths option to write subcommand
>>   revision.c: use Bloom filters to speed up path based revision walks
>>   commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>>
>> Jeff King (1):
>>   commit-graph: examine changed-path objects in pack order
>
> The shortlog summary is a fine tool to show contributors to the patch
> series, but is not as useful to show patch series as a whole: splitting
> of patches and their ordering.
>

This is a GitGitGadget specific thing, and it is probably by design. I have
opened an issue in that repo for any follow up discussions:
https://github.com/gitgitgadget/gitgitgadget/issues/203

> - [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
>
> In my opinion this patch could be split into three individual pieces,
> though one might think it is not worth it.
>

I have gone back and forth on doing this. I like most of the core Bloom filter
computations being isolated in one patch/commit. But based on the rest of your
review, it seems like you are leaning heavily on having this split out.
So, I will take a proper stab at doing it for v3.

> - [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
>
> This commit includes the documentation of the two new chunks of
> commit-graph file format.
>
> I wonder if the 9th patch in this series, namely
>   commit-graph: add --changed-paths option to write subcommand
> should not precede this commit. Otherwise we have this new code but
> no way of testing it. On the other hand it makes it easier to
> review.
> On the gripping hand, you can't really test that writing
> works without the ability to parse Bloom filter data out of
> commit-graph file... which is the next commit.
>

Getting complete test coverage within a single patch would require 2 or 3 of
these patches to be combined. This would lead to a large patch that would be
much more difficult to review. My tests in the patches following this one run
git commands. Hence the tests get introduced when the command line is ready to
use all the new code.

The current ordering of patches works better than adding the --changed-paths
option before the logic that computes and writes. Otherwise the option would
not do what it is supposed to do in the patch it was introduced in.

> - [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write
>
> This implements reading Bloom filter data from commit-graph file.
> Is it a good split? I think it makes it easier to review the single
> patch, but it also makes them less standalone.
>

All the logic up to this point works just fine without the ability to read and
parse precomputed Bloom filters. This patch is an enhancement, and it also
separates out the reading and writing logic. Reusing existing Bloom filters
during write is the simplest interaction that involves reading from the
commit-graph file, and it builds the foundation for the `git log`
improvements. Hence, it warrants its own patch and review.

> - [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
>
> This is quite a big and involved patch, which in my opinion could be
> split in two or three parts:
>
> a. Add a bare bones implementation, like in v2
>
>    This limits the amount of testing we can do; the only thing we can really
>    test is that we get the same results with and without Bloom filters.
>
> b.1. Add trace2 Bloom filter statistics
> b.2. Use said trace2 statistics to test use of Bloom filters
>

Sure. I will look into doing this split as well for v3.
>
> Feel free to disagree with those ideas.
>
> Best,

Thanks for taking the time for reviewing this series so thoroughly!
It is greatly appreciated!

Cheers,
Garima Singh
Garima Singh <garimasigit@gmail.com> writes:

> On 2/8/2020 6:04 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> ...
> I have gone back and forth on doing this. I like most of the core Bloom filter
> computations being isolated in one patch/commit. But based on the rest of your
> review, it seems like you are leaning heavily on having this split out.
> So, I will take a proper stab at doing it for v3.
> ...
> Thanks for taking the time for reviewing this series so thoroughly!
> It is greatly appreciated!

Thanks for a great discussion. Just a friendly ping to the thread, so that
something from the discussion thread will stay on the first page of mailing
list archive's threaded view ;-)