Message ID | 9cc31c289aa785f026eec84452ed68e80505d95e.1586566981.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Integrate changed-path Bloom filters with 'git blame' | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > The changed-path Bloom filters work only when we can compute an > explicit Bloom filter key in advance. When a pathspec is given > that allows case-insensitive checks or wildcard matching, we > must disable the Bloom filter performance checks. How often do we want to do case-insensitive path matching, I wonder. As somebody who never touches case-insensitive filesystem, I would be a bad judge, and I would imagine that I would be looking for a pathspec "[Mm]akefile" rather than ":(icase)makefile" myself if there are projects that may have either/both, so I do not mind disabling the bloom filter for case insensitivity myself. But if users may use icase pathspec very often, it may be worth considering to build the bloom filter after downcasing the paths, perhaps? Given that many projects extract their source code to a case insensitive filesystem, I would imagine that downcasing paths would map two originally different paths into the same thing only rarely, if ever, so there may not be much downside to do so. > By checking the pathspec in prepare_to_use_bloom_filters(), we > avoid setting up the Bloom filter data and thus revert to the > usual logic. > > Before this change, the following tests would fail*: > > t6004-rev-list-path-optim.sh (Tests 6-7) > t6130-pathspec-noglob.sh (Tests 3-6) > t6131-pathspec-icase.sh (Tests 3-5) > > *These tests would fail when using GIT_TEST_COMMIT_GRAPH and > GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter > environment variable was not set up correctly to write the changed- > path Bloom filters in the test suite. That will be fixed in the > next change. Thanks. This is exciting.
Hi Stolee, On Sat, Apr 11, 2020 at 01:02:59AM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The changed-path Bloom filters work only when we can compute an > explicit Bloom filter key in advance. When a pathspec is given > that allows case-insensitive checks or wildcard matching, we > must disable the Bloom filter performance checks. > > By checking the pathspec in prepare_to_use_bloom_filters(), we > avoid setting up the Bloom filter data and thus revert to the > usual logic. All makes sense to me, and this seems like the only reasonable thing *to* do in this situation. That's fine, since we're not regressing, we're just not using Bloom filters. > Before this change, the following tests would fail*: > > t6004-rev-list-path-optim.sh (Tests 6-7) > t6130-pathspec-noglob.sh (Tests 3-6) > t6131-pathspec-icase.sh (Tests 3-5) > > *These tests would fail when using GIT_TEST_COMMIT_GRAPH and > GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter > environment variable was not set up correctly to write the changed- > path Bloom filters in the test suite. That will be fixed in the > next change. Nicely done. > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > revision.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/revision.c b/revision.c > index 2b06ee739c8..e37b5b06108 100644 > --- a/revision.c > +++ b/revision.c > @@ -661,6 +661,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > if (!revs->commits) > return; > I certainly wouldn't complain about a comment here explaining these three checks, but I suppose that the rationale is only a 'git blame' away (and I guess that is faster now after this series ;-)). > + if (revs->prune_data.has_wildcard) > + return; > + if (revs->prune_data.nr > 1) > + return; > + if (revs->prune_data.magic || > + (revs->prune_data.nr && > + revs->prune_data.items[0].magic)) > + return; > + > repo_parse_commit(revs->repo, revs->commits->item); > > if (!revs->repo->objects->commit_graph) > -- > gitgitgadget Reviewed-by: Taylor Blau <me@ttaylorr.com> Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > I certainly wouldn't complain about a comment here explaining these > three checks, but I suppose that the rationale is only a 'git blame' > away (and I guess that is faster now after this series ;-)). > >> + if (revs->prune_data.has_wildcard) >> + return; >> + if (revs->prune_data.nr > 1) >> + return; >> + if (revs->prune_data.magic || >> + (revs->prune_data.nr && >> + revs->prune_data.items[0].magic)) This says "any magic", but it is overly pessimistic to disable the optimization for "literal" magic. That magic is the one that lets well written scripts to say "I have in a '$variable' that the user gave me as a pathname (not pathspec), and it may have a wildcard letter in it, but please treat it as a literal string" by prefixing ":(literal)" before that user-supplied data, so it is punishing well disciplined folks. >> + return; >> + >> repo_parse_commit(revs->repo, revs->commits->item); >> >> if (!revs->repo->objects->commit_graph) >> -- >> gitgitgadget > > Reviewed-by: Taylor Blau <me@ttaylorr.com> > > Thanks, > Taylor
On Sun, Apr 12, 2020 at 03:30:07PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > I certainly wouldn't complain about a comment here explaining these > > three checks, but I suppose that the rationale is only a 'git blame' > > away (and I guess that is faster now after this series ;-)). > > > >> + if (revs->prune_data.has_wildcard) > >> + return; > >> + if (revs->prune_data.nr > 1) > >> + return; > >> + if (revs->prune_data.magic || > >> + (revs->prune_data.nr && > >> + revs->prune_data.items[0].magic)) > > This says "any magic", but it is overly pessimistic to disable the > optimization for "literal" magic. That magic is the one that lets > well written scripts to say "I have in a '$variable' that the user > gave me as a pathname (not pathspec), and it may have a wildcard > letter in it, but please treat it as a literal string" by prefixing > ":(literal)" before that user-supplied data, so it is punishing well > disciplined folks. I hadn't thought of that, but it makes sense to me. How about something like this squashed into this patch? I moved the if-chain that Stolee introduced out to its own function, at least since they seem well-contained and related to one another. I figure that this simplifies the implementation of 'prepare_to_use_bloom_filter' by giving the reader less to think about up-front. diff --git a/revision.c b/revision.c index 534c0bf996..15bf4ccff5 100644 --- a/revision.c +++ b/revision.c @@ -654,6 +654,18 @@ static void trace2_bloom_filter_statistics_atexit(void) jw_release(&jw); } +static int has_bloom_key(struct pathspec *spec) +{ + if (spec->has_wildcard) + return 0; + if (spec->nr > 1) + return 0; + if ((spec->magic & ~PATHSPEC_LITERAL) || + (spec->nr && spec->items[0].magic & ~PATHSPEC_LITERAL)) + return 0; + return 1; +} + static void prepare_to_use_bloom_filter(struct rev_info *revs) { struct pathspec_item *pi; @@ -665,13 +677,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) if (!revs->commits) return; - if (revs->prune_data.has_wildcard) - return; - if (revs->prune_data.nr > 1) - return; - if (revs->prune_data.magic || - (revs->prune_data.nr && - revs->prune_data.items[0].magic)) + if (!has_bloom_key(&revs->prune_data)) return; repo_parse_commit(revs->repo, revs->commits->item); > >> + return; > >> + > >> repo_parse_commit(revs->repo, revs->commits->item); > >> > >> if (!revs->repo->objects->commit_graph) > >> -- > >> gitgitgadget > > > > Reviewed-by: Taylor Blau <me@ttaylorr.com> > > > > Thanks, > > Taylor Thanks, Taylor
On 4/11/2020 5:40 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Derrick Stolee <dstolee@microsoft.com> >> >> The changed-path Bloom filters work only when we can compute an >> explicit Bloom filter key in advance. When a pathspec is given >> that allows case-insensitive checks or wildcard matching, we >> must disable the Bloom filter performance checks. > > How often do we want to do case-insensitive path matching, I wonder. > > As somebody who never touches case-insensitive filesystem, I would > be a bad judge, and I would imagine that I would be looking for a > pathspec "[Mm]akefile" rather than ":(icase)makefile" myself if > there are projects that may have either/both, so I do not mind > disabling the bloom filter for case insensitivity myself. > > But if users may use icase pathspec very often, it may be worth > considering to build the bloom filter after downcasing the paths, > perhaps? Given that many projects extract their source code to a > case insensitive filesystem, I would imagine that downcasing paths > would map two originally different paths into the same thing only > rarely, if ever, so there may not be much downside to do so. This behavior could be extended later, and carefully. My initial thought was that the case check would happen on every commit. If the :(icase) check only happens at the walk tip(s), then we could compute a single Bloom key at the start. Thanks, -Stolee
On 4/12/2020 8:07 PM, Taylor Blau wrote: > On Sun, Apr 12, 2020 at 03:30:07PM -0700, Junio C Hamano wrote: >> Taylor Blau <me@ttaylorr.com> writes: >> >>> I certainly wouldn't complain about a comment here explaining these >>> three checks, but I suppose that the rationale is only a 'git blame' >>> away (and I guess that is faster now after this series ;-)). >>> >>>> + if (revs->prune_data.has_wildcard) >>>> + return; >>>> + if (revs->prune_data.nr > 1) >>>> + return; >>>> + if (revs->prune_data.magic || >>>> + (revs->prune_data.nr && >>>> + revs->prune_data.items[0].magic)) >> >> This says "any magic", but it is overly pessimistic to disable the >> optimization for "literal" magic. That magic is the one that lets >> well written scripts to say "I have in a '$variable' that the user >> gave me as a pathname (not pathspec), and it may have a wildcard >> letter in it, but please treat it as a literal string" by prefixing >> ":(literal)" before that user-supplied data, so it is punishing well >> disciplined folks. This is a good point. I'm unfamiliar with these advanced pathspec tricks. > I hadn't thought of that, but it makes sense to me. How about something > like this squashed into this patch? I moved the if-chain that Stolee > introduced out to its own function, at least since they seem > well-contained and related to one another. I figure that this simplifies > the implementation of 'prepare_to_use_bloom_filter' by giving the reader > less to think about up-front. > > diff --git a/revision.c b/revision.c > index 534c0bf996..15bf4ccff5 100644 > --- a/revision.c > +++ b/revision.c > @@ -654,6 +654,18 @@ static void trace2_bloom_filter_statistics_atexit(void) > jw_release(&jw); > } > > +static int has_bloom_key(struct pathspec *spec) > +{ > + if (spec->has_wildcard) > + return 0; > + if (spec->nr > 1) > + return 0; > + if ((spec->magic & ~PATHSPEC_LITERAL) || > + (spec->nr && spec->items[0].magic & ~PATHSPEC_LITERAL)) > + return 0; > + return 1; > +} > + Perhaps flip this on its head? +static int forbids_bloom_key(struct pathspec *spec) +{ + if (spec->has_wildcard) + return 1; + if (spec->nr > 1) + return 1; + if (spec->magic & ~PATHSPEC_LITERAL) + return 1; + if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL)) + return 1; + return 0; +} + > static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > @@ -665,13 +677,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > if (!revs->commits) > return; > > - if (revs->prune_data.has_wildcard) > - return; > - if (revs->prune_data.nr > 1) > - return; > - if (revs->prune_data.magic || > - (revs->prune_data.nr && > - revs->prune_data.items[0].magic)) > + if (!has_bloom_key(&revs->prune_data)) > return; Then this would be "if (forbids_bloom_key(&revs->prune_data))" Generally, I like pulling this stuff out as a method to isolate and label its purpose. If we wanted to allow certain :(icase) things later, then we know what to modify in order to "allow" it. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> But if users may use icase pathspec very often, it may be worth >> considering to build the bloom filter after downcasing the paths, >> perhaps? Given that many projects extract their source code to a >> case insensitive filesystem, I would imagine that downcasing paths >> would map two originally different paths into the same thing only >> rarely, if ever, so there may not be much downside to do so. > > This behavior could be extended later, and carefully. My initial > thought was that the case check would happen on every commit. If > the :(icase) check only happens at the walk tip(s), then we could > compute a single Bloom key at the start. Sorry, I am not sure what you mean. Do you mean that we notice that the user wants to match 'foo' case insensitively, and tell the logic that uses changed-path records in the graph file that commits that cannot possibly have touched any or the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not interesting? I guess that would work, but I was wondering if it is simpler without much downside if the changed-path records in the graph file are prepared on paths after they are normalized to a single case. That would lose information (e.g. you no longer can say "commits that touch the path 'foo' is interesting, but those that touch the path 'Foo' are not"), but makes the side that queries much simpler (i.e. you do not have to prepare all 8 case permutations---you only ask about 'foo'). And because the Bloom filter is used only for performance to cull commits that can never possibly match, allowing a false positive that would be discarded by actually running tree-diff anyway, the only potential downside happens when the project has too many paths that are different only in cases by increased collisions and by reducing our chances to skip running tree-diff (and never affects correctness). But this is not the "could be extended later" kind of behaviour, I am afraid. It is baked in the data stored in the graph file. It all depends on how often people want :(icase) pathspec matches in the history, I suspect. My point was that we need to declare that :(icase) won't matter in real life (hence we won't optimize our data to support that use case), before the way in which the data stored in the graph file is computed is cast in stone. Thanks.
On 4/14/2020 2:25 PM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > >>> But if users may use icase pathspec very often, it may be worth >>> considering to build the bloom filter after downcasing the paths, >>> perhaps? Given that many projects extract their source code to a >>> case insensitive filesystem, I would imagine that downcasing paths >>> would map two originally different paths into the same thing only >>> rarely, if ever, so there may not be much downside to do so. >> >> This behavior could be extended later, and carefully. My initial >> thought was that the case check would happen on every commit. If >> the :(icase) check only happens at the walk tip(s), then we could >> compute a single Bloom key at the start. > > Sorry, I am not sure what you mean. That's my fault. There are a couple of things I misunderstood here. 1. Thinking about "git blame" we would need to "collapse" a pathspec to a specific file before starting history. But blame doesn't allow ":(icase)" anyway. 2. With that context of "git blame" in my head, I was thinking (incorrectly) that "git log" would collapse the pathspec based on what file(s) match the pattern at HEAD. The tests in t6131-pathspec-icase.sh clearly show that this is wrong. In fact, if we apply the following diff to this patch, then we can get failures with the changed-path filters: diff --git a/revision.c b/revision.c index f78c636e4d..a02be25feb 100644 --- a/revision.c +++ b/revision.c @@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { + int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE; if (spec->has_wildcard) return 1; if (spec->nr > 1) return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + if (spec->magic & ~allowed_flags) return 1; - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL)) + if (spec->nr && (spec->items[0].magic & ~allowed_flags)) return 1; return 0; > Do you mean that we notice that the user wants to match 'foo' case > insensitively, and tell the logic that uses changed-path records in > the graph file that commits that cannot possibly have touched any or > the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not > interesting? > > I guess that would work, but I was wondering if it is simpler > without much downside if the changed-path records in the graph file > are prepared on paths after they are normalized to a single case. > That would lose information (e.g. you no longer can say "commits > that touch the path 'foo' is interesting, but those that touch the > path 'Foo' are not"), but makes the side that queries much simpler > (i.e. you do not have to prepare all 8 case permutations---you only > ask about 'foo'). > > And because the Bloom filter is used only for performance to cull > commits that can never possibly match, allowing a false positive > that would be discarded by actually running tree-diff anyway, the > only potential downside happens when the project has too many paths > that are different only in cases by increased collisions and by > reducing our chances to skip running tree-diff (and never affects > correctness). > > But this is not the "could be extended later" kind of behaviour, I > am afraid. It is baked in the data stored in the graph file. Since the feature is not released, we still have time to update the format if we so desired. With the current format, we would need to disable the filters when using an :(icase) pathspec as the current patch does. I'm not against the idea. Logically, collapsing case before hashing the Bloom keys should not increase the probabilities of false positives except in the situations where we have case conflicts. There is a small cost in the pre-hashing step to change the case of the paths, but that should be much lower than the cost of the hash itself and the tree parsing to find the changed paths. > It all depends on how often people want :(icase) pathspec matches in > the history, I suspect. My point was that we need to declare that > :(icase) won't matter in real life (hence we won't optimize our data > to support that use case), before the way in which the data stored > in the graph file is computed is cast in stone. My earlier statement can be summarized as "we could make this happen" and you ask here "is it worth doing?" I will play around with how complicated the change would be while the community considers the "is it worth doing?" question. Thanks, -Stolee
On 4/15/2020 9:27 AM, Derrick Stolee wrote: > > I'm not against the idea. Logically, collapsing case before hashing > the Bloom keys should not increase the probabilities of false > positives except in the situations where we have case conflicts. > There is a small cost in the pre-hashing step to change the case of > the paths, but that should be much lower than the cost of the hash > itself and the tree parsing to find the changed paths. > >> It all depends on how often people want :(icase) pathspec matches in >> the history, I suspect. My point was that we need to declare that >> :(icase) won't matter in real life (hence we won't optimize our data >> to support that use case), before the way in which the data stored >> in the graph file is computed is cast in stone. > > My earlier statement can be summarized as "we could make this happen" > and you ask here "is it worth doing?" > > I will play around with how complicated the change would be while > the community considers the "is it worth doing?" question. I played a bit, and it wasn't terrible to cover everything by adjusting fill_bloom_key(). I also added a test that we will want anyways. There is more work to be done if this is the direction we want to go, including double-checking that this doesn't cause significant performance degradation. The point of me sending this patch is to make the proposed direction very clear how it would work. I'm still not sure how I feel about it. Thanks, -Stolee -->8-- From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 From: Derrick Stolee <dstolee@microsoft.com> Date: Wed, 15 Apr 2020 18:04:25 +0000 Subject: [PATCH] bloom: compute all Bloom hashes from lowercase The changed-path Bloom filters currently hash path strings using the exact string for the path. This makes it difficult* to use the filters when restricting to case-insensitive pathspecs. * I say "difficult" because it is possible to generate all 2^n options for the case of a path and test them all, but this is a bad idea and should not be done. "Impossible" is an appropriate alternative. THIS IS A BREAKING CHANGE. Commit-graph files with changed-path Bloom filters computed by a previous commit will not be compatible with the filters computed in this commit, nor will we get correct results when testing across these incompatible versions. Normally, this would be a completely unacceptable change, but the filters have not been released and hence are still possible to update before release. TODO: If we decide to move in this direction, then the following steps should be done (and some of them should be done anyway): * We need to document the Bloom filter format to specify exactly how we compute the filter data. The details should be careful enough that someone can reproduce the exact file format without looking at the C code. * That document would include the tolower() transformation that is being done here. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bloom.c | 16 ++++++++++++++-- revision.c | 5 +++-- t/t6131-pathspec-icase.sh | 18 ++++++++++++++++++ 3 files changed, 35 insertions(+), 4 deletions(-) diff --git a/bloom.c b/bloom.c index dd9bab9bbd..c919c60f12 100644 --- a/bloom.c +++ b/bloom.c @@ -130,8 +130,20 @@ void fill_bloom_key(const char *data, int i; const uint32_t seed0 = 0x293ae76f; const uint32_t seed1 = 0x7e646e2c; - const uint32_t hash0 = murmur3_seeded(seed0, data, len); - const uint32_t hash1 = murmur3_seeded(seed1, data, len); + uint32_t hash0, hash1; + static struct strbuf icase_data = STRBUF_INIT; + char *cur; + + strbuf_setlen(&icase_data, 0); + strbuf_addstr(&icase_data, data); + + for (cur = icase_data.buf; cur && *cur; cur++) { + if (isupper(*cur)) + *cur = tolower(*cur); + } + + hash0 = murmur3_seeded(seed0, icase_data.buf, len); + hash1 = murmur3_seeded(seed1, icase_data.buf, len); key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); for (i = 0; i < settings->num_hashes; i++) diff --git a/revision.c b/revision.c index f78c636e4d..a02be25feb 100644 --- a/revision.c +++ b/revision.c @@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { + int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE; if (spec->has_wildcard) return 1; if (spec->nr > 1) return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + if (spec->magic & ~allowed_flags) return 1; - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL)) + if (spec->nr && (spec->items[0].magic & ~allowed_flags)) return 1; return 0; diff --git a/t/t6131-pathspec-icase.sh b/t/t6131-pathspec-icase.sh index 39fc3f6769..ee0766bd91 100755 --- a/t/t6131-pathspec-icase.sh +++ b/t/t6131-pathspec-icase.sh @@ -106,4 +106,22 @@ test_expect_success '"git diff" can take magic :(icase) pathspec' ' test_cmp expect actual ' +test_expect_success '"git log" takes the magic :(icase) pathspec' ' + cat >expect <<-\EOF && + FOO/BAR + FOO/bAr + FOO/bar + fOo/BAR + fOo/bAr + fOo/bar + foo/BAR + foo/bAr + foo/bar + EOF + git log --pretty=%s HEAD -- ":(icase)foo/bar" >actual && + test_cmp expect actual && + git log --pretty=%s HEAD -- ":(icase)foo" >actual && + test_cmp expect actual +' + test_done
Derrick Stolee <stolee@gmail.com> writes: > diff --git a/bloom.c b/bloom.c > index dd9bab9bbd..c919c60f12 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -130,8 +130,20 @@ void fill_bloom_key(const char *data, > int i; > const uint32_t seed0 = 0x293ae76f; > const uint32_t seed1 = 0x7e646e2c; > - const uint32_t hash0 = murmur3_seeded(seed0, data, len); > - const uint32_t hash1 = murmur3_seeded(seed1, data, len); > + uint32_t hash0, hash1; > + static struct strbuf icase_data = STRBUF_INIT; > + char *cur; > + > + strbuf_setlen(&icase_data, 0); > + strbuf_addstr(&icase_data, data); Perhaps strbuf_reset(&icase_data); strbuf_add(&icase_data, data, len); Or do we know bloom keys are always NUL-terminated string? I am not sure how reusable bloom.c::fill_bloom_key() is designed to be. If it is for the sole use of the changed-paths, then it is OK to assume that our data is NUL-terminated string and that keys wants to be always computed after downcasing (assuming that icase search is something we want to optimize for, which I find is a bit iffy). If not, obviously we would want these two things done on the caller's side (or perhaps a new helper function whose callers can make these assumption, fill_bloom_path(), may want to be inserted between fill_bloom_key() and its callers). > + for (cur = icase_data.buf; cur && *cur; cur++) { > + if (isupper(*cur)) > + *cur = tolower(*cur); > + } > + > + hash0 = murmur3_seeded(seed0, icase_data.buf, len); > + hash1 = murmur3_seeded(seed1, icase_data.buf, len); > > key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); > for (i = 0; i < settings->num_hashes; i++) In any case, the update to the above function seems fairly isolated. Nicely done. > diff --git a/revision.c b/revision.c > index f78c636e4d..a02be25feb 100644 > --- a/revision.c > +++ b/revision.c > @@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void) > > static int forbid_bloom_filters(struct pathspec *spec) > { > + int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE; > if (spec->has_wildcard) > return 1; > if (spec->nr > 1) > return 1; > - if (spec->magic & ~PATHSPEC_LITERAL) > + if (spec->magic & ~allowed_flags) > return 1; > - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL)) > + if (spec->nr && (spec->items[0].magic & ~allowed_flags)) > return 1; > > return 0; And thanks to the refactoring done to invent this helper function, the side that uses the Bloom data is quite straight-forward. As you are, I am on the fence. I do not think :(icase) pathspec is something we want to optimize for, but I still like this new hash function primarily because I suspect that it will increase the number of paths that you can cram into the filter without getting their hashes collided (hence getting false positive), under the assumption that real projects won't try to store too many pair of paths that are only different in their case, and if that is the case, it would help performance. So if we were to benchmark, we'd also pay attention to that side, in addition to the obvious downside of having to allocate and downcase. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > As you are, I am on the fence. > > I do not think :(icase) pathspec is something we want to optimize > for, but I still like this new hash function primarily because I > suspect that it will increase the number of paths that you can cram > into the filter without getting their hashes collided (hence getting > false positive), under the assumption that real projects won't try > to store too many pair of paths that are only different in their > case... Sorry, but no, I do not think there is such upside. It may have effects on the actual hash values to downcase paths that are originally camelCased, but reducing the entropy of input paths that way shouldn't have effect on the overall distribution and rate of collision in any meaningful way (otherwise the chosen underlying hash function would be broken). So, sorry for the noise.
Derrick Stolee <stolee@gmail.com> writes: > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path > Bloom filters computed by a previous commit will not be compatible > with the filters computed in this commit, nor will we get correct > results when testing across these incompatible versions. Normally, > this would be a completely unacceptable change, but the filters > have not been released and hence are still possible to update > before release. Sure, it hasn't even hit 'next' yet. But I think we are both sort-of leaning towards concluding that it does not help all that much. So I think it is OK. > TODO: If we decide to move in this direction, then the following > steps should be done (and some of them should be done anyway): Even if we decide not to do this "downcase before hashing" thing, we should document how exactly we compute, I think. And if we decide do change our mind later, it is not the end of the world. We should be able to use a different chunk type to store the filters computed over downcased paths. > * We need to document the Bloom filter format to specify exactly > how we compute the filter data. The details should be careful > enough that someone can reproduce the exact file format without > looking at the C code. > > * That document would include the tolower() transformation that is > being done here. As the tree-diff comparison done internally while running "git blame" does not take an end-user specified pathspec in any meaningful way, this does not matter in practice, but there is another possible complication we would want to consider when we extend the support to "git log" and friends---negative pathspecs (e.g. "git log ':(exclude)COPYING'"). A commit that cannot possibly have touched the COPYING file would be eligible for output without actually running tree-diff between it and its parent (as long as the trees of the two commits are different, we know it must be shown). Thanks.
On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote: [...] > -->8-- > From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 > From: Derrick Stolee <dstolee@microsoft.com> > Date: Wed, 15 Apr 2020 18:04:25 +0000 > Subject: [PATCH] bloom: compute all Bloom hashes from lowercase > > The changed-path Bloom filters currently hash path strings using > the exact string for the path. This makes it difficult* to use the > filters when restricting to case-insensitive pathspecs. > > * I say "difficult" because it is possible to generate all 2^n > options for the case of a path and test them all, but this is > a bad idea and should not be done. "Impossible" is an appropriate > alternative. > > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path > Bloom filters computed by a previous commit will not be compatible > with the filters computed in this commit, nor will we get correct > results when testing across these incompatible versions. Normally, > this would be a completely unacceptable change, but the filters > have not been released and hence are still possible to update > before release. > > TODO: If we decide to move in this direction, then the following > steps should be done (and some of them should be done anyway): > > * We need to document the Bloom filter format to specify exactly > how we compute the filter data. The details should be careful > enough that someone can reproduce the exact file format without > looking at the C code. > > * That document would include the tolower() transformation that is > being done here. Why not modify the BDAT chunk to include version of case folding transformation or other collation algorithm (other transformation).that is done prior to computing the Bloom filter key? Though that might be unnecessary flexibility... For example the value of 0x00 in such field of BDAT chunk header would mean no transformation, while the value of 0x01 would mean per-character tolower() or Unicode equivalent of it. Best,
On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote: > On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote: > [...] > > -->8-- > > From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 > > From: Derrick Stolee <dstolee@microsoft.com> > > Date: Wed, 15 Apr 2020 18:04:25 +0000 > > Subject: [PATCH] bloom: compute all Bloom hashes from lowercase > > > > The changed-path Bloom filters currently hash path strings using > > the exact string for the path. This makes it difficult* to use the > > filters when restricting to case-insensitive pathspecs. > > > > * I say "difficult" because it is possible to generate all 2^n > > options for the case of a path and test them all, but this is > > a bad idea and should not be done. "Impossible" is an appropriate > > alternative. > > > > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path > > Bloom filters computed by a previous commit will not be compatible > > with the filters computed in this commit, nor will we get correct > > results when testing across these incompatible versions. Normally, > > this would be a completely unacceptable change, but the filters > > have not been released and hence are still possible to update > > before release. > > > > TODO: If we decide to move in this direction, then the following > > steps should be done (and some of them should be done anyway): > > > > * We need to document the Bloom filter format to specify exactly > > how we compute the filter data. The details should be careful > > enough that someone can reproduce the exact file format without > > looking at the C code. > > > > * That document would include the tolower() transformation that is > > being done here. > > Why not modify the BDAT chunk to include version of > case folding transformation or other collation algorithm > (other transformation).that is done prior to computing > the Bloom filter key? Though that might be unnecessary > flexibility... If this ends up being something that we want to do, I agree with Stolee's reasoning that this should be a breaking change. If we were, say, several months into having Bloom filters in a release and decided at that point to make the change, then: sure, supporting both by writing a bit in the BDAT chunk makes sense. But, we're many months away from that state yet, and so I don't think the cost of rebuilding what few commit-graphs exist with bloom filters in them today to support both ordinary and lower-cased paths in the filter. Anyway, I'm still not sold on this idea in general (nor do I understand it that others are), so I'll respond in more detail in another part of the thread... > For example the value of 0x00 in such field of BDAT > chunk header would mean no transformation, while > the value of 0x01 would mean per-character tolower() > or Unicode equivalent of it. > > Best, > -- > Jakub Narębski Thanks, Taylor
On Wed, Apr 15, 2020 at 02:25:48PM -0700, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > > > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path > > Bloom filters computed by a previous commit will not be compatible > > with the filters computed in this commit, nor will we get correct > > results when testing across these incompatible versions. Normally, > > this would be a completely unacceptable change, but the filters > > have not been released and hence are still possible to update > > before release. > > Sure, it hasn't even hit 'next' yet. > > But I think we are both sort-of leaning towards concluding that it > does not help all that much. So I think it is OK. Yeah, I think that the only thing that it is potentially helping is the :(icase) magic. In a repository that doesn't have colliding paths in case-insensitive filesystems (i.e., having both 'foo' and 'FOO' in the same tree), this doesn't seem to be obviously hurting anything. But, it is at least semi-harmful to repositories that do have such trees, since we now can't answer "probably yes" or "definitely not" for colliding paths unless both produce the same fingerprint. That seems like a clear downside. Now, how common is that against people who would benefit from changed-path Bloom filters in their commit-graphs? I don't know one way or the other. But, the upside seems to be minimal compared to the potential cost, so I think that it may be better to just leave this one alone. > > TODO: If we decide to move in this direction, then the following > > steps should be done (and some of them should be done anyway): > > Even if we decide not to do this "downcase before hashing" thing, we > should document how exactly we compute, I think. > > And if we decide do change our mind later, it is not the end of the > world. We should be able to use a different chunk type to store the > filters computed over downcased paths. > > > * We need to document the Bloom filter format to specify exactly > > how we compute the filter data. The details should be careful > > enough that someone can reproduce the exact file format without > > looking at the C code. > > > > * That document would include the tolower() transformation that is > > being done here. > > As the tree-diff comparison done internally while running "git > blame" does not take an end-user specified pathspec in any > meaningful way, this does not matter in practice, but there is > another possible complication we would want to consider when we > extend the support to "git log" and friends---negative pathspecs > (e.g. "git log ':(exclude)COPYING'"). A commit that cannot possibly > have touched the COPYING file would be eligible for output without > actually running tree-diff between it and its parent (as long as the > trees of the two commits are different, we know it must be shown). > > Thanks. Thanks, Taylor
On 4/15/2020 8:52 PM, Taylor Blau wrote: > On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote: >> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote: >> [...] >>> -->8-- >>> From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 >>> From: Derrick Stolee <dstolee@microsoft.com> >>> Date: Wed, 15 Apr 2020 18:04:25 +0000 >>> Subject: [PATCH] bloom: compute all Bloom hashes from lowercase >>> >>> The changed-path Bloom filters currently hash path strings using >>> the exact string for the path. This makes it difficult* to use the >>> filters when restricting to case-insensitive pathspecs. >>> >>> * I say "difficult" because it is possible to generate all 2^n >>> options for the case of a path and test them all, but this is >>> a bad idea and should not be done. "Impossible" is an appropriate >>> alternative. >>> >>> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path >>> Bloom filters computed by a previous commit will not be compatible >>> with the filters computed in this commit, nor will we get correct >>> results when testing across these incompatible versions. Normally, >>> this would be a completely unacceptable change, but the filters >>> have not been released and hence are still possible to update >>> before release. >>> >>> TODO: If we decide to move in this direction, then the following >>> steps should be done (and some of them should be done anyway): >>> >>> * We need to document the Bloom filter format to specify exactly >>> how we compute the filter data. The details should be careful >>> enough that someone can reproduce the exact file format without >>> looking at the C code. >>> >>> * That document would include the tolower() transformation that is >>> being done here. >> >> Why not modify the BDAT chunk to include version of >> case folding transformation or other collation algorithm >> (other transformation).that is done prior to computing >> the Bloom filter key? Though that might be unnecessary >> flexibility... > > If this ends up being something that we want to do, I agree with > Stolee's reasoning that this should be a breaking change. If we were, > say, several months into having Bloom filters in a release and decided > at that point to make the change, then: sure, supporting both by writing > a bit in the BDAT chunk makes sense. > > But, we're many months away from that state yet, and so I don't think > the cost of rebuilding what few commit-graphs exist with bloom filters > in them today to support both ordinary and lower-cased paths in the > filter. > > Anyway, I'm still not sold on this idea in general (nor do I understand > it that others are), so I'll respond in more detail in another part of > the thread... I agree that this is not a good direction to go. I created the patch because I was curious how difficult it would be, and it is good to have a record of the possible direction. However, it complicates the file format and will have unpredictable effects on the entropy (or on the performance of history for case-colliding paths). It is good that we have the capability to extend the filter data in the future if we really need to. I'll make a TODO item for myself to try writing that detailed Bloom filter format documentation as a follow-up. In the meantime, I'll try to close this out by responding to the feedback we have so far. Thanks, -Stolee
On Thu, Apr 16, 2020 at 09:26:49AM -0400, Derrick Stolee wrote: > On 4/15/2020 8:52 PM, Taylor Blau wrote: > > On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote: > >> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@gmail.com> wrote: > >> [...] > >>> -->8-- > >>> From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001 > >>> From: Derrick Stolee <dstolee@microsoft.com> > >>> Date: Wed, 15 Apr 2020 18:04:25 +0000 > >>> Subject: [PATCH] bloom: compute all Bloom hashes from lowercase > >>> > >>> The changed-path Bloom filters currently hash path strings using > >>> the exact string for the path. This makes it difficult* to use the > >>> filters when restricting to case-insensitive pathspecs. > >>> > >>> * I say "difficult" because it is possible to generate all 2^n > >>> options for the case of a path and test them all, but this is > >>> a bad idea and should not be done. "Impossible" is an appropriate > >>> alternative. > >>> > >>> THIS IS A BREAKING CHANGE. Commit-graph files with changed-path > >>> Bloom filters computed by a previous commit will not be compatible > >>> with the filters computed in this commit, nor will we get correct > >>> results when testing across these incompatible versions. Normally, > >>> this would be a completely unacceptable change, but the filters > >>> have not been released and hence are still possible to update > >>> before release. > >>> > >>> TODO: If we decide to move in this direction, then the following > >>> steps should be done (and some of them should be done anyway): > >>> > >>> * We need to document the Bloom filter format to specify exactly > >>> how we compute the filter data. The details should be careful > >>> enough that someone can reproduce the exact file format without > >>> looking at the C code. > >>> > >>> * That document would include the tolower() transformation that is > >>> being done here. > >> > >> Why not modify the BDAT chunk to include version of > >> case folding transformation or other collation algorithm > >> (other transformation).that is done prior to computing > >> the Bloom filter key? Though that might be unnecessary > >> flexibility... > > > > If this ends up being something that we want to do, I agree with > > Stolee's reasoning that this should be a breaking change. If we were, > > say, several months into having Bloom filters in a release and decided > > at that point to make the change, then: sure, supporting both by writing > > a bit in the BDAT chunk makes sense. > > > > But, we're many months away from that state yet, and so I don't think > > the cost of rebuilding what few commit-graphs exist with bloom filters > > in them today to support both ordinary and lower-cased paths in the > > filter. > > > > Anyway, I'm still not sold on this idea in general (nor do I understand > > it that others are), so I'll respond in more detail in another part of > > the thread... > > I agree that this is not a good direction to go. I created the patch > because I was curious how difficult it would be, and it is good to have > a record of the possible direction. However, it complicates the file > format and will have unpredictable effects on the entropy (or on the > performance of history for case-colliding paths). > > It is good that we have the capability to extend the filter data in > the future if we really need to. > > I'll make a TODO item for myself to try writing that detailed Bloom > filter format documentation as a follow-up. In the meantime, I'll try > to close this out by responding to the feedback we have so far. Sounds good, and thanks for investigating. > Thanks, > -Stolee Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: >> It is good that we have the capability to extend the filter data in >> the future if we really need to. >> >> I'll make a TODO item for myself to try writing that detailed Bloom >> filter format documentation as a follow-up. In the meantime, I'll try >> to close this out by responding to the feedback we have so far. > > Sounds good, and thanks for investigating. Yeah, thanks, all.
diff --git a/revision.c b/revision.c index 2b06ee739c8..e37b5b06108 100644 --- a/revision.c +++ b/revision.c @@ -661,6 +661,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) if (!revs->commits) return; + if (revs->prune_data.has_wildcard) + return; + if (revs->prune_data.nr > 1) + return; + if (revs->prune_data.magic || + (revs->prune_data.nr && + revs->prune_data.items[0].magic)) + return; + repo_parse_commit(revs->repo, revs->commits->item); if (!revs->repo->objects->commit_graph)