[v2,11/11] bloom: enforce a minimum size of 8 bytes

Message ID	1022c0ad21379dae825ca8e47b3da8b62c0b59dc.1592934430.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=IiYM=AE=vger.kernel.org=git-owner@kernel.org> Message-Id: <1022c0ad21379dae825ca8e47b3da8b62c0b59dc.1592934430.git.gitgitgadget@gmail.com> In-Reply-To: <pull.659.v2.git.1592934430.gitgitgadget@gmail.com> References: <pull.659.git.1592252093.gitgitgadget@gmail.com> <pull.659.v2.git.1592934430.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Tue, 23 Jun 2020 17:47:10 +0000 Subject: [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de, Derrick Stolee <dstolee@microsoft.com>, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	More commit-graph/Bloom filter improvements \| expand [v2,00/11] More commit-graph/Bloom filter improvements [v2,01/11] commit-graph: place bloom_settings in context [v2,02/11] commit-graph: change test to die on parse, not load [v2,03/11] bloom: get_bloom_filter() cleanups [v2,04/11] commit-graph: persist existence of changed-paths [v2,05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions [v2,06/11] commit-graph: simplify chunk writes into loop [v2,07/11] commit-graph: check chunk sizes after writing [v2,08/11] revision.c: fix whitespace [v2,09/11] revision: empty pathspecs should not use Bloom filters [v2,10/11] commit-graph: check all leading directories in changed path Bloom filters [v2,11/11] bloom: enforce a minimum size of 8 bytes

Message ID

1022c0ad21379dae825ca8e47b3da8b62c0b59dc.1592934430.git.gitgitgadget@gmail.com (mailing list archive)

State

New, archived

Headers

Message-Id: 
 <1022c0ad21379dae825ca8e47b3da8b62c0b59dc.1592934430.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.659.v2.git.1592934430.gitgitgadget@gmail.com>
References: <pull.659.git.1592252093.gitgitgadget@gmail.com>
        <pull.659.v2.git.1592934430.gitgitgadget@gmail.com>
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Date: Tue, 23 Jun 2020 17:47:10 +0000
Subject: [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de,
        Derrick Stolee <dstolee@microsoft.com>,
        Derrick Stolee <dstolee@microsoft.com>
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

More commit-graph/Bloom filter improvements | expand

Commit Message

Jean-Noël Avila via GitGitGadget June 23, 2020, 5:47 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The original design of changed-path Bloom filters included an 8-byte
block size for filter lengths. This was changed mid-way through the
submission process, and now the length stored in the commit-graph has
one-byte granularity.

This can cause some issues for very small filters. The analysis for
false positive rates assume large filters, so rounding errors become
less important at that scale. When there are only a few paths changed,
a filter that has size only a few bytes could have very different
behavior. In fact, this is evidenced in the Git repository due to the
code organization and careful patch creation that leads to many commits
with very small filters. These small filters frequently have
false-positive rates in the 8-10% range or higher.

The previous change improved the false-positive rate using multiple
Bloom keys when the path has multiple directory components. However,
that does not help at all for files at root. It is typical to have
several commits that change only the README at root, and those commits
would be likely to have these artificially high false-positive rates.

Correct this issue by creating a minimum filters size of 8 bytes. This
requires the very small commits (with fewer than six changes, including
non-root directories) to have a larger filter. In principle, this
violates the bits_per_entry value of struct bloom_filter_settings.
However, it does not actually create a functional problem.

As for compatibility, this only affects new versions writing filters for
commits that do not yet have a filter. Old version will write the
smaller filters and this version will persist and properly read that
data. Now, the new files will be generated slightly larger.

               Bytes before   Bytes after  Difference
  --------------------------------------------------
  git             4,021,078    4,275,311   +6.32%
  linux          72,212,101   73,909,286   +2.35%
  tensorflow      7,596,359    7,691,646   +1.25%

This has a measurable improvement in the false-positive rate and the
end-to-end run time for these repos. The table below compares the average
false-positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             0.786%     0.227%    0.0387s    0.0289s -25.5%
  linux           0.0296%    0.0174%   0.0766s    0.0706s  -7.8%
  tensorflow      0.6977%    0.0268%   0.0420s    0.0384s  -8.5%

*Path selection was done with the following pipeline:

        git ls-tree -r --name-only HEAD | sort -R | head -n 5000

These relatively-small increases in file size appear to be a fair price
to pay for these performance improvements.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/bloom.c b/bloom.c
index 7291506564..e9dc15976c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -257,6 +257,10 @@  struct bloom_filter *get_bloom_filter(struct repository *r,
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+
+		if (filter->len && filter->len < 8)
+			filter->len = 8;
+
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {

[v2,11/11] bloom: enforce a minimum size of 8 bytes

Commit Message

Patch