[v3,2/4] read-cache: add index.skipHash config option

Message ID 00738c81a1212970910da6f29fe3ecef87c2ec3a.1671116820.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Optionally skip hashing index on write

Commit Message

Derrick Stolee Dec. 15, 2022, 3:06 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

This trade-off between file stability at rest and write-time performance
is not easy to balance. The index is an interesting case for a couple
of reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in the
   background (commit-graph, multi-pack-index) or are super-critical to
   correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an index
   for a long time with many staged changes. Outside of staged changes,
   the index can be completely destroyed and rewritten with minimal
   impact to the user.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file still conforms to the index format structurally. Git
operations will therefore continue to function across older versions
when the null hash is used.

The one exception is 'git fsck', which checks the hash of the index file.
This used to be a check on every index read, but was restricted to
'git fsck' in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14) and released first in Git 2.13.0. Document the versions that
relaxed these restrictions, with the optimistic expectation that this
change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.skipHash=true on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to justify letting
users opt in to this feature, despite the potential confusion with
older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt | 11 +++++++++++
 read-cache.c                   | 12 +++++++++++-
 t/t1600-index.sh               |  6 ++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason Dec. 15, 2022, 4:12 p.m. UTC | #1
On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>

> +	end = (unsigned char *)hdr + size;

Commentary: The "unsigned char *" cast has moved up here from below.
It is needed (we're casting from the struct), and it's good that we
get rid of it below.

> +	start = end - the_hash_algo->rawsz;

Okay, so here we mark the start, which is the end minus the rawsz,
but...

> +	oidread(&oid, start);
> +	if (oideq(&oid, null_oid()))
> +		return 0;
> +
>  	the_hash_algo->init_fn(&c);
>  	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
>  	the_hash_algo->final_fn(hash, &c);
> -	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
> +	if (!hasheq(hash, end - the_hash_algo->rawsz))

...here we got rid of the cast, which is good, but let's not use "end -
the_hash_algo->rawsz" here, let's use "start", which you already
computed as "end - the_hash_algo->rawsz". This is just repeating it.

I wondered if I just missed it being modified in the interim before
carefully re-reading this, but we pass your tests with:

	-       if (!hasheq(hash, end - the_hash_algo->rawsz))
	+       assert((end - the_hash_algo->rawsz) == start);
	+       if (!hasheq(hash, start))

So, we can indeed just use the simpler "start" here, and it makes it easier
to read, as we're assured that it didn't move in the interim.

> +	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);

Aside from the question of whether we use the repo_*() variant here,
which I noted in my reply to the CL, the cast is suspicious.

So, in the 1/4 we added this as *unsigned*:
	
	+	 * If set to 1, skip_hash indicates that we should
	+	 * not actually compute the hash for this hashfile and
	+	 * instead only use it as a buffered write.
	+	 */
	+	unsigned int skip_hash;

But you need the cast here since the config API can and will set the
"dest" to -1. See the "*dest == -1" test in git_configset_get_value().

So, here we're relying on a "unsigned int" cast'd to "int" correctly
doing the right thing on a "-1" assignment.

I'm not sure if that's portable or leads to undefined behavior, but in
any case, won't such a -1 value be read back as ~0 from that "unsigned
int" variable on most modern platforms?

Just bypassing that entirely and making it "int" seems better here, or
having an intermediate variable.

I also wondered if this was all my fault. In your original version you
were doing:

	int skip_hash;
	[...]
	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
		f->skip_hash = skip_hash;

And I suggested that this was redundant, and that you could just write
to "f->skip_hash" directly.

But I didn't notice it was "unsigned", and in any case your original
version had the same issue of assigning a -1 to the unsigned variable...
Patch

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..23c7985eb40 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,14 @@  index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	This accelerates Git commands that manipulate the index, such as
+	`git add`, `git commit`, or `git status`. Instead of storing the
+	checksum, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then Git clients older than 2.13.0 will
+refuse to parse the index and Git clients older than 2.40.0 will report an
+error during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..3f7de8b2e20 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@  static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2918,6 +2926,8 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..45feb0fc5d8 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,12 @@  test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	rm -f .git/index &&
+	git -c index.skipHash=true add a &&
+	git fsck
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&