mbox series

[v3,0/4] Optionally skip hashing index on write

Message ID pull.1439.v3.git.1671116820.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Optionally skip hashing index on write | expand

Message

Philippe Blain via GitGitGadget Dec. 15, 2022, 3:06 p.m. UTC
Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v2
================

 * Patch 2 now has additional details about why to use the config option in
   its message. The discussion about why the index is special is included in
   the commit message.


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   | 11 +++++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 15 ++++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 20 ++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 80 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v2:

 1:  b582d681581 = 1:  b582d681581 hashfile: allow skipping the hash function
 2:  c8a2fd3a470 ! 2:  00738c81a12 read-cache: add index.skipHash config option
     @@ Commit message
          critical that the sparse index was created specifically to reduce the
          size of the index to make these writes (and reads) faster.
      
     +    This trade-off between file stability at rest and write-time performance
     +    is not easy to balance. The index is an interesting case for a couple
     +    reasons:
     +
     +    1. Writes block users. Writing the index takes place in many user-
     +       blocking foreground operations. The speed improvement directly
     +       impacts their use. Other file formats are typically written in the
     +       background (commit-graph, multi-pack-index) or are super-critical to
     +       correctness (pack-files).
     +
     +    2. Index files are short lived. It is rare that a user leaves an index
     +       for a long time with many staged changes. Outside of staged changes,
     +       the index can be completely destroyed and rewritten with minimal
     +       impact to the user.
     +
          Following a similar approach to one used in the microsoft/git fork [1],
          add a new config option (index.skipHash) that allows disabling this
          hashing during the index write. The cost is that we can no longer
     @@ Documentation/config/index.txt: index.version::
      +
      +index.skipHash::
      +	When enabled, do not compute the trailing hash for the index file.
     -+	Instead, write a trailing set of bytes with value zero, indicating
     ++	This accelerates Git commands that manipulate the index, such as
     ++	`git add`, `git commit`, or `git status`. Instead of storing the
     ++	checksum, write a trailing set of bytes with value zero, indicating
      +	that the computation was skipped.
      ++
      +If you enable `index.skipHash`, then Git clients older than 2.13.0 will
 3:  813e81a0582 = 3:  86370af1351 test-lib-functions: add helper for trailing hash
 4:  e640dff53dd = 4:  6490bd445eb features: feature.manyFiles implies fast index writes

Comments

Ævar Arnfjörð Bjarmason Dec. 15, 2022, 3:56 p.m. UTC | #1
On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:

> Updates since v2
> ================
>
>  * Patch 2 now has additional details about why to use the config option in
>    its message. The discussion about why the index is special is included in
>    the commit message.

Re the comments I had on an earlier version[1] I tried this series with
these changes squashed in:
	
	1:  b582d681581 = 1:  53d289cf82c hashfile: allow skipping the hash function
	2:  00738c81a12 ! 2:  db487f3e76d read-cache: add index.skipHash config option
	    @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
	      
	      	f = hashfd(tempfile->fd, tempfile->filename.buf);
	      
	    -+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
	    ++	repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash);
	     +
	      	for (i = removed = extended = 0; i < entries; i++) {
	      		if (cache[i]->ce_flags & CE_REMOVE)
	3:  86370af1351 = 3:  35188bbcfb5 test-lib-functions: add helper for trailing hash
	4:  6490bd445eb ! 4:  4354328e8fc features: feature.manyFiles implies fast index writes
	    @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
	      
	      	f = hashfd(tempfile->fd, tempfile->filename.buf);
	      
	    --	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
	    +-	repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash);
	     +	if (istate->repo) {
	    ++		int ours, theirs;
	    ++
	     +		prepare_repo_settings(istate->repo);
	    -+		f->skip_hash = istate->repo->settings.index_skip_hash;
	    ++
	    ++		ours = istate->repo->settings.index_skip_hash;
	    ++		theirs = the_repository->settings.index_skip_hash;
	    ++
	    ++		if (ours != theirs)
	    ++			BUG("ours %d theirs %d", ours, theirs);
	    ++		f->skip_hash = ours;
	    ++	} else {
	    ++		f->skip_hash = the_repository->settings.index_skip_hash;
	    ++		/*BUG("no repo???");*/
	     +	}
	      
	      	for (i = removed = extended = 0; i < entries; i++) {

With those all of your tests still pass. Which leads to these outstanding questions:

a) If we're doing to use "istate->repo" in 4/4 why not do so in 2/4,
   neither patch discusses *that* part of the change.

   On an unrelated topic there's this[2] fix-up of yours, 2/2 still
   seems like it needs the same treatment.

b) In 2/4 we'd get the config if "istate->repo" was NULL, if you
   uncomment that BUG() in the squashed 4/4 we'll get test failures all
   over the place. Aren't these places where we'd get the config before,
   but istate->repo isn't set up properly, so with 4/4 we won't get it?

   Maybe that's desired, but again, neither commit discusses this
   change.

Now, I *think* it's a good idea to defer to the already set-up
'istate->repo', but I also know (and have some local patches to fix...)
about the cases where we don't set it up (but should), so it can't be
fully relied on.

So even if we can't produce a behavior difference, just doing e.g.:

	struct repository *r = istate->repo ? istate->repo : the_repository;

And then using:

	prepare_repo_settings(r);
        f->skip_hash = r->->settings.index_skip_hash;

Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4
after-the-fact. Isn't it better to carve that bit out of 4/4, just do
the config setup in repo-settings.c to begin with, and have 4/4 do what
its $subject says, i.e. to have "feature.manyFiles" imply this new
config?

The rest of this all looks sensible to me, sans some isolated things
I'll comment on on individual patches.

1. https://lore.kernel.org/git/221212.86v8mg4gr2.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/857d1abec4cf124e011c7f84276ce105cb5b3a96.1670866407.git.gitgitgadget@gmail.com/
Derrick Stolee Dec. 16, 2022, 1:41 p.m. UTC | #2
On 12/15/22 10:56 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> Updates since v2
>> ================
>>
>>  * Patch 2 now has additional details about why to use the config option in
>>    its message. The discussion about why the index is special is included in
>>    the commit message.


> So even if we can't produce a behavior difference, just doing e.g.:
> 
> 	struct repository *r = istate->repo ? istate->repo : the_repository;
> 
> And then using:
> 
> 	prepare_repo_settings(r);
>         f->skip_hash = r->->settings.index_skip_hash;

This, and other comments are sensible and will be reflected in v4.
 
> Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4
> after-the-fact. Isn't it better to carve that bit out of 4/4, just do
> the config setup in repo-settings.c to begin with, and have 4/4 do what
> its $subject says, i.e. to have "feature.manyFiles" imply this new
> config?

The point is to make patch 4 completely independent, and be able to
pull it out if necessary. There's no reason to load the config inside
repo-settings.c unless it's part of something like feature.manyFiles.

Thanks,
-Stolee