[4/4] features: feature.manyFiles implies fast index writes

Message ID 77bf5d5ff27729a39ac00d52af3c09610d733b14.1670433958.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Optionally skip hashing index on write

Commit Message

Derrick Stolee Dec. 7, 2022, 5:25 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so enable this config option in two cases where users in that
scenario may opt in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate that an explicit
index.skipHash=false overrides feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  3 +++
 read-cache.c                     |  7 ++++---
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 13 ++++++++++++-
 6 files changed, 23 insertions(+), 4 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 10:30 p.m. UTC | #1
On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> diff --git a/read-cache.c b/read-cache.c
> index fb4d6fb6387..1844953fba7 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>  	int ieot_entries = 1;
>  	struct index_entry_offset_table *ieot = NULL;
>  	int nr, nr_threads;
> -	int skip_hash;
>  
>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>  
> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
> -		f->skip_hash = skip_hash;
> +	if (istate->repo) {
> +		prepare_repo_settings(istate->repo);
> +		f->skip_hash = istate->repo->settings.index_skip_hash;
> +	}

Urm, are we ever going to find ourselves in a situation where:

 * We have read the settings for the_repository
 * We have an index we're about to write out as our "main index", but
   the istate->repo *isn't* the_repository.
 * Even then, wouldn't the two copies of the repos have read the same
   repo settings?

But maybe there's a really obvious submodule / worktree / whatever edge
case I'm missing.

But if not, shouldn't we just always read/write this from
the_repository?

> +		rm -f .git/index &&
> +		git -c feature.manyFiles=true \
> +		    -c index.skipHash=false add a &&
> +		test_trailing_hash .git/index >hash &&
> +		! test_cmp expect hash

We had a parallel thread where we discussed "! test_cmp" being an
anti-pattern, i.e. you want them not to be the same, but you want it to
still show a diff. Maybe just "! cmp"?

I.e. either the diff will be meaningless, or we really should be
asserting the actual value we want, not what it shouldn't be.

so in this case, shouldn't we assert that it's the 0000... value, or the
actual hash (depending on which way around we're testing this)?

Derrick Stolee Dec. 12, 2022, 2:18 p.m. UTC | #2
On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> diff --git a/read-cache.c b/read-cache.c
>> index fb4d6fb6387..1844953fba7 100644
>> --- a/read-cache.c
>> +++ b/read-cache.c
>> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>>  	int ieot_entries = 1;
>>  	struct index_entry_offset_table *ieot = NULL;
>>  	int nr, nr_threads;
>> -	int skip_hash;
>>  
>>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>>  
>> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
>> -		f->skip_hash = skip_hash;
>> +	if (istate->repo) {
>> +		prepare_repo_settings(istate->repo);
>> +		f->skip_hash = istate->repo->settings.index_skip_hash;
>> +	}
> 
> Urm, are we ever going to find ourselves in a situation where:
> 
>  * We have read the settings for the_repository
>  * We have an index we're about to write out as our "main index", but
>    the istate->repo *isn't* the_repository.
>  * Even then, wouldn't the two copies of the repos have read the same
>    repo settings?
> 
> But maybe there's a really obvious submodule / worktree / whatever edge
> case I'm missing.
> 
> But if not, shouldn't we just always read/write this from
> the_repository?

I don't understand your concern. We call prepare_repo_settings(istate->repo)
just before using these settings, so we are using whatever repository-local
config we have available to us.

If you're thinking that we could be writing an index but istate->repo is
somehow the "wrong" repo, then that is a larger problem. This patch is
doing the best thing it can with the information it is given.
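
For illustration only, the lookup order that the repo-settings.c hunk in
this patch relies on can be modeled as a small standalone sketch: start
from a default of 0, let feature.manyFiles raise that default, then let
an explicitly set index.skipHash win. The names below (model_config,
resolve_skip_hash, the cfg_bool enum) are hypothetical and are not Git's
actual code; they only mirror the shape of the hunk.

```c
/* Tri-state for a boolean config key: unset, or explicitly false/true. */
enum cfg_bool { CFG_UNSET = -1, CFG_FALSE = 0, CFG_TRUE = 1 };

/* Hypothetical stand-in for the two config keys relevant to this patch. */
struct model_config {
	enum cfg_bool feature_manyfiles; /* feature.manyFiles */
	enum cfg_bool index_skiphash;    /* index.skipHash */
};

/*
 * Resolve the effective skip-hash value:
 *  1. the default is 0 (compute the trailing checksum),
 *  2. feature.manyFiles=true changes the default to 1,
 *  3. an explicitly set index.skipHash overrides either default.
 */
static int resolve_skip_hash(const struct model_config *cfg)
{
	int skip_hash = 0;

	if (cfg->feature_manyfiles == CFG_TRUE)
		skip_hash = 1;
	if (cfg->index_skiphash != CFG_UNSET)
		skip_hash = (cfg->index_skiphash == CFG_TRUE);
	return skip_hash;
}
```

Under this model, feature.manyFiles=true alone yields 1, while
feature.manyFiles=true combined with index.skipHash=false yields 0.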

>> +		rm -f .git/index &&
>> +		git -c feature.manyFiles=true \
>> +		    -c index.skipHash=false add a &&
>> +		test_trailing_hash .git/index >hash &&
>> +		! test_cmp expect hash
> 
> We had a parallel thread where we discussed "! test_cmp" being an
> anti-pattern, i.e. you want them not to be the same, but you want it to
> still show a diff. Maybe just "! cmp"?

I couldn't tell from this sentence whether test_cmp or cmp would show
the diff, but from testing I see that test_cmp shows the diff (for
debugging purposes, I'm sure) while cmp shows the position of the first
difference.

"! cmp" would work here, since we don't care about what the real hash is.
 
> I.e. either the diff will be meaningless, or we really should be
> asserting the actual value we want, not what it shouldn't be.
> 
> so in this case, shouldn't we assert that it's the 0000... value, or the
> actual hash (depending on which way around we're testing this)?

When it should be the null hash, we assert that it is that value.

When it isn't, we do not assert the exact hash because we do not want
other modifications to the index (or surrounding tests) to cause that
hash to change, causing toil for future contributors. "! cmp" suffices
for this case to show that the config inheritance is working correctly.
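
As a rough sketch of what the two assertions amount to, the check
"is the trailing hash the null hash?" can be written as a standalone
helper over an in-memory copy of the index file. The helper name
trailing_hash_is_null is hypothetical and is not part of Git or its
test suite; hash_len is assumed to be 20 for SHA-1 or 32 for SHA-256.

```c
#include <stddef.h>

/*
 * Return 1 if the last hash_len bytes of buf are all zero (null hash),
 * 0 if any of them is non-zero (some real hash was written), and -1 if
 * buf is too short to contain a trailing hash at all.
 */
static int trailing_hash_is_null(const unsigned char *buf, size_t len,
				 size_t hash_len)
{
	size_t i;

	if (len < hash_len)
		return -1;
	for (i = len - hash_len; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}
```

The skipHash=true case corresponds to expecting a result of 1; the
override case only needs a result of 0, without pinning the exact hash.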

Thanks,
-Stolee

Ævar Arnfjörð Bjarmason Dec. 12, 2022, 6:27 p.m. UTC | #3
On Mon, Dec 12 2022, Derrick Stolee wrote:

> On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <derrickstolee@github.com>
>>> [...]
>>> diff --git a/read-cache.c b/read-cache.c
>>> index fb4d6fb6387..1844953fba7 100644
>>> --- a/read-cache.c
>>> +++ b/read-cache.c
>>> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>>>  	int ieot_entries = 1;
>>>  	struct index_entry_offset_table *ieot = NULL;
>>>  	int nr, nr_threads;
>>> -	int skip_hash;
>>>  
>>>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>>>  
>>> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
>>> -		f->skip_hash = skip_hash;
>>> +	if (istate->repo) {
>>> +		prepare_repo_settings(istate->repo);
>>> +		f->skip_hash = istate->repo->settings.index_skip_hash;
>>> +	}
>> 
>> Urm, are we ever going to find ourselves in a situation where:
>> 
>>  * We have read the settings for the_repository
>>  * We have an index we're about to write out as our "main index", but
>>    the istate->repo *isn't* the_repository.
>>  * Even then, wouldn't the two copies of the repos have read the same
>>    repo settings?
>> 
>> But maybe there's a really obvious submodule / worktree / whatever edge
>> case I'm missing.
>> 
>> But if not, shouldn't we just always read/write this from
>> the_repository?
>
> I don't understand your concern. We call prepare_repo_settings(istate->repo)
> just before using these settings, so we are using whatever repository-local
> config we have available to us.
>
> If you're thinking that we could be writing an index but istate->repo is
> somehow the "wrong" repo, then that is a larger problem. This patch is
> doing the best thing it can with the information it is given.

It's not a concern, just confusion :)

In the preceding step (and this is still the case in your v2) we used
git_config_get_maybe_bool(), if we meant to use istate->repo shouldn't
we have used repo_config_get_maybe_bool() to begin with?

And will we ever get !istate->repo? If not, should we BUG() here?
Otherwise 4/4 changes this to a state where we'll no longer read the
index.skipHash setting if that "repo" is NULL, whereas before we read it
via the_repository, which was non-NULL...
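
To make the question concrete, here is a standalone model of the two
behaviors being contrasted. All names (model_repo, model_istate,
skip_hash_guarded, skip_hash_with_fallback) are hypothetical, the
"fallback" parameter merely stands in for the_repository, and the flag's
initial value is assumed to be 0. This is a sketch of the difference,
not a proposed patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct repository and struct index_state. */
struct model_repo { int index_skip_hash; };
struct model_istate { const struct model_repo *repo; };

/*
 * Shape of the 4/4 hunk: the setting is consulted only when the index
 * has a repo attached; otherwise the flag keeps its initial value (0).
 */
static int skip_hash_guarded(const struct model_istate *istate)
{
	if (istate->repo)
		return istate->repo->index_skip_hash;
	return 0;
}

/*
 * Alternative: when the index has no repo, fall back to a global
 * repository, which is roughly what the earlier global lookup did.
 */
static int skip_hash_with_fallback(const struct model_istate *istate,
				   const struct model_repo *fallback)
{
	const struct model_repo *r = istate->repo ? istate->repo : fallback;

	return r ? r->index_skip_hash : 0;
}
```

With a NULL repo and skip-hash enabled on the fallback, the guarded
variant returns 0 while the fallback variant returns 1, which is the
behavior change in question.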

Patch

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..f0e1d4cb2be 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,9 @@  feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index fb4d6fb6387..1844953fba7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2923,12 +2923,13 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
-	int skip_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
-		f->skip_hash = skip_hash;
+	if (istate->repo) {
+		prepare_repo_settings(istate->repo);
+		f->skip_hash = istate->repo->settings.index_skip_hash;
+	}
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@  void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@  void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@  struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@  static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "true", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 55816756607..be0a0a8a008 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -72,7 +72,18 @@  test_expect_success 'index.skipHash config option' '
 		test_trailing_hash .git/index >hash &&
 		echo $(test_oid zero) >expect &&
 		test_cmp expect hash &&
-		git fsck
+		git fsck &&
+
+		rm -f .git/index &&
+		git -c feature.manyFiles=true add a &&
+		test_trailing_hash .git/index >hash &&
+		test_cmp expect hash &&
+
+		rm -f .git/index &&
+		git -c feature.manyFiles=true \
+		    -c index.skipHash=false add a &&
+		test_trailing_hash .git/index >hash &&
+		! test_cmp expect hash
 	)
 '