[0/4] Optionally skip hashing index on write

Message ID	pull.1439.git.1670433958.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.1439.git.1670433958.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Wed, 07 Dec 2022 17:25:54 +0000 Subject: [PATCH 0/4] Optionally skip hashing index on write MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fcc: Sent To: git@vger.kernel.org Cc: gitster@pobox.com, vdye@github.com, avarab@gmail.com, newren@gmail.com, Derrick Stolee <derrickstolee@github.com> Precedence: bulk
Series	Optionally skip hashing index on write \| expand [0/4] Optionally skip hashing index on write [1/4] hashfile: allow skipping the hash function [2/4] read-cache: add index.skipHash config option [3/4] test-lib-functions: add helper for trailing hash [4/4] features: feature.manyFiles implies fast index writes

Message ID

pull.1439.git.1670433958.gitgitgadget@gmail.com (mailing list archive)

Headers

Message-Id: <pull.1439.git.1670433958.gitgitgadget@gmail.com>
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Date: Wed, 07 Dec 2022 17:25:54 +0000
Subject: [PATCH 0/4] Optionally skip hashing index on write
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Fcc: Sent
To: git@vger.kernel.org
Cc: gitster@pobox.com, vdye@github.com, avarab@gmail.com,
        newren@gmail.com, Derrick Stolee <derrickstolee@github.com>
Precedence: bulk

Series

Optionally skip hashing index on write | expand

Message

Derrick Stolee via GitGitGadget Dec. 7, 2022, 5:25 p.m. UTC

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

* Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
as a general option to the hashwrite API.
* The motivation in Patch 1 avoids the talk about the chunk-format API and
instead focuses on the index as the first example that could make use of
this.
* Patch 2 creates the index.skipHash config option and updates 'git fsck'
to ignore a null hash on the index. This is demonstrated by a very simple
test.
* Patch 3 creates a test helper to get the trailing hash of a file, and
adds a concrete check on the trailing hash when index.skipHash=true.
* Patch 4 (which could be deferred until later) adds index.skipHash=true as
an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/

Updates since RFC
=================

* The config name is now index.skipHash to make it more clear that the
default is to not skip the hash, and choosing this option requires a true
value.
* Use oidread() instead of an ad-hoc loop.
* Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
in the RFC.
* I think that patch 4 is helpful to see now, but I could understand if we
wanted to defer that patch until after index.skipHash has had more time
to bake.

Thanks, -Stolee

Derrick Stolee (4):
hashfile: allow skipping the hash function
read-cache: add index.skipHash config option
test-lib-functions: add helper for trailing hash
features: feature.manyFiles implies fast index writes

base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Comments

Junio C Hamano Dec. 7, 2022, 11:27 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Writing the index is a critical action that takes place in multiple Git
> commands. The recent performance improvements available with the sparse
> index show how often the I/O costs around the index can affect different Git
> commands, although reading the index takes place more often than a write.

The sparse-index work is great in that it offers correctness while
taking advantage of the knowledge of which part of the tree is
quiescent and unused to boost performance.  I am not sure a change
to reduce file safety can be compared with it, in that one is pure
improvement, while the other is trade-off.

As long as we will keep the "create into a new file, write it fully
and fsync + rename to the final" pattern, we do not need the trailing
checksum to protect us from a truncated output due to index-writing
process dying in the middle, so I do not mind that trade-off, though.

Protecting files from bit flipping filesystem corruption is a
different matter.  Folks at hosting sites like GitHub would know how
often they detect object corruption (I presume they do not have to
deal with the index file on the server end that often, but loose and
pack object files have the trailing checksums the same way) thanks
to the trailing checksum, and what the consequences are if we lost
that safety (I am guessing it would be minimum, though).

Thanks.

Ævar Arnfjörð Bjarmason Dec. 7, 2022, 11:42 p.m. UTC | #2

On Thu, Dec 08 2022, Junio C Hamano wrote:

> Protecting files from bit flipping filesystem corruption is a
> different matter.  Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I don't think this checksum does much for us in practice, but just on
this point in general: Extrapolating results at <hosting site> when it
comes to making general decisions about git's data safety isn't a good
idea.

I don't know about GitHub's hardware, but servers almost universally use
ECC ram, and tend to use things like error-correcting filesystem RAID
etc.

Data in that area is really interesting when it comes to running git in
that sort of setup, but it really shouldn't be extrapolated to git's
userbase in general.

A lot of those users will be using cheap memory and/or storage devices
without any error correction.

They're also likely to stress our reliability guarantees in other ways,
e.g. yanking their power cord (or equivalent), which a server typically
won't need to deal with.

Derrick Stolee Dec. 8, 2022, 4:38 p.m. UTC | #3

On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
> 
> The sparse-index work is great in that it offers correctness while
> taking advantage of the knowledge of which part of the tree is
> quiescent and unused to boost performance.  I am not sure a change
> to reduce file safety can be compared with it, in that one is pure
> improvement, while the other is trade-off.

I agree that this is a trade-off, and we should both be careful about
whether or not we even make this a possibility for certain file
formats. The index is an interesting case for a couple reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in
   the background (commit-graph, multi-pack-index) or are super-
   critical to correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an
   index for a long time with many staged changes. That's the condition
   that's required for losing an index file to cause a loss of work
   (or maybe I'm missing something). Outside of staged changes, the
   index can be completely destroyed and rewritten with minimal impact
   to the user.

> As long as we will keep the "create into a new file, write it fully
> and fsync + rename to the final" pattern, we do not need the trailing
> checksum to protect us from a truncated output due to index-writing
> process dying in the middle, so I do not mind that trade-off, though.
> 
> Protecting files from bit flipping filesystem corruption is a
> different matter.  Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I agree that we need to be careful about which files get this
treatement.

But I also want to point out that I'm not using hosting servers as
evidence that this has worked in practice, but instead many developer
machines in large monorepos who have had this enabled (via the
microsoft/git fork) for years. We've not come across an instance where
this loss of a trailing hash has been an issue.

Thanks,
-Stolee

Jacob Keller Dec. 12, 2022, 10:22 p.m. UTC | #4

On Thu, Dec 8, 2022 at 8:50 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> Writing the index is a critical action that takes place in multiple Git
> >> commands. The recent performance improvements available with the sparse
> >> index show how often the I/O costs around the index can affect different Git
> >> commands, although reading the index takes place more often than a write.
> >
> > The sparse-index work is great in that it offers correctness while
> > taking advantage of the knowledge of which part of the tree is
> > quiescent and unused to boost performance.  I am not sure a change
> > to reduce file safety can be compared with it, in that one is pure
> > improvement, while the other is trade-off.
>
> I agree that this is a trade-off, and we should both be careful about
> whether or not we even make this a possibility for certain file
> formats. The index is an interesting case for a couple reasons:
>
> 1. Writes block users. Writing the index takes place in many user-
>    blocking foreground operations. The speed improvement directly
>    impacts their use. Other file formats are typically written in
>    the background (commit-graph, multi-pack-index) or are super-
>    critical to correctness (pack-files).
>
> 2. Index files are short lived. It is rare that a user leaves an
>    index for a long time with many staged changes. That's the condition
>    that's required for losing an index file to cause a loss of work
>    (or maybe I'm missing something). Outside of staged changes, the
>    index can be completely destroyed and rewritten with minimal impact
>    to the user.
>

Is this information in the commit messages somewhere? I didn't see it
in the cover letter. Nor did I see any other explanation in the cover
letter besides "this makes it faster". I would expect such trade off
or analysis of "what do we lose" to be in the cover letter, as it may
not be clear otherwise.

I do agree these reasons are good, but it can be confusing to later
reviewers when looking back at the code for an option like this and
wondering why it exists.

Thanks,
Jake