[0/3] Convert index writes to use hashfile API

Message ID pull.916.git.1616785928.gitgitgadget@gmail.com (mailing list archive)

Message

Derrick Stolee via GitGitGadget March 26, 2021, 7:12 p.m. UTC
As I prepare some ideas on index v5, one thing that strikes me as an
interesting direction to try is to use the chunk-format API. This would make
our extension model extremely simple (they become optional chunks, easily
identified by the table of contents).

But there is a huge hurdle to even starting that investigation: the index
uses its own hashing methods, separate from the hashfile API in csum-file.c!

The internals of the algorithms are mostly identical. The only possible
change is that the buffer sizes are different: 8KB for hashfile and 128KB in
read-cache.c. I was unable to find a performance difference in these two
implementations, despite testing on several repo sizes.

There is a subtle point about how the EOIE extension works in that it needs
a hash of just the previous extension data. This is solved by adding a new
"nested hashfile" mechanism that computes the hash at one level and then
passes the data below to another hashfile. (The good news is that this
extension will not need to exist at all if we use the chunk-format API to
manage extensions.)
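
To illustrate the nesting idea described above, here is a toy sketch in C.
It is not the code from patch 1 and does not use the real csum-file.c API.
All names (toy_hash, toy_hashfile, toy_nested_init, toy_hashwrite) are
hypothetical, the hash is 32-bit FNV-1a standing in for the real algorithm,
and the "file" is a memory buffer. The only point is the data flow: the nested
level hashes what it sees, then hands the same bytes down to the base level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a real hash context: 32-bit FNV-1a, NOT SHA-1. */
struct toy_hash {
	uint32_t state;
};

void toy_hash_init(struct toy_hash *h)
{
	h->state = 2166136261u; /* FNV-1a offset basis */
}

void toy_hash_update(struct toy_hash *h, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h->state ^= p[i];
		h->state *= 16777619u; /* FNV-1a prime */
	}
}

/*
 * Hypothetical, simplified "hashfile". It hashes every byte written to
 * it. A top-level file then stores the bytes in a memory sink; a nested
 * file (base != NULL) forwards them to its base instead.
 */
struct toy_hashfile {
	struct toy_hash ctx;
	struct toy_hashfile *base;
	unsigned char *sink;
	size_t sink_cap;
	size_t sink_len;
};

void toy_hashfile_init(struct toy_hashfile *f, unsigned char *sink, size_t cap)
{
	toy_hash_init(&f->ctx);
	f->base = NULL;
	f->sink = sink;
	f->sink_cap = cap;
	f->sink_len = 0;
}

/* Start a fresh hash that covers only what is written through 'f'. */
void toy_nested_init(struct toy_hashfile *f, struct toy_hashfile *base)
{
	toy_hash_init(&f->ctx);
	f->base = base;
	f->sink = NULL;
	f->sink_cap = 0;
	f->sink_len = 0;
}

void toy_hashwrite(struct toy_hashfile *f, const void *data, size_t len)
{
	/* Hash at this level first. */
	toy_hash_update(&f->ctx, data, len);

	if (f->base) {
		/* Nested: pass the data below, where it is hashed again. */
		toy_hashwrite(f->base, data, len);
		return;
	}

	/* Top level: append to the in-memory "file". */
	assert(f->sink_len + len <= f->sink_cap);
	memcpy(f->sink + f->sink_len, data, len);
	f->sink_len += len;
}
```

With this shape, writing the main data directly to the top-level file and the
extension data through a nested file leaves two results: the nested hash
covers only the extension bytes, while the top-level hash and output still
cover everything in order.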

Thanks, -Stolee

Derrick Stolee (3):
  csum-file: add nested_hashfile()
  read-cache: use hashfile instead of git_hash_ctx
  read-cache: delete unused hashing methods

 csum-file.c  |  22 +++++++
 csum-file.h  |   9 +++
 read-cache.c | 182 ++++++++++++++++-----------------------------------
 3 files changed, 89 insertions(+), 124 deletions(-)


base-commit: 142430338477d9d1bb25be66267225fb58498d92
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-916%2Fderrickstolee%2Findex-hashfile-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-916/derrickstolee/index-hashfile-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/916

Comments

Derrick Stolee March 26, 2021, 8:16 p.m. UTC | #1
On 3/26/2021 3:12 PM, Derrick Stolee via GitGitGadget wrote:
> As I prepare some ideas on index v5, one thing that strikes me as an
> interesting direction to try is to use the chunk-format API. This would make
> our extension model extremely simple (they become optional chunks, easily
> identified by the table of contents).
> 
> But there is a huge hurdle to even starting that investigation: the index
> uses its own hashing methods, separate from the hashfile API in csum-file.c!
> 
> The internals of the algorithms are mostly identical. The only possible
> change is that the buffer sizes are different: 8KB for hashfile and 128KB in
> read-cache.c. I was unable to find a performance difference in these two
> implementations, despite testing on several repo sizes.

Of course, shortly after sending this series (thinking I had checked all the
details carefully), I noticed that I was using "git update-index --really-refresh"
for testing, when what I really wanted was "git update-index --force-write".

In this case, I _do_ see a performance degradation using the hashfile API.
I will investigate whether this is just a poor implementation of the nesting
hashfile, or something else more tricky. Changing the buffer size doesn't do
the trick.

Please ignore this series for now. Sorry for the noise.

-Stolee