[26/26] midx: switch to using the_hash_algo

Message ID

20190818200427.870753-27-sandals@crustytoothpaste.net (mailing list archive)

State

New, archived

Headers

From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: <git@vger.kernel.org>
Cc: Taylor Blau <me@ttaylorr.com>,
        Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 26/26] midx: switch to using the_hash_algo
Date: Sun, 18 Aug 2019 20:04:27 +0000
Message-Id: <20190818200427.870753-27-sandals@crustytoothpaste.net>
In-Reply-To: <20190818200427.870753-1-sandals@crustytoothpaste.net>
References: <20190818200427.870753-1-sandals@crustytoothpaste.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

object_id part 17

Commit Message

brian m. carlson Aug. 18, 2019, 8:04 p.m. UTC

Instead of hard-coding the hash size, use the_hash_algo to look up the
hash size at runtime.  Remove the #define constant which was used to
hold the hash length, since writing the expression with the_hash_algo
provides enough documentary value on its own.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 midx.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
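
The idea in the commit message can be sketched outside of Git as follows. This is only an illustration under stated assumptions: `struct hash_algo_sketch`, `sketch_hash_algo`, `sketch_select()` and `sketch_min_size()` are hypothetical stand-ins, not Git's actual definitions; only the `rawsz` field name and the header size of 12 come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's hash algorithm descriptor; only the
 * raw digest size matters for this sketch. */
struct hash_algo_sketch {
	const char *name;
	size_t rawsz;	/* digest length in bytes */
};

static const struct hash_algo_sketch sketch_sha1 = { "sha1", 20 };
static const struct hash_algo_sketch sketch_sha256 = { "sha256", 32 };

/* Stand-in for the_hash_algo: chosen at runtime, not at compile time. */
static const struct hash_algo_sketch *sketch_hash_algo = &sketch_sha1;

#define SKETCH_HEADER_SIZE 12

/* Pick the algorithm the (hypothetical) repository uses. */
void sketch_select(int use_sha256)
{
	sketch_hash_algo = use_sha256 ? &sketch_sha256 : &sketch_sha1;
}

/* Header plus one raw hash, mirroring MIDX_MIN_SIZE in the patch. */
size_t sketch_min_size(void)
{
	assert(sketch_hash_algo != NULL);
	return SKETCH_HEADER_SIZE + sketch_hash_algo->rawsz;
}
```

With a fixed `#define` of 20 the minimum size could only ever describe SHA-1; reading the size through the pointer lets the same expression cover a 32-byte hash as well.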

Comments

Derrick Stolee Aug. 22, 2019, 2:04 p.m. UTC | #1

On 8/18/2019 4:04 PM, brian m. carlson wrote:
> Instead of hard-coding the hash size, use the_hash_algo to look up the
> hash size at runtime.  Remove the #define constant which was used to
> hold the hash length, since writing the expression with the_hash_algo
> provides enough documentary value on its own.

Thanks for this change! It seems to be very similar to the one
included in the commit-graph, barring one small issue below
(that we can follow up on later).

> diff --git a/midx.c b/midx.c
> index d649644420..f29afc0d2d 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -19,8 +19,7 @@
>  #define MIDX_BYTE_NUM_PACKS 8
>  #define MIDX_HASH_VERSION 1

This hash version "1" is the same as we used in the commit-graph. It's
a byte value from the file format, and we've already discussed how it
would have been better to use the 4-byte identifier, but that ship has
sailed. I'm just pointing this out to say that we are not done in this
file yet, but we can get to that when we want to test the midx with
multiple hash lengths.

>  #define MIDX_HEADER_SIZE 12
> -#define MIDX_HASH_LEN 20

The replacements of MIDX_HASH_LEN make sense. Thanks!

-Stolee

brian m. carlson Aug. 23, 2019, 2:17 a.m. UTC | #2

On 2019-08-22 at 14:04:16, Derrick Stolee wrote:
> On 8/18/2019 4:04 PM, brian m. carlson wrote:
> > diff --git a/midx.c b/midx.c
> > index d649644420..f29afc0d2d 100644
> > --- a/midx.c
> > +++ b/midx.c
> > @@ -19,8 +19,7 @@
> >  #define MIDX_BYTE_NUM_PACKS 8
> >  #define MIDX_HASH_VERSION 1
> 
> This hash version "1" is the same as we used in the commit-graph. It's
> a byte value from the file format, and we've already discussed how it
> would have been better to use the 4-byte identifier, but that ship has
> sailed. I'm just pointing this out to say that we are not done in this
> file yet, but we can get to that when we want to test the midx with
> multiple hash lengths.

My approach so far has been to assume everything in the .git directory
is in the same hash except for the translation functionality. Therefore,
it doesn't make sense to distinguish between hashes in the midx files,
because we'll never have files that differ in hash.  So essentially the
MIDX_HASH_VERSION being 1 is "whatever hash is being used in the .git
directory", not just SHA-1.

In addition, the current multi-pack index format isn't capable (from my
reading of the documentation, at least) of handling multiple hash
algorithms at once.  So we'd need a midx v2 format for folks who are
using SHA-256 with SHA-1 compatibility and we could then write separate
sets of object chunks with an appropriate format identifier, much like
the proposed pack index v3.
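
A minimal sketch of the interpretation described above, where the hash-version byte only confirms the format and the hash length comes from the repository. The function name `sketch_midx_hash_len()` is hypothetical, and the byte offset 5 for the hash version is an assumption (the patch shows `MIDX_BYTE_HASH_VERSION` only by name); the real loader calls `die()` rather than returning 0.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BYTE_HASH_VERSION 5	/* assumed offset, not shown in the patch */
#define SKETCH_HASH_VERSION 1
#define SKETCH_HEADER_SIZE 12

/*
 * Returns the hash length to use for the file, or 0 if the buffer is
 * too short or carries an unexpected hash-version byte.
 * repo_rawsz is the raw hash size of the repository's algorithm.
 */
size_t sketch_midx_hash_len(const unsigned char *data, size_t len,
			    size_t repo_rawsz)
{
	assert(data != NULL);
	if (len < SKETCH_HEADER_SIZE + repo_rawsz)
		return 0;
	if (data[SKETCH_BYTE_HASH_VERSION] != SKETCH_HASH_VERSION)
		return 0;
	/* Version 1 is read as "the repository's hash", so the length
	 * is taken from the repository, not from the file. */
	return repo_rawsz;
}
```

Under this reading the same header byte is accepted for a 20-byte and a 32-byte repository hash, which is why a file holding two algorithms at once would need a different format.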

Derrick Stolee Aug. 23, 2019, 11:53 a.m. UTC | #3

On 8/22/2019 10:17 PM, brian m. carlson wrote:
> On 2019-08-22 at 14:04:16, Derrick Stolee wrote:
>> On 8/18/2019 4:04 PM, brian m. carlson wrote:
>>> diff --git a/midx.c b/midx.c
>>> index d649644420..f29afc0d2d 100644
>>> --- a/midx.c
>>> +++ b/midx.c
>>> @@ -19,8 +19,7 @@
>>>  #define MIDX_BYTE_NUM_PACKS 8
>>>  #define MIDX_HASH_VERSION 1
>>
>> This hash version "1" is the same as we used in the commit-graph. It's
>> a byte value from the file format, and we've already discussed how it
>> would have been better to use the 4-byte identifier, but that ship has
>> sailed. I'm just pointing this out to say that we are not done in this
>> file yet, but we can get to that when we want to test the midx with
>> multiple hash lengths.
> 
> My approach so far has been to assume everything in the .git directory
> is in the same hash except for the translation functionality. Therefore,
> it doesn't make sense to distinguish between hashes in the midx files,
> because we'll never have files that differ in hash.  So essentially the
> MIDX_HASH_VERSION being 1 is "whatever hash is being used in the .git
> directory", not just SHA-1.
> 
> In addition, the current multi-pack index format isn't capable (from my
> reading of the documentation, at least) of handling multiple hash
> algorithms at once.  So we'd need a midx v2 format for folks who are
> using SHA-256 with SHA-1 compatibility and we could then write separate
> sets of object chunks with an appropriate format identifier, much like
> the proposed pack index v3.

Absolutely, it is not. It would be a great place to store a transition
table, when that is needed.

If we _never_ allow both hashes in the .git folder, then maybe we won't
ever need this and can rely on config options. I imagine that will be
tricky, and updating this byte should only help. We are not ready for
that, anyway.

Thanks,
-Stolee

Patch

diff --git a/midx.c b/midx.c
index d649644420..f29afc0d2d 100644
--- a/midx.c
+++ b/midx.c
@@ -19,8 +19,7 @@ 
 #define MIDX_BYTE_NUM_PACKS 8
 #define MIDX_HASH_VERSION 1
 #define MIDX_HEADER_SIZE 12
-#define MIDX_HASH_LEN 20
-#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + the_hash_algo->rawsz)
 
 #define MIDX_MAX_CHUNKS 5
 #define MIDX_CHUNK_ALIGNMENT 4
@@ -93,7 +92,7 @@  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
 	if (hash_version != MIDX_HASH_VERSION)
 		die(_("hash version %u does not match"), hash_version);
-	m->hash_len = MIDX_HASH_LEN;
+	m->hash_len = the_hash_algo->rawsz;
 
 	m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
 
@@ -234,7 +233,7 @@  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
 {
 	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
-			    MIDX_HASH_LEN, result);
+			    the_hash_algo->rawsz, result);
 }
 
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
@@ -928,7 +927,7 @@  static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_HASH_LEN;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
@@ -976,7 +975,7 @@  static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, MIDX_HASH_LEN, entries, nr_entries);
+				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
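
The chunk offset arithmetic changed in write_midx_internal() can be illustrated in isolation. This is a sketch, not code from midx.c: `sketch_oid_lookup_bytes()` and `sketch_next_chunk_offset()` are hypothetical helpers that restate the expression `chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz` with the hash size passed in explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by the OID lookup chunk: one raw hash per object. */
uint64_t sketch_oid_lookup_bytes(uint32_t nr_entries, size_t rawsz)
{
	assert(rawsz > 0);
	return (uint64_t)nr_entries * rawsz;
}

/* Offset of the chunk that follows the OID lookup chunk. */
uint64_t sketch_next_chunk_offset(uint64_t oid_lookup_offset,
				  uint32_t nr_entries, size_t rawsz)
{
	return oid_lookup_offset + sketch_oid_lookup_bytes(nr_entries, rawsz);
}
```

For 1000 objects the lookup chunk takes 20000 bytes with a 20-byte hash and 32000 bytes with a 32-byte hash, so every later chunk offset shifts with the hash size.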
