[v2,13/13] commit-graph: specify OID version for SHA-256
diff mbox series

Message ID 20181015021900.1030041-14-sandals@crustytoothpaste.net
State New
Headers show
Series
  • Base SHA-256 implementation
Related show

Commit Message

brian m. carlson Oct. 15, 2018, 2:19 a.m. UTC
Since the commit-graph code wants to serialize the hash algorithm into
the data store, specify a version number for each supported algorithm.
Note that we don't use the values of the constants themselves, as they
are internal and could change in the future.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit-graph.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Derrick Stolee Oct. 15, 2018, 3:11 p.m. UTC | #1
On 10/14/2018 10:19 PM, brian m. carlson wrote:
> Since the commit-graph code wants to serialize the hash algorithm into
> the data store, specify a version number for each supported algorithm.
> Note that we don't use the values of the constants themselves, as they
> are internal and could change in the future.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>   commit-graph.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7a28fbb03f..e587c21bb6 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
>   
>   static uint8_t oid_version(void)
>   {
> -	return 1;
> +	switch (hash_algo_by_ptr(the_hash_algo)) {
> +		case GIT_HASH_SHA1:
> +			return 1;
> +		case GIT_HASH_SHA256:
> +			return 2;
> +		default:
> +			BUG("unknown hash algorithm");
> +	}
>   }
>   
>   static struct commit_graph *alloc_commit_graph(void)
Simple and easy!

Thanks,
-Stolee
Junio C Hamano Oct. 16, 2018, 2 a.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> Since the commit-graph code wants to serialize the hash algorithm into
> the data store, specify a version number for each supported algorithm.
> Note that we don't use the values of the constants themselves, as they
> are internal and could change in the future.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  commit-graph.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7a28fbb03f..e587c21bb6 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
>  
>  static uint8_t oid_version(void)
>  {
> -	return 1;
> +	switch (hash_algo_by_ptr(the_hash_algo)) {
> +		case GIT_HASH_SHA1:
> +			return 1;
> +		case GIT_HASH_SHA256:
> +			return 2;
> +		default:
> +			BUG("unknown hash algorithm");
> +	}

Style: align switch/case.


Shouldn't this be just using GIT_HASH_* constants instead?  Are we
expecting that it would be benefitial to allow more than one "oid
version" even within a same GIT_HASH_*?

>  }
>  
>  static struct commit_graph *alloc_commit_graph(void)
Duy Nguyen Oct. 16, 2018, 3:35 p.m. UTC | #3
On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> Since the commit-graph code wants to serialize the hash algorithm into
> the data store, specify a version number for each supported algorithm.
> Note that we don't use the values of the constants themselves, as they
> are internal and could change in the future.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  commit-graph.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7a28fbb03f..e587c21bb6 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
>
>  static uint8_t oid_version(void)
>  {
> -       return 1;
> +       switch (hash_algo_by_ptr(the_hash_algo)) {
> +               case GIT_HASH_SHA1:
> +                       return 1;
> +               case GIT_HASH_SHA256:
> +                       return 2;

Should we just increase this field to uint32_t and store format_id
instead? That will keep oid version unique in all data formats.

> +               default:
> +                       BUG("unknown hash algorithm");
> +       }
>  }
>
>  static struct commit_graph *alloc_commit_graph(void)
Derrick Stolee Oct. 16, 2018, 4:01 p.m. UTC | #4
On 10/16/2018 11:35 AM, Duy Nguyen wrote:
> On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
>> Since the commit-graph code wants to serialize the hash algorithm into
>> the data store, specify a version number for each supported algorithm.
>> Note that we don't use the values of the constants themselves, as they
>> are internal and could change in the future.
>>
>> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
>> ---
>>   commit-graph.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 7a28fbb03f..e587c21bb6 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
>>
>>   static uint8_t oid_version(void)
>>   {
>> -       return 1;
>> +       switch (hash_algo_by_ptr(the_hash_algo)) {
>> +               case GIT_HASH_SHA1:
>> +                       return 1;
>> +               case GIT_HASH_SHA256:
>> +                       return 2;
> Should we just increase this field to uint32_t and store format_id
> instead? That will keep oid version unique in all data formats.
Both the commit-graph and multi-pack-index store a single byte for the 
hash version, so that ship has sailed (without incrementing the full 
file version number in each format).

It may be good to make this method accessible to both formats. I'm not 
sure if Brian's branch is built on top of the multi-pack-index code. 
Probably best to see if ds/multi-pack-verify is in the history.

Thanks,
-Stolee
Duy Nguyen Oct. 16, 2018, 4:09 p.m. UTC | #5
On Tue, Oct 16, 2018 at 6:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 10/16/2018 11:35 AM, Duy Nguyen wrote:
> > On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson
> > <sandals@crustytoothpaste.net> wrote:
> >> Since the commit-graph code wants to serialize the hash algorithm into
> >> the data store, specify a version number for each supported algorithm.
> >> Note that we don't use the values of the constants themselves, as they
> >> are internal and could change in the future.
> >>
> >> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> >> ---
> >>   commit-graph.c | 9 ++++++++-
> >>   1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/commit-graph.c b/commit-graph.c
> >> index 7a28fbb03f..e587c21bb6 100644
> >> --- a/commit-graph.c
> >> +++ b/commit-graph.c
> >> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
> >>
> >>   static uint8_t oid_version(void)
> >>   {
> >> -       return 1;
> >> +       switch (hash_algo_by_ptr(the_hash_algo)) {
> >> +               case GIT_HASH_SHA1:
> >> +                       return 1;
> >> +               case GIT_HASH_SHA256:
> >> +                       return 2;
> > Should we just increase this field to uint32_t and store format_id
> > instead? That will keep oid version unique in all data formats.
> Both the commit-graph and multi-pack-index store a single byte for the
> hash version, so that ship has sailed (without incrementing the full
> file version number in each format).

And it's probably premature to add the oid version field when multiple
hash support has not been fully realized. Now we have different ways
of storing hash id and need separate mappings.

I would go for incrementing file version. Otherwise maybe we just
update format_id to be one byte instead, and the way of storing hash
version in commit-graph will be used everywhere.

> It may be good to make this method accessible to both formats. I'm not
> sure if Brian's branch is built on top of the multi-pack-index code.
> Probably best to see if ds/multi-pack-verify is in the history.
>
> Thanks,
> -Stolee
brian m. carlson Oct. 16, 2018, 10:39 p.m. UTC | #6
On Tue, Oct 16, 2018 at 11:00:19AM +0900, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > Since the commit-graph code wants to serialize the hash algorithm into
> > the data store, specify a version number for each supported algorithm.
> > Note that we don't use the values of the constants themselves, as they
> > are internal and could change in the future.
> >
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> >  commit-graph.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7a28fbb03f..e587c21bb6 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
> >  
> >  static uint8_t oid_version(void)
> >  {
> > -	return 1;
> > +	switch (hash_algo_by_ptr(the_hash_algo)) {
> > +		case GIT_HASH_SHA1:
> > +			return 1;
> > +		case GIT_HASH_SHA256:
> > +			return 2;
> > +		default:
> > +			BUG("unknown hash algorithm");
> > +	}
> 
> Style: align switch/case.

Will fix.

> Shouldn't this be just using GIT_HASH_* constants instead?  Are we
> expecting that it would be benefitial to allow more than one "oid
> version" even within a same GIT_HASH_*?

I really would like to have us rely not rely explicitly on those values.
We don't serialize them anywhere else, and they're explicitly documented
as not being suitable for serialization.  If we were writing this fresh
today, we'd probably use the format_version field, which is designed for
this purpose, or simply omit the field altogether.
brian m. carlson Oct. 16, 2018, 10:44 p.m. UTC | #7
On Tue, Oct 16, 2018 at 06:09:41PM +0200, Duy Nguyen wrote:
> On Tue, Oct 16, 2018 at 6:01 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 10/16/2018 11:35 AM, Duy Nguyen wrote:
> > > On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson
> > > <sandals@crustytoothpaste.net> wrote:
> > >> Since the commit-graph code wants to serialize the hash algorithm into
> > >> the data store, specify a version number for each supported algorithm.
> > >> Note that we don't use the values of the constants themselves, as they
> > >> are internal and could change in the future.
> > >>
> > >> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > >> ---
> > >>   commit-graph.c | 9 ++++++++-
> > >>   1 file changed, 8 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/commit-graph.c b/commit-graph.c
> > >> index 7a28fbb03f..e587c21bb6 100644
> > >> --- a/commit-graph.c
> > >> +++ b/commit-graph.c
> > >> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
> > >>
> > >>   static uint8_t oid_version(void)
> > >>   {
> > >> -       return 1;
> > >> +       switch (hash_algo_by_ptr(the_hash_algo)) {
> > >> +               case GIT_HASH_SHA1:
> > >> +                       return 1;
> > >> +               case GIT_HASH_SHA256:
> > >> +                       return 2;
> > > Should we just increase this field to uint32_t and store format_id
> > > instead? That will keep oid version unique in all data formats.
> > Both the commit-graph and multi-pack-index store a single byte for the
> > hash version, so that ship has sailed (without incrementing the full
> > file version number in each format).
> 
> And it's probably premature to add the oid version field when multiple
> hash support has not been fully realized. Now we have different ways
> of storing hash id and need separate mappings.

Honestly, anything in the .git directory that is not the v3 pack indexes
or the loose object file should be in exactly one hash algorithm.  We
could simply just leave this value at 1 all the time and ignore the
field, since we already know what algorithm it will use.

> I would go for incrementing file version. Otherwise maybe we just
> update format_id to be one byte instead, and the way of storing hash
> version in commit-graph will be used everywhere.

It needs to be four bytes for pack files so that it's four-byte aligned.
Otherwise accessing pack index fields will cause alignment issues if we
don't massively rewrite the pack handling code.
Derrick Stolee Oct. 17, 2018, 12:21 p.m. UTC | #8
On 10/14/2018 10:19 PM, brian m. carlson wrote:
> Since the commit-graph code wants to serialize the hash algorithm into
> the data store, specify a version number for each supported algorithm.
> Note that we don't use the values of the constants themselves, as they
> are internal and could change in the future.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>   commit-graph.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7a28fbb03f..e587c21bb6 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
>   
>   static uint8_t oid_version(void)
>   {
> -	return 1;
> +	switch (hash_algo_by_ptr(the_hash_algo)) {
hash_algo_by_ptr is specified as an inline function in hash.h, so this 
leads to a linking error in jch and pu right now.

I think the fix is simply to add '#include "hash.h"' to the list of 
includes.

Thanks,
-Stolee
Duy Nguyen Oct. 17, 2018, 2:31 p.m. UTC | #9
On Wed, Oct 17, 2018 at 12:44 AM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> > > >>   static uint8_t oid_version(void)
> > > >>   {
> > > >> -       return 1;
> > > >> +       switch (hash_algo_by_ptr(the_hash_algo)) {
> > > >> +               case GIT_HASH_SHA1:
> > > >> +                       return 1;
> > > >> +               case GIT_HASH_SHA256:
> > > >> +                       return 2;
> > > > Should we just increase this field to uint32_t and store format_id
> > > > instead? That will keep oid version unique in all data formats.
> > > Both the commit-graph and multi-pack-index store a single byte for the
> > > hash version, so that ship has sailed (without incrementing the full
> > > file version number in each format).
> >
> > And it's probably premature to add the oid version field when multiple
> > hash support has not been fully realized. Now we have different ways
> > of storing hash id and need separate mappings.
>
> Honestly, anything in the .git directory that is not the v3 pack indexes
> or the loose object file should be in exactly one hash algorithm.  We
> could simply just leave this value at 1 all the time and ignore the
> field, since we already know what algorithm it will use.

In this particular case, I agree, but not as a general principle. It's
nice to have independence for fsck-like tools. I don't know if we have
a tool that simply validates commit-graph file format (and not trying
to access any real object). But for such a tool, I guess we can just
pass the hash algorithm from command line. The user would have to
guess a bit.
brian m. carlson Oct. 17, 2018, 10:38 p.m. UTC | #10
On Wed, Oct 17, 2018 at 08:21:42AM -0400, Derrick Stolee wrote:
> On 10/14/2018 10:19 PM, brian m. carlson wrote:
> > Since the commit-graph code wants to serialize the hash algorithm into
> > the data store, specify a version number for each supported algorithm.
> > Note that we don't use the values of the constants themselves, as they
> > are internal and could change in the future.
> > 
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> >   commit-graph.c | 9 ++++++++-
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7a28fbb03f..e587c21bb6 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
> >   static uint8_t oid_version(void)
> >   {
> > -	return 1;
> > +	switch (hash_algo_by_ptr(the_hash_algo)) {
> hash_algo_by_ptr is specified as an inline function in hash.h, so this leads
> to a linking error in jch and pu right now.
> 
> I think the fix is simply to add '#include "hash.h"' to the list of
> includes.

hash.h is already included by cache.h, so it should be present.  We
probably need to make it static.  I'll make that change; can you see if
it fixes the problem for you?
brian m. carlson Oct. 18, 2018, 12:06 a.m. UTC | #11
On Wed, Oct 17, 2018 at 04:31:19PM +0200, Duy Nguyen wrote:
> On Wed, Oct 17, 2018 at 12:44 AM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > Honestly, anything in the .git directory that is not the v3 pack indexes
> > or the loose object file should be in exactly one hash algorithm.  We
> > could simply just leave this value at 1 all the time and ignore the
> > field, since we already know what algorithm it will use.
> 
> In this particular case, I agree, but not as a general principle. It's
> nice to have independence for fsck-like tools. I don't know if we have
> a tool that simply validates commit-graph file format (and not trying
> to access any real object). But for such a tool, I guess we can just
> pass the hash algorithm from command line. The user would have to
> guess a bit.

I'm going to drop this patch for now.  I'll send a follow-up series
later which bumps the format version for this and the multi-pack index
and serializes them with the four-byte value.  I probably should have
caught this earlier, but unfortunately I don't always have the time to
look at every series that hits the list.
Derrick Stolee Oct. 18, 2018, 1:03 p.m. UTC | #12
On 10/17/2018 8:06 PM, brian m. carlson wrote:
> On Wed, Oct 17, 2018 at 04:31:19PM +0200, Duy Nguyen wrote:
>> On Wed, Oct 17, 2018 at 12:44 AM brian m. carlson
>> <sandals@crustytoothpaste.net> wrote:
>>> Honestly, anything in the .git directory that is not the v3 pack indexes
>>> or the loose object file should be in exactly one hash algorithm.  We
>>> could simply just leave this value at 1 all the time and ignore the
>>> field, since we already know what algorithm it will use.
>> In this particular case, I agree, but not as a general principle. It's
>> nice to have independence for fsck-like tools. I don't know if we have
>> a tool that simply validates commit-graph file format (and not trying
>> to access any real object). But for such a tool, I guess we can just
>> pass the hash algorithm from command line. The user would have to
>> guess a bit.
> I'm going to drop this patch for now.  I'll send a follow-up series
> later which bumps the format version for this and the multi-pack index
> and serializes them with the four-byte value.  I probably should have
> caught this earlier, but unfortunately I don't always have the time to
> look at every series that hits the list.
We should coordinate before incrementing the version number. I was 
working on making the file formats incremental (think split-index) but 
couldn't come up with a way to do it without incrementing the file 
format. It would be best to combine these features into v2 of each format.

Thanks,
-Stolee
brian m. carlson Oct. 19, 2018, 10:21 p.m. UTC | #13
On Thu, Oct 18, 2018 at 09:03:22AM -0400, Derrick Stolee wrote:
> We should coordinate before incrementing the version number. I was working
> on making the file formats incremental (think split-index) but couldn't come
> up with a way to do it without incrementing the file format. It would be
> best to combine these features into v2 of each format.

For the moment, I'm happy to leave things as they are, and I'll
interpret version 1 of the hash version field as whatever hash is in use
elsewhere in .git.  If you're going to bump the version in the future,
feel free to CC me on the patches, and we'll make sure that we serialize
the format_version field in the file then.

Patch
diff mbox series

diff --git a/commit-graph.c b/commit-graph.c
index 7a28fbb03f..e587c21bb6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,14 @@  char *get_commit_graph_filename(const char *obj_dir)
 
 static uint8_t oid_version(void)
 {
-	return 1;
+	switch (hash_algo_by_ptr(the_hash_algo)) {
+		case GIT_HASH_SHA1:
+			return 1;
+		case GIT_HASH_SHA256:
+			return 2;
+		default:
+			BUG("unknown hash algorithm");
+	}
 }
 
 static struct commit_graph *alloc_commit_graph(void)