mbox series

[v2,00/30] initial support for multiple hash functions

Message ID 878r8l929e.fsf@gmail.froward.int.ebiederm.org (mailing list archive)
Headers show
Series initial support for multiple hash functions | expand

Message

Eric W. Biederman Oct. 2, 2023, 2:39 a.m. UTC
This addresses all of the known test failures from v1 of this set of
changes.  In particular I have reworked commit_tree_extended which
was flagged by smatch, -Werror=array-bounds, and the leak detector.

One functional bug was fixed in repo_for_each_abbrev where it was
mistakenly displaying too many ambiguous oids.

I am posting this so that people review and testing of this patchset
won't be distracted by the known and fixed issues.



A key part of the hash function transition plan is a way that a single
git repository can inter-operate with git repositories whose storage
hash function is SHA-1 and git repositories whose storage hash function
is SHA-256.

This interoperability can defined in terms of two repositories one whose
storage hash function is SHA-1 and another whose storage hash function
is SHA-256.  Those two repositories receive exactly the same objects,
but they store them in different but equivalent ways.

For a repository that has one storage hash function to inter-operate
with a repository that has a different storage hash function requires
the first repository to be able produce it's objects as if they were
stored in the second hash function.

This series of changes focuses on implementing the pieces that allow
a repository that uses one storage hash function to produce the objects
that would have been stored with a second storage hash function.

The final patch in this series is the addition of a test that creates
two repositories one that uses SHA-1 as it's storage hash function
and the other that uses SHA-256 as it's storage hash function.
Identical operations are performed on the two repositories, and their
compatibility objects are compared to verify they are the same.
AKA the SHA-1 repository on the fly generates the objects store in
the SHA-256 repository, and the SHA-256 repository on the fly generates
the objects that are stored in the SHA-1 repository.

There are two fundamental technologies for enabling this.
- The ability to convert a stored object into the object the
  other repository would have stored.
- The ability to remember a mapping between SHA-1 and SHA-256 oids
  of equivalent objects.

With such technologies it is very easy to implement user facing changes.
To avoid locking git into poor decisions by accident I have done my best
to minimize the user facing changes, while still building the internal
infrastructure that is needed for interoperability.

All of this work is inspired by earlier work on interoperability by
"brian m. carlson" and some of the key pieces of code are still his.

To get to the point where I can test if a SHA-1 and a SHA-256 repository
can on the fly generate each other, I have made some small user-facing
changes.

git rev-parse now supports --output-object-format as a way to query
the internal mapping tables between oids and report the equivalent
oid of the other format.

git cat-file when given a oid that does not match the repositories
storage format will now attempt to find the oids equivalent object that
is stored in the repository and if found dynamically generate the object
that would have been stored in a repository with a different storage
hash function and display the object.

An additional file loose-object-index will be stored in ".git/objects/".

An additional option "extensions.compatObjectFormat" is implemented,
that generates and stores mappings between the oids of objects stored in
the repository and oids of the equivalent objects that would be stored
in a repository show storage format was extensions.compatObjectFormat.

Eric W. Biederman (23):
      object-file-convert: stubs for converting from one object format to another
      oid-array: teach oid-array to handle multiple kinds of oids
      object-names: support input of oids in any supported hash
      repository: add a compatibility hash algorithm
      loose: compatibilty short name support
      object-file: update the loose object map when writing loose objects
      object-file: add a compat_oid_in parameter to write_object_file_flags
      commit: convert mergetag before computing the signature of a commit
      commit: export add_header_signature to support handling signatures on tags
      tag: sign both hashes
      object: factor out parse_mode out of fast-import and tree-walk into in object.h
      object-file-convert: don't leak when converting tag objects
      object-file-convert: convert commits that embed signed tags
      object-file: update object_info_extended to reencode objects
      rev-parse: add an --output-object-format parameter
      builtin/cat-file: let the oid determine the output algorithm
      tree-walk: init_tree_desc take an oid to get the hash algorithm
      object-file: handle compat objects in check_object_signature
      builtin/ls-tree: let the oid determine the output algorithm
      test-lib: compute the compatibility hash so tests may use it
      t1006: rename sha1 to oid
      t1006: test oid compatibility with cat-file
      t1016-compatObjectFormat: add tests to verify the conversion between objects

brian m. carlson (7):
      loose: add a mapping between SHA-1 and SHA-256 for loose objects
      commit: write commits for both hashes
      cache: add a function to read an OID of a specific algorithm
      object-file-convert: add a function to convert trees between algorithms
      object-file-convert: convert tag objects when writing
      object-file-convert: convert commit objects when writing
      repository: implement extensions.compatObjectFormat

 Documentation/config/extensions.txt |  12 ++
 Documentation/git-rev-parse.txt     |  12 ++
 Makefile                            |   3 +
 archive.c                           |   3 +-
 builtin/am.c                        |   6 +-
 builtin/cat-file.c                  |  12 +-
 builtin/checkout.c                  |   8 +-
 builtin/clone.c                     |   2 +-
 builtin/commit.c                    |   2 +-
 builtin/fast-import.c               |  18 +-
 builtin/grep.c                      |   8 +-
 builtin/ls-tree.c                   |   5 +-
 builtin/merge.c                     |   3 +-
 builtin/pack-objects.c              |   6 +-
 builtin/read-tree.c                 |   2 +-
 builtin/rev-parse.c                 |  25 ++-
 builtin/stash.c                     |   5 +-
 builtin/tag.c                       |  45 ++++-
 cache-tree.c                        |   4 +-
 commit.c                            | 221 ++++++++++++++++-----
 commit.h                            |   1 +
 delta-islands.c                     |   2 +-
 diff-lib.c                          |   2 +-
 fsck.c                              |   6 +-
 hash-ll.h                           |   1 +
 hash.h                              |   9 +-
 http-push.c                         |   2 +-
 list-objects.c                      |   2 +-
 loose.c                             | 259 ++++++++++++++++++++++++
 loose.h                             |  22 +++
 match-trees.c                       |   4 +-
 merge-ort.c                         |  11 +-
 merge-recursive.c                   |   2 +-
 merge.c                             |   3 +-
 object-file-convert.c               | 277 ++++++++++++++++++++++++++
 object-file-convert.h               |  24 +++
 object-file.c                       | 212 ++++++++++++++++++--
 object-name.c                       |  46 +++--
 object-name.h                       |   3 +-
 object-store-ll.h                   |   7 +-
 object.c                            |   2 +
 object.h                            |  18 ++
 oid-array.c                         |  12 +-
 pack-bitmap-write.c                 |   2 +-
 packfile.c                          |   3 +-
 reflog.c                            |   2 +-
 repository.c                        |  14 ++
 repository.h                        |   4 +
 revision.c                          |   4 +-
 setup.c                             |  22 +++
 setup.h                             |   1 +
 t/helper/test-delete-gpgsig.c       |  62 ++++++
 t/helper/test-tool.c                |   1 +
 t/helper/test-tool.h                |   1 +
 t/t1006-cat-file.sh                 | 379 +++++++++++++++++++++---------------
 t/t1016-compatObjectFormat.sh       | 281 ++++++++++++++++++++++++++
 t/t1016/gpg                         |   2 +
 t/test-lib-functions.sh             |  17 +-
 tree-walk.c                         |  58 +++---
 tree-walk.h                         |   7 +-
 tree.c                              |   2 +-
 walker.c                            |   2 +-
 62 files changed, 1844 insertions(+), 349 deletions(-)

Eric

Comments

Junio C Hamano Feb. 7, 2024, 10:18 p.m. UTC | #1
"Eric W. Biederman" <ebiederm@gmail.com> writes:

> This addresses all of the known test failures from v1 of this set of
> changes.  In particular I have reworked commit_tree_extended which
> was flagged by smatch, -Werror=array-bounds, and the leak detector.
>
> One functional bug was fixed in repo_for_each_abbrev where it was
> mistakenly displaying too many ambiguous oids.
>
> I am posting this so that people review and testing of this patchset
> won't be distracted by the known and fixed issues.

We haven't seen any reviews on this second round, and have had it
outside 'next' for too long.  I am tempted to say that we merge it
to 'next' and see if anybody screams at this point.

Thanks.
Linus Arver Feb. 8, 2024, 12:24 a.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> "Eric W. Biederman" <ebiederm@gmail.com> writes:
>
>> This addresses all of the known test failures from v1 of this set of
>> changes.  In particular I have reworked commit_tree_extended which
>> was flagged by smatch, -Werror=array-bounds, and the leak detector.
>>
>> One functional bug was fixed in repo_for_each_abbrev where it was
>> mistakenly displaying too many ambiguous oids.
>>
>> I am posting this so that people review and testing of this patchset
>> won't be distracted by the known and fixed issues.
>
> We haven't seen any reviews on this second round, and have had it
> outside 'next' for too long.  I am tempted to say that we merge it
> to 'next' and see if anybody screams at this point.

FWIW out of all the "Needs review" topics this one seemed like the most
deserving of another pair of eyes, and I was planning to review some of
the patches here this week + the weekend. If my review takes too long
(taking longer than this weekend) I can give another update next week
saying "too hard for me, please don't wait for me" to unblock you from
merging to next.

Thanks.
Patrick Steinhardt Feb. 8, 2024, 6:11 a.m. UTC | #3
On Wed, Feb 07, 2024 at 04:24:13PM -0800, Linus Arver wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > "Eric W. Biederman" <ebiederm@gmail.com> writes:
> >
> >> This addresses all of the known test failures from v1 of this set of
> >> changes.  In particular I have reworked commit_tree_extended which
> >> was flagged by smatch, -Werror=array-bounds, and the leak detector.
> >>
> >> One functional bug was fixed in repo_for_each_abbrev where it was
> >> mistakenly displaying too many ambiguous oids.
> >>
> >> I am posting this so that people review and testing of this patchset
> >> won't be distracted by the known and fixed issues.
> >
> > We haven't seen any reviews on this second round, and have had it
> > outside 'next' for too long.  I am tempted to say that we merge it
> > to 'next' and see if anybody screams at this point.
> 
> FWIW out of all the "Needs review" topics this one seemed like the most
> deserving of another pair of eyes, and I was planning to review some of
> the patches here this week + the weekend. If my review takes too long
> (taking longer than this weekend) I can give another update next week
> saying "too hard for me, please don't wait for me" to unblock you from
> merging to next.

I completely lost track of this patch series. So same for me: I don't
want to hold it up, but would be happy to give it a pair of eyes next
week.

Patrick
Linus Arver Feb. 14, 2024, 7:36 a.m. UTC | #4
Linus Arver <linusa@google.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> "Eric W. Biederman" <ebiederm@gmail.com> writes:
>>
>>> This addresses all of the known test failures from v1 of this set of
>>> changes.  In particular I have reworked commit_tree_extended which
>>> was flagged by smatch, -Werror=array-bounds, and the leak detector.
>>>
>>> One functional bug was fixed in repo_for_each_abbrev where it was
>>> mistakenly displaying too many ambiguous oids.
>>>
>>> I am posting this so that people review and testing of this patchset
>>> won't be distracted by the known and fixed issues.
>>
>> We haven't seen any reviews on this second round, and have had it
>> outside 'next' for too long.  I am tempted to say that we merge it
>> to 'next' and see if anybody screams at this point.
>
> FWIW out of all the "Needs review" topics this one seemed like the most
> deserving of another pair of eyes, and I was planning to review some of
> the patches here this week + the weekend. If my review takes too long
> (taking longer than this weekend) I can give another update next week
> saying "too hard for me, please don't wait for me" to unblock you from
> merging to next.
>
> Thanks.

Unfortunately I don't think I can finish reviewing the rest of the
series (after all this time I've only been able to review just 4 out of
30 patches). I'm also stuck on trying to understand patch 5, as there is
a lot going on there.

FWIW a lot (perhaps all?) of my comments so far were around readability
and not material to the actual design or approach of anything AFAICS.
So, it's time for me to say "don't bother waiting for me" as I said
(predicted?) earlier.

Don't bother waiting for me. Thanks.
Patrick Steinhardt Feb. 15, 2024, 11:27 a.m. UTC | #5
On Sun, Oct 01, 2023 at 09:39:09PM -0500, Eric W. Biederman wrote:
> 
> This addresses all of the known test failures from v1 of this set of
> changes.  In particular I have reworked commit_tree_extended which
> was flagged by smatch, -Werror=array-bounds, and the leak detector.
> 
> One functional bug was fixed in repo_for_each_abbrev where it was
> mistakenly displaying too many ambiguous oids.
> 
> I am posting this so that people review and testing of this patchset
> won't be distracted by the known and fixed issues.

Thanks! I've reviewed this patch series up to patch 7.

I think the most important question mark I currently have is scalability
of the proposed object mapping format. The complexity to load the object
mappings is currently O(nlogn) and requires reading a file that is
easily hundreds of megabytes or even gigabytes in size.

I have a feeling that this needs to be addressed before such an object
mapping would be feasible for production use, or otherwise it would
incur too high a cost to be useful. I'm afraid that this will make the
whole series more complex to implement -- I'm sorry about that.

I've added comments and ideas regarding this issue on patch 6 and 7. It
could totally be that I'm missing the obvious though, or that my ideas
suck. Please don't hesitate to point that out if that is the case.

Patrick