mbox series

[v5,00/17] cruft packs

Message ID cover.1653088640.git.me@ttaylorr.com (mailing list archive)
Headers show
Series cruft packs | expand

Message

Taylor Blau May 20, 2022, 11:17 p.m. UTC
Here is another reroll of my series to implement "cruft packs". This is really
more like v4.1, since it has only cosmetic changes incorporated from the review
on v4 of this topic.

Since last time:

  - The new section in pack-format.txt (describing the ".mtimes" format) now
    says at the top "all 4-byte numbers are in network byte order", and avoids
    repeating "network [byte] order" throughout that section to reduce
    confusion.

  - A sub-shell in t5329 which incorrectly masked over the exit code of a "git"
    process was removed.

  - An overly-long line in builtin/repack.c::collect_pack_filenames() was
    eliminated, and matching braces are added.

...and that's pretty much it. In any case, a range-diff is included below.
Thanks again for all of the thoughtful feedback on this series.

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt | 123 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 185 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1834 insertions(+), 102 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v4:
 -:  ---------- >  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
 1:  8f9fd21be9 !  2:  91a9d21b0b pack-mtimes: support reading .mtimes files
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
      
     +== pack-*.mtimes files have the format:
     +
    ++All 4-byte numbers are in network byte order.
    ++
     +  - A 4-byte magic number '0x4d544d45' ('MTME').
     +
     +  - A 4-byte version identifier (= 1).
     +
     +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of 4-byte unsigned integers in network order. The ith
    -+    value is the modification time (mtime) of the ith object in the
    -+    corresponding pack by lexicographic (index) order. The mtimes
    -+    count standard epoch seconds.
    ++  - A table of 4-byte unsigned integers. The ith value is the
    ++    modification time (mtime) of the ith object in the corresponding
    ++    pack by lexicographic (index) order. The mtimes count standard
    ++    epoch seconds.
     +
     +  - A trailer, containing a checksum of the corresponding packfile,
     +    and a checksum of all of the above (each having length according
     +    to the specified hash function).
    -+
    -+All 4-byte numbers are in network order.
     +
      == multi-pack-index (MIDX) files have the following format:
      
 2:  cdb21236e1 =  3:  67c4e7209d pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 3:  1d775f9850 =  4:  fc86506881 chunk-format.h: extract oid_version()
 4:  6172861bd9 =  5:  788d1f96f2 pack-mtimes: support writing pack .mtimes files
 5:  5f9a9a5b7b =  6:  2a6cfb00bf t/helper: add 'pack-mtimes' test-tool
 6:  b8a38fe2e4 =  7:  edb6fcd5ec builtin/pack-objects.c: return from create_object_entry()
 7:  94fe03cc65 =  8:  e3185741f2 builtin/pack-objects.c: --cruft without expiration
 8:  da7273f41f =  9:  1cf00d462c reachable: add options to add_unseen_recent_objects_to_traversal
 9:  58fecd1747 = 10:  d66be44d9a reachable: report precise timestamps from objects in cruft packs
10:  1740b8ef01 = 11:  1434e37623 builtin/pack-objects.c: --cruft with expiration
11:  5992a72cbf ! 12:  0d3555d595 builtin/repack.c: support generating a cruft pack
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git repack &&
     +
     +		tip="$(git rev-parse cruft)" &&
    -+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
    ++		path="$objdir/$(test_oid_to_path "$tip")" &&
     +		test-tool chmtime --get +1000 "$path" >expect &&
     +
     +		git checkout main &&
12:  1b241f8f91 = 13:  4b721d3ee9 builtin/repack.c: allow configuring cruft pack generation
13:  ffae78852c = 14:  f9e3ab56b1 builtin/repack.c: use named flags for existing_packs
14:  0743e373ba ! 15:  e9f46e7b5e builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c
      static int pack_everything;
      static int delta_base_offset = 1;
     @@ builtin/repack.c: static void collect_pack_filenames(struct string_list *fname_nonkept_list,
    + 		fname = xmemdupz(e->d_name, len);
    + 
      		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
    - 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
    +-		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
    ++		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
      			string_list_append_nodup(fname_kept_list, fname);
     -		else
     -			string_list_append_nodup(fname_nonkept_list, fname);
    -+		else {
    -+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
    ++		} else {
    ++			struct string_list_item *item;
    ++			item = string_list_append_nodup(fname_nonkept_list,
    ++							fname);
     +			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
     +				item->util = (void*)(uintptr_t)CRUFT_PACK;
     +		}
15:  9f7e0acac6 = 16:  43c14eec07 builtin/gc.c: conditionally avoid pruning objects via loose
16:  07fa9d4b47 = 17:  1e313b89e8 sha1-file.c: don't freshen cruft packs

Comments

Ævar Arnfjörð Bjarmason May 21, 2022, 11:17 a.m. UTC | #1
On Fri, May 20 2022, Taylor Blau wrote:

>   - The new section in pack-format.txt (describing the ".mtimes" format) now
>     says at the top "all 4-byte numbers are in network byte order", and avoids
>     repeating "network [byte] order" throughout that section to reduce
>     confusion.

Suggestion (outside this series) since that's fixed perhaps a small
stand-alone patch to fix the existing "network order" occurances in the
same file?

> ...and that's pretty much it. In any case, a range-diff is included below.
> Thanks again for all of the thoughtful feedback on this series.

It seems this didn't make it on-list, but for the last round (well, it
seems I replied to v1 by accident) A sent this (at
https://lore.kernel.org/git/220519.86ilq14u1a.gmgdl@evledraar.gmail.com/
if it eventually shows up):
	
	Return-Path: <avarab@gmail.com>
	Received: from gmgdl (dhcp-077-248-183-071.chello.nl. [77.248.183.71])
	        by smtp.gmail.com with ESMTPSA id en21-20020a17090728d500b006fa9820b4a2sm1979775ejc.165.2022.05.19.04.54.26
	        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
	        Thu, 19 May 2022 04:54:26 -0700 (PDT)
	Received: from avar by gmgdl with local (Exim 4.95)
		(envelope-from <avarab@gmail.com>)
		id 1nrejZ-0026V8-AE;
		Thu, 19 May 2022 13:54:25 +0200
	From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com>
	To: Taylor Blau <me@ttaylorr.com>
	Cc: git@vger.kernel.org, gitster@pobox.com, larsxschneider@gmail.com,
	 peff@peff.net, tytso@mit.edu, brian m. carlson <bk2204@github.com>
	Subject: Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
	Date: Thu, 19 May 2022 13:48:42 +0200
	References: <cover.1638224692.git.me@ttaylorr.com>
	 <deece9eb70e9750bb8350946679b521e59139fe2.1638224692.git.me@ttaylorr.com>
	User-agent: Debian GNU/Linux bookworm/sid; Emacs 27.1; mu4e 1.7.12
	In-reply-to: <deece9eb70e9750bb8350946679b521e59139fe2.1638224692.git.me@ttaylorr.com>
	Message-ID: <220519.86ilq14u1a.gmgdl@evledraar.gmail.com>
	MIME-Version: 1.0
	Content-Type: text/plain
	X-TUID: ta0yc5HgmrrD
	
	
	On Mon, Nov 29 2021, Taylor Blau wrote:
	
	> +static void write_mtimes_header(struct hashfile *f)
	> +{
	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
	> +	hashwrite_be32(f, MTIMES_VERSION);
	> +	hashwrite_be32(f, oid_version(the_hash_algo));
	> +}
	
	Given the history noted in
	https://lore.kernel.org/git/RFC-patch-2.2-051f0612ab9-20220519T113538Z-avarab@gmail.com/
	maybe we can say this ship has just sailed at this point.
	
	But since this is a new format I think it's worth considering not using
	the 1 or 2 you get from oid_version(), but the "format_id",
	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
	
	You'll use the same space in the format for it, but we'll end up with
	something more obvious (as the integer encodes the sha1 or sha256 name).
	
	AFAICT this code is just copied from your earlier work on *.rev, which
	in turn seems copied from earlier work on midx & commit-graph, which
	seems to have used this way of referring to the hash version more as an
	accident than anything explicitly indended...
	
	Then again we could just say that both are equally valid at this point,
	especially given the use in adjacent formats.

I.e. do we think continuing to use 1 v.s. 2 in new formats over instead
of 0x73686131 and 0x73323536 is the right choice?

Other than that the only question I have (I think) on this series is if
Jonathan Nieder is happy with it. I looked back in my logs and there was
an extensive on-IRC discussion about it at the end of March, which ended
in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/

But it seems Jonathan didn't chime in since then, and he had some major
issues with the approach here. I think those should have been addressed
by that discussion, but it would be nice to get a confirmation.
	
> Taylor Blau (17):
>   Documentation/technical: add cruft-packs.txt
>   pack-mtimes: support reading .mtimes files
>   pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
>   chunk-format.h: extract oid_version()
>   pack-mtimes: support writing pack .mtimes files
>   t/helper: add 'pack-mtimes' test-tool
>   builtin/pack-objects.c: return from create_object_entry()
>   builtin/pack-objects.c: --cruft without expiration
>   reachable: add options to add_unseen_recent_objects_to_traversal
>   reachable: report precise timestamps from objects in cruft packs
>   builtin/pack-objects.c: --cruft with expiration
>   builtin/repack.c: support generating a cruft pack
>   builtin/repack.c: allow configuring cruft pack generation
>   builtin/repack.c: use named flags for existing_packs
>   builtin/repack.c: add cruft packs to MIDX during geometric repack
>   builtin/gc.c: conditionally avoid pruning objects via loose
>   sha1-file.c: don't freshen cruft packs
>
>  Documentation/Makefile                  |   1 +
>  Documentation/config/gc.txt             |  21 +-
>  Documentation/config/repack.txt         |   9 +
>  Documentation/git-gc.txt                |   5 +
>  Documentation/git-pack-objects.txt      |  30 +
>  Documentation/git-repack.txt            |  11 +
>  Documentation/technical/cruft-packs.txt | 123 ++++
>  Documentation/technical/pack-format.txt |  19 +
>  Makefile                                |   2 +
>  builtin/gc.c                            |  10 +-
>  builtin/pack-objects.c                  | 304 +++++++++-
>  builtin/repack.c                        | 185 +++++-
>  bulk-checkin.c                          |   2 +-
>  chunk-format.c                          |  12 +
>  chunk-format.h                          |   3 +
>  commit-graph.c                          |  18 +-
>  midx.c                                  |  18 +-
>  object-file.c                           |   4 +-
>  object-store.h                          |   7 +-
>  pack-mtimes.c                           | 126 ++++
>  pack-mtimes.h                           |  15 +
>  pack-objects.c                          |   6 +
>  pack-objects.h                          |  25 +
>  pack-write.c                            |  93 ++-
>  pack.h                                  |   4 +
>  packfile.c                              |  19 +-
>  reachable.c                             |  58 +-
>  reachable.h                             |   9 +-
>  t/helper/test-pack-mtimes.c             |  56 ++
>  t/helper/test-tool.c                    |   1 +
>  t/helper/test-tool.h                    |   1 +
>  t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
>  32 files changed, 1834 insertions(+), 102 deletions(-)
>  create mode 100644 Documentation/technical/cruft-packs.txt
>  create mode 100644 pack-mtimes.c
>  create mode 100644 pack-mtimes.h
>  create mode 100644 t/helper/test-pack-mtimes.c
>  create mode 100755 t/t5329-pack-objects-cruft.sh
>
> Range-diff against v4:
>  -:  ---------- >  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
>  1:  8f9fd21be9 !  2:  91a9d21b0b pack-mtimes: support reading .mtimes files
>     @@ Documentation/technical/pack-format.txt: Pack file entry: <+
>       
>      +== pack-*.mtimes files have the format:
>      +
>     ++All 4-byte numbers are in network byte order.
>     ++
>      +  - A 4-byte magic number '0x4d544d45' ('MTME').
>      +
>      +  - A 4-byte version identifier (= 1).
>      +
>      +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>      +
>     -+  - A table of 4-byte unsigned integers in network order. The ith
>     -+    value is the modification time (mtime) of the ith object in the
>     -+    corresponding pack by lexicographic (index) order. The mtimes
>     -+    count standard epoch seconds.
>     ++  - A table of 4-byte unsigned integers. The ith value is the
>     ++    modification time (mtime) of the ith object in the corresponding
>     ++    pack by lexicographic (index) order. The mtimes count standard
>     ++    epoch seconds.
>      +
>      +  - A trailer, containing a checksum of the corresponding packfile,
>      +    and a checksum of all of the above (each having length according
>      +    to the specified hash function).
>     -+
>     -+All 4-byte numbers are in network order.
>      +
>       == multi-pack-index (MIDX) files have the following format:
>       
>  2:  cdb21236e1 =  3:  67c4e7209d pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
>  3:  1d775f9850 =  4:  fc86506881 chunk-format.h: extract oid_version()
>  4:  6172861bd9 =  5:  788d1f96f2 pack-mtimes: support writing pack .mtimes files
>  5:  5f9a9a5b7b =  6:  2a6cfb00bf t/helper: add 'pack-mtimes' test-tool
>  6:  b8a38fe2e4 =  7:  edb6fcd5ec builtin/pack-objects.c: return from create_object_entry()
>  7:  94fe03cc65 =  8:  e3185741f2 builtin/pack-objects.c: --cruft without expiration
>  8:  da7273f41f =  9:  1cf00d462c reachable: add options to add_unseen_recent_objects_to_traversal
>  9:  58fecd1747 = 10:  d66be44d9a reachable: report precise timestamps from objects in cruft packs
> 10:  1740b8ef01 = 11:  1434e37623 builtin/pack-objects.c: --cruft with expiration
> 11:  5992a72cbf ! 12:  0d3555d595 builtin/repack.c: support generating a cruft pack
>     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
>      +		git repack &&
>      +
>      +		tip="$(git rev-parse cruft)" &&
>     -+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
>     ++		path="$objdir/$(test_oid_to_path "$tip")" &&
>      +		test-tool chmtime --get +1000 "$path" >expect &&
>      +
>      +		git checkout main &&
> 12:  1b241f8f91 = 13:  4b721d3ee9 builtin/repack.c: allow configuring cruft pack generation
> 13:  ffae78852c = 14:  f9e3ab56b1 builtin/repack.c: use named flags for existing_packs
> 14:  0743e373ba ! 15:  e9f46e7b5e builtin/repack.c: add cruft packs to MIDX during geometric repack
>     @@ builtin/repack.c
>       static int pack_everything;
>       static int delta_base_offset = 1;
>      @@ builtin/repack.c: static void collect_pack_filenames(struct string_list *fname_nonkept_list,
>     + 		fname = xmemdupz(e->d_name, len);
>     + 
>       		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
>     - 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
>     +-		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
>     ++		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
>       			string_list_append_nodup(fname_kept_list, fname);
>      -		else
>      -			string_list_append_nodup(fname_nonkept_list, fname);
>     -+		else {
>     -+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
>     ++		} else {
>     ++			struct string_list_item *item;
>     ++			item = string_list_append_nodup(fname_nonkept_list,
>     ++							fname);
>      +			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
>      +				item->util = (void*)(uintptr_t)CRUFT_PACK;
>      +		}
> 15:  9f7e0acac6 = 16:  43c14eec07 builtin/gc.c: conditionally avoid pruning objects via loose
> 16:  07fa9d4b47 = 17:  1e313b89e8 sha1-file.c: don't freshen cruft packs
Jonathan Nieder May 24, 2022, 7:39 p.m. UTC | #2
Hi,

Ævar Arnfjörð Bjarmason wrote:
> On Fri, May 20 2022, Taylor Blau wrote:
> 	On Mon, Nov 29 2021, Taylor Blau wrote:

> 	> +static void write_mtimes_header(struct hashfile *f)
> 	> +{
> 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> 	> +	hashwrite_be32(f, MTIMES_VERSION);
> 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> 	> +}
[...]
> 	But since this is a new format I think it's worth considering not using
> 	the 1 or 2 you get from oid_version(), but the "format_id",
> 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
>
> 	You'll use the same space in the format for it, but we'll end up with
> 	something more obvious (as the integer encodes the sha1 or sha256 name).

Agreed.

[...]
> Other than that the only question I have (I think) on this series is if
> Jonathan Nieder is happy with it. I looked back in my logs and there was
> an extensive on-IRC discussion about it at the end of March, which ended
> in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/
>
> But it seems Jonathan didn't chime in since then, and he had some major
> issues with the approach here. I think those should have been addressed
> by that discussion, but it would be nice to get a confirmation.

I would still prefer if this used a repository format extension, but
that preference is not strong enough that I'd say "this must not go in
without one".  What I think would help would be some information in
the user-facing documentation for commands that create and work with
cruft packs.  In other words, if our take on people sharing
repositories between implementations that understand and don't
understand cruft packs and get objects moving back and forth between
packed and loose objects is "you should have known you were doing
something strange", the least we can do is to warn them.

I don't see a config to enable PACK_CRUFT by default yet in this
series.  I'd like one, so that people can turn it on and get the good
new behavior. :)

Thanks,
Jonathan
Taylor Blau May 24, 2022, 9:50 p.m. UTC | #3
On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > On Fri, May 20 2022, Taylor Blau wrote:
> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
>
> > 	> +static void write_mtimes_header(struct hashfile *f)
> > 	> +{
> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> > 	> +}
> [...]
> > 	But since this is a new format I think it's worth considering not using
> > 	the 1 or 2 you get from oid_version(), but the "format_id",
> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
> >
> > 	You'll use the same space in the format for it, but we'll end up with
> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
>
> Agreed.

I know we recommend using the format_id for on-disk formats, but I think
there is enough existing uses of "1" or "2" that either are acceptable
in practice.

E.g., grepping around for "hashwrite.*oid_version", there are three
existing formats that use "1" or "2" instead of the format_id. They are:

  - the commit-graph format
  - the midx format
  - the .rev format

Moreover, I can't seem to find any formats that _don't_ use that
convention. So I have a vague preference towards using the values "1"
and "2" as we currently do in these patches. (TBH, I don't find "sha1"
significantly more interpretable than just "1", so I would be just as
happy leaving it as-is).

> [...]
> > Other than that the only question I have (I think) on this series is if
> > Jonathan Nieder is happy with it. I looked back in my logs and there was
> > an extensive on-IRC discussion about it at the end of March, which ended
> > in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/
> >
> > But it seems Jonathan didn't chime in since then, and he had some major
> > issues with the approach here. I think those should have been addressed
> > by that discussion, but it would be nice to get a confirmation.
>
> I would still prefer if this used a repository format extension, but
> that preference is not strong enough that I'd say "this must not go in
> without one".  What I think would help would be some information in
> the user-facing documentation for commands that create and work with
> cruft packs.  In other words, if our take on people sharing
> repositories between implementations that understand and don't
> understand cruft packs and get objects moving back and forth between
> packed and loose objects is "you should have known you were doing
> something strange", the least we can do is to warn them.

I think that's a good suggestion. We already have some documentation in
Documentation/technical/cruft-packs.txt, but I think it could be helpful
to add user-facing documentation, too.

Would you be opposed to doing that outside of this series? ISTM that the
technical discussion has mostly settled, so I'd rather wordsmith the
user-facing documentation separately.

> I don't see a config to enable PACK_CRUFT by default yet in this
> series.  I'd like one, so that people can turn it on and get the good
> new behavior. :)

`git gc` has support for this (c.f., "gc.cruftPacks"). `git repack`
requires you to pass `--cruft`; IIRC I originally had a similar
configuration in `git repack` which would change the behavior of `-A` /
`-a` when set, but I found it too confusing and scrapped it.

Thanks,
Taylor
Ævar Arnfjörð Bjarmason May 24, 2022, 9:55 p.m. UTC | #4
On Tue, May 24 2022, Taylor Blau wrote:

> On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
>> Ævar Arnfjörð Bjarmason wrote:
>> > On Fri, May 20 2022, Taylor Blau wrote:
>> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
>>
>> > 	> +static void write_mtimes_header(struct hashfile *f)
>> > 	> +{
>> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
>> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
>> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
>> > 	> +}
>> [...]
>> > 	But since this is a new format I think it's worth considering not using
>> > 	the 1 or 2 you get from oid_version(), but the "format_id",
>> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
>> >
>> > 	You'll use the same space in the format for it, but we'll end up with
>> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
>>
>> Agreed.
>
> I know we recommend using the format_id for on-disk formats, but I think
> there is enough existing uses of "1" or "2" that either are acceptable
> in practice.
>
> E.g., grepping around for "hashwrite.*oid_version", there are three
> existing formats that use "1" or "2" instead of the format_id. They are:
>
>   - the commit-graph format
>   - the midx format
>   - the .rev format
>
> Moreover, I can't seem to find any formats that _don't_ use that
> convention.

It's used in the reftable format.

> So I have a vague preference towards using the values "1"
> and "2" as we currently do in these patches.

I suspect that's less "vague" and more "c'mon, I'm using it in
production already" :)

Anyway, I'm fine with leaving it be as you have it currently. I'd first
encountered this magic with reftable I think, and only recently found
that we use 1 and 2 in these other more recent places.

> (TBH, I don't find "sha1"
> significantly more interpretable than just "1", so I would be just as
> happy leaving it as-is).

Hrm, I'd think having it sha1 or s256 in big-endian would be a bit more
self-explanatory. I.e. SHA-256 is 2, not 256, and our 3 (if that ever
arrives) is likely not to be SHA-3 (but probably some successor).
Taylor Blau May 24, 2022, 10:12 p.m. UTC | #5
On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Tue, May 24 2022, Taylor Blau wrote:
>
> > On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
> >> Ævar Arnfjörð Bjarmason wrote:
> >> > On Fri, May 20 2022, Taylor Blau wrote:
> >> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
> >>
> >> > 	> +static void write_mtimes_header(struct hashfile *f)
> >> > 	> +{
> >> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> >> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
> >> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> >> > 	> +}
> >> [...]
> >> > 	But since this is a new format I think it's worth considering not using
> >> > 	the 1 or 2 you get from oid_version(), but the "format_id",
> >> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
> >> >
> >> > 	You'll use the same space in the format for it, but we'll end up with
> >> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
> >>
> >> Agreed.
> >
> > I know we recommend using the format_id for on-disk formats, but I think
> > there is enough existing uses of "1" or "2" that either are acceptable
> > in practice.
> >
> > E.g., grepping around for "hashwrite.*oid_version", there are three
> > existing formats that use "1" or "2" instead of the format_id. They are:
> >
> >   - the commit-graph format
> >   - the midx format
> >   - the .rev format
> >
> > Moreover, I can't seem to find any formats that _don't_ use that
> > convention.
>
> It's used in the reftable format.

Ah, thanks for pointing it out. Still, I think there's enough uses of
"1" and "2" over format_id that I'm not convinced here.

> > So I have a vague preference towards using the values "1"
> > and "2" as we currently do in these patches.
>
> I suspect that's less "vague" and more "c'mon, I'm using it in
> production already" :)

No, this wasn't a veil over anything. Yes, GitHub is using this in
production already, but that isn't why I'm opposed here. I'm opposed for
the reasons I explained in the quoted bits (and would happily carry a
small amount of custom code in GitHub's fork to continue to recognize
the "1" or "2" values if this ever changed to use format_id).

> Anyway, I'm fine with leaving it be as you have it currently. I'd first
> encountered this magic with reftable I think, and only recently found
> that we use 1 and 2 in these other more recent places.

Sounds good. Unless others have a very strong opinion, let's leave it as
is.

Thanks,
Taylor
Jonathan Nieder May 25, 2022, 7:53 a.m. UTC | #6
Hi,

Taylor Blau wrote:
> On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:

>>> Moreover, I can't seem to find any formats that _don't_ use that
>>> convention.
>>
>> It's used in the reftable format.

It's also used in the formats described in
Documentation/technical/hash-function-transition.

[...]
> Sounds good. Unless others have a very strong opinion, let's leave it as
> is.

File formats are one of those things where a little time early can save
a lot of work later.  If there were a strong reason to use "1" and "2"
here then I'd be okay with living with it --- I'm a pragmatic person.
But in general, using the magic numbers instead of a sequential value is
really helpful both in making the file formats more self-explanatory and
in making it possible to experiment with multiple new hash_algos at the
same time.

The main argument I'm hearing for using "1" and "2" is "because some
other formats got that wrong".  That reason is the opposite of
compelling to me: it makes me suspect that as a project we should more
eagerly break the old bad habits and form new ones.  I guess this
qualifies as a very strong opinion.

Thanks,
Jonathan
Derrick Stolee May 25, 2022, 7:59 p.m. UTC | #7
On 5/25/2022 3:53 AM, Jonathan Nieder wrote:
> Taylor Blau wrote:
>> On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>>>> Moreover, I can't seem to find any formats that _don't_ use that
>>>> convention.
>>>
>>> It's used in the reftable format.

The use in reftable is the only one I can find and that implementation
is not idiomatic. Specifically, the way the four-byte header was
implemented is not easy to extract and share in other formats.

This series does the good work of extracting oid_version() as a
common method across these formats so it is easier to share.

> It's also used in the formats described in
> Documentation/technical/hash-function-transition.

It documents things that have not been implemented, such as the v3
pack-index format:

  Pack index (.idx) files use a new v3 format that supports multiple
  hash functions. They have the following format (all integers are in
  network byte order):
(...)
  * 4-byte number of object formats in this pack index: 2
  * For each object format:
    ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
    ** 4-byte length in bytes of shortened object names. This is the
      shortest possible length needed to make names in the shortened
      object name table unambiguous.
    ** 4-byte integer, recording where tables relating to this format
      are stored in this index file, as an offset from the beginning.

This was added in your 752414ae431 (technical doc: add a design doc
for hash function transition, 2017-09-27), but has not been acted upon
yet.

> [...]
>> Sounds good. Unless others have a very strong opinion, let's leave it as
>> is.
> 
> File formats are one of those things where a little time early can save
> a lot of work later.  If there were a strong reason to use "1" and "2"
> here then I'd be okay with living with it --- I'm a pragmatic person.
> But in general, using the magic numbers instead of a sequential value is
> really helpful both in making the file formats more self-explanatory and
> in making it possible to experiment with multiple new hash_algos at the
> same time.
> 
> The main argument I'm hearing for using "1" and "2" is "because some
> other formats got that wrong".  That reason is the opposite of
> compelling to me: it makes me suspect that as a project we should more
> eagerly break the old bad habits and form new ones.  I guess this
> qualifies as a very strong opinion.

Either way, these are magic numbers. One happens to somewhat spell
out something when looking at the file in a hex editor with ASCII
previews, but that doesn't change the fact that it is most important
that the hash function is correctly indicated by the file format and
parsed by the Git executable (not a human).

I'd much rather have a consistent and proven way of specifying the
hash value (using the oid_version() helper) than to try and make a
new mechanism.

Thanks,
-Stolee
Taylor Blau May 25, 2022, 9:09 p.m. UTC | #8
On Wed, May 25, 2022 at 03:59:24PM -0400, Derrick Stolee wrote:
> I'd much rather have a consistent and proven way of specifying the
> hash value (using the oid_version() helper) than to try and make a
> new mechanism.

To be clear, I absolutely don't think any of us should have the attitude
of repeating past bad decisions for the sake of consistency.

As best I can tell, our (Jonathan and I's) disagreement is on whether
using "1" and "2" to identify which hash function is used by the .mtimes
file is OK or not. I happen to think that it is acceptable, so the
choice to continue to adopt this pattern was motivated by being
consistent with a pattern that is good and works.

Thanks,
Taylor
Ævar Arnfjörð Bjarmason May 26, 2022, 12:06 a.m. UTC | #9
On Wed, May 25 2022, Taylor Blau wrote:

> On Wed, May 25, 2022 at 03:59:24PM -0400, Derrick Stolee wrote:
>> I'd much rather have a consistent and proven way of specifying the
>> hash value (using the oid_version() helper) than to try and make a
>> new mechanism.
>
> To be clear, I absolutely don't think any of us should have the attitude
> of repeating past bad decisions for the sake of consistency.
>
> As best I can tell, our (Jonathan and I's) disagreement is on whether
> using "1" and "2" to identify which hash function is used by the .mtimes
> file is OK or not. I happen to think that it is acceptable, so the
> choice to continue to adopt this pattern was motivated by being
> consistent with a pattern that is good and works.

I don't have a strong opinion on whether we "bless" that or not, and say
that we should just use 1, 2 etc. going forward or not.

But I do think that us doing so initially wasn't intentional, and has
been in opposition to a strongly worded claim in a comment in hash.h
(which I modified in my earlier related RFC series).

So maybe not part of this series, but it seems prudent if you feel
strongly about using this for new formats over what hash.h is currently
recommending that we have some patch sooner than later to update it
accordingly.