mbox series

[v4,00/25] multi-pack reachability bitmaps

Message ID cover.1629821743.git.me@ttaylorr.com (mailing list archive)
Headers show
Series multi-pack reachability bitmaps | expand

Message

Taylor Blau Aug. 24, 2021, 4:15 p.m. UTC
Here is what I anticipate to be a final reroll of my series to implement
multi-pack reachability bitmaps, based on review feedback from Peff.

Most of the change since last time are cosmetic test clean-ups. The previous
reroll of this series incorporated feedback from a discussion[1] surrounding the
`multi-pack-index` builtin's `--object-dir` argument. This reroll fixes a bug
discussed here[2] where we should have been calling close_object_store() but
weren't; the remainder of that bug has already been dealt with.

Thanks everybody for dealing with multiple versions of this quite lengthy and
complicated series. Hopefully we are done in this round and can move on to
integrating this with `git repack`, which will complete the MIDX bitmaps topic.

[1]: https://lore.kernel.org/git/YQMFIljXl7sAAA%2FL@nand.local/
[2]: https://lore.kernel.org/git/YRWBZJDCVyUOhk2F@coredump.intra.peff.net/

Jeff King (2):
  t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP

Taylor Blau (23):
  pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  pack-bitmap-write.c: free existing bitmaps
  Documentation: describe MIDX-based bitmaps
  midx: clear auxiliary .rev after replacing the MIDX
  midx: reject empty `--preferred-pack`'s
  midx: infer preferred pack when not given one
  midx: close linked MIDXs, avoid leaking memory
  midx: avoid opening multiple MIDXs when writing
  pack-bitmap.c: introduce 'bitmap_num_objects()'
  pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  pack-bitmap.c: avoid redundant calls to try_partial_reuse
  pack-bitmap: read multi-pack bitmaps
  pack-bitmap: write multi-pack bitmaps
  t5310: move some tests to lib-bitmap.sh
  t/helper/test-read-midx.c: add --checksum mode
  t5326: test multi-pack bitmap behavior
  t5319: don't write MIDX bitmaps in t5319
  t7700: update to work with MIDX bitmap test knob
  midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  p5310: extract full and partial bitmap tests
  p5326: perf tests for MIDX bitmaps

 Documentation/git-multi-pack-index.txt       |  18 +-
 Documentation/technical/bitmap-format.txt    |  71 ++-
 Documentation/technical/multi-pack-index.txt |  10 +-
 builtin/multi-pack-index.c                   |   2 +
 builtin/pack-objects.c                       |   8 +-
 builtin/repack.c                             |  12 +-
 ci/run-build-and-tests.sh                    |   1 +
 midx.c                                       | 319 +++++++++++-
 midx.h                                       |   5 +
 pack-bitmap-write.c                          |  79 ++-
 pack-bitmap.c                                | 499 ++++++++++++++++---
 pack-bitmap.h                                |   9 +-
 packfile.c                                   |   2 +-
 t/README                                     |   4 +
 t/helper/test-read-midx.c                    |  16 +-
 t/lib-bitmap.sh                              | 240 +++++++++
 t/perf/lib-bitmap.sh                         |  69 +++
 t/perf/p5310-pack-bitmaps.sh                 |  65 +--
 t/perf/p5326-multi-pack-bitmaps.sh           |  43 ++
 t/t0410-partial-clone.sh                     |  12 +-
 t/t5310-pack-bitmaps.sh                      | 231 +--------
 t/t5319-multi-pack-index.sh                  |  20 +-
 t/t5326-multi-pack-bitmaps.sh                | 286 +++++++++++
 t/t7700-repack.sh                            |  18 +-
 24 files changed, 1603 insertions(+), 436 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

Range-diff against v3:
 1:  fa4cbed48e =  1:  92dc0bbc0d pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
 2:  2b15c1fc5c =  2:  979276bc74 pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
 3:  2ad513a230 =  3:  8f00493955 pack-bitmap-write.c: free existing bitmaps
 4:  8da5de7c24 =  4:  bc7db926d8 Documentation: describe MIDX-based bitmaps
 5:  49297f57ed =  5:  771741844b midx: clear auxiliary .rev after replacing the MIDX
 6:  c5513f2a75 =  6:  dab5dbf228 midx: reject empty `--preferred-pack`'s
 7:  53ef0a6d67 =  7:  31f4517de0 midx: infer preferred pack when not given one
 8:  114773d9cd =  8:  aa3bd96d9b midx: close linked MIDXs, avoid leaking memory
 9:  40cff5beb5 !  9:  c9fea31fa8 midx: avoid opening multiple MIDXs when writing
    @@ Commit message
         one and should invalidate the object store's memory of any MIDX that
         might have existed beforehand.
     
    +    Note that this now forbids passing object directories that don't belong
    +    to alternate repositories over `--object-dir`, since before we would
    +    have happily opened a MIDX in any directory, but now restrict ourselves
    +    to only those reachable by `r->objects->multi_pack_index` (and alternate
    +    MIDXs that we can see by walking the `next` pointer).
    +
    +    As far as I can tell, supporting arbitrary directories with
    +    `--object-dir` was a historical accident, since even the documentation
    +    says `<alt>` when referring to the value passed to this option.
    +
    +    A future patch could clean this up and provide a warning() when a
    +    non-alternate directory was given, since we'll still write a new MIDX
    +    there, we just won't reuse any MIDX that might happen to already exist
    +    in that directory.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## midx.c ##
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +			break;
     +		}
     +	}
    -+	if (!ctx.m)
    -+		ctx.m = get_local_multi_pack_index(the_repository);
      
      	if (ctx.m && !midx_checksum_valid(ctx.m)) {
      		warning(_("ignoring existing multi-pack-index; checksum mismatch"));
    +@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    + 	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
    + 
    + 	if (ctx.m)
    +-		close_midx(ctx.m);
    ++		close_object_store(the_repository->objects);
    + 
    + 	if (ctx.nr - dropped_packs == 0) {
    + 		error(_("no pack files to index."));
     @@ midx.c: int write_midx_file(const char *object_dir,
      		    const char *preferred_pack_name,
      		    unsigned flags)
10:  ca7f726abf ! 10:  ee72fb7e38 pack-bitmap.c: introduce 'bitmap_num_objects()'
    @@ pack-bitmap.c: static void show_extended_objects(struct bitmap_index *bitmap_git
      
      		obj = eindex->objects[i];
     @@ pack-bitmap.c: static void filter_bitmap_exclude_type(struct bitmap_index *bitmap_git,
    - 	 * individually.
    + 	 * them individually.
      	 */
      	for (i = 0; i < eindex->count; i++) {
     -		uint32_t pos = i + bitmap_git->pack->num_objects;
11:  67e6897a34 = 11:  ede0bf1ce1 pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
12:  743a1a138e = 12:  df6844def0 pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
13:  a3b641b3e6 = 13:  4e06f051a7 pack-bitmap.c: avoid redundant calls to try_partial_reuse
14:  141ff83275 = 14:  a0d73eb3d3 pack-bitmap: read multi-pack bitmaps
15:  54600b5814 ! 15:  9d83ad77ab pack-bitmap: write multi-pack bitmaps
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
      
     -	if (ctx.m)
    --		close_midx(ctx.m);
    +-		close_object_store(the_repository->objects);
     -
      	if (ctx.nr - dropped_packs == 0) {
      		error(_("no pack files to index."));
    @@ midx.c: static int write_midx_internal(const char *object_dir,
     +		}
     +	}
     +
    -+	close_midx(ctx.m);
    ++	close_object_store(the_repository->objects);
      
      	commit_lock_file(&lk);
      
16:  168b7b0976 ! 16:  a92af89884 t5310: move some tests to lib-bitmap.sh
    @@ Commit message
     
      ## t/lib-bitmap.sh ##
     @@
    -+# Helpers for scripts testing bitamp functionality; see t5310 for
    ++# Helpers for scripts testing bitmap functionality; see t5310 for
     +# example usage.
     +
      # Compare a file containing rev-list bitmap traversal output to its non-bitmap
17:  60ec8b3466 ! 17:  d47aa4a919 t/helper/test-read-midx.c: add --checksum mode
    @@ t/lib-bitmap.sh: have_delta () {
      }
     +
     +midx_checksum () {
    -+	test-tool read-midx --checksum "${1:-.git/objects}"
    ++	test-tool read-midx --checksum "$1"
     +}
18:  3258ccfc1c ! 18:  9d9d9f28a6 t5326: test multi-pack bitmap behavior
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	git repack -ad &&
     +	git multi-pack-index write --bitmap &&
     +	test_path_is_file $midx &&
    -+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
    ++	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    ++	test_path_is_file $midx-$(midx_checksum $objdir).rev
     +'
     +
     +basic_bitmap_tests
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	for i in $(test_seq 1 16)
     +	do
     +		test_commit "$i" &&
    -+		git repack -d
    ++		git repack -d || return 1
     +	done &&
     +
     +	git checkout -b other2 HEAD~8 &&
     +	for i in $(test_seq 1 8)
     +	do
     +		test_commit "side-$i" &&
    -+		git repack -d
    ++		git repack -d || return 1
     +	done &&
     +	git checkout second
     +'
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	test_line_count = 25 packs &&
     +
     +	test_path_is_file $midx &&
    -+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
    ++	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    ++	test_path_is_file $midx-$(midx_checksum $objdir).rev
     +'
     +
     +basic_bitmap_tests
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	git multi-pack-index write --bitmap &&
     +
     +	test_commit respect--no-bitmap &&
    -+	GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
    ++	git repack -d &&
     +
     +	test_path_is_file $midx &&
     +	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    ++	test_path_is_file $midx-$(midx_checksum $objdir).rev &&
     +
     +	git multi-pack-index write --no-bitmap &&
     +
     +	test_path_is_file $midx &&
    -+	test_path_is_missing $midx-$(midx_checksum $objdir).bitmap
    ++	test_path_is_missing $midx-$(midx_checksum $objdir).bitmap &&
    ++	test_path_is_missing $midx-$(midx_checksum $objdir).rev
     +'
     +
     +test_expect_success 'setup midx with base from later pack' '
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +			git config core.multiPackIndex true &&
     +			if test "MIDX" = "$from"
     +			then
    -+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad &&
    ++				git repack -Ad &&
     +				git multi-pack-index write --bitmap
     +			else
    -+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
    ++				git repack -Adb
     +			fi
     +		)
     +	'
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +
     +			if test "MIDX" = "$to"
     +			then
    -+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
    ++				git repack -d &&
     +				git multi-pack-index write --bitmap
     +			else
    -+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
    ++				git repack -Adb
     +			fi
     +		)
     +	'
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	test_commit loose &&
     +	git multi-pack-index write --bitmap 2>err &&
     +	test_path_is_file $midx &&
    -+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
    ++	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    ++	test_path_is_file $midx-$(midx_checksum $objdir).rev
     +'
     +
     +basic_bitmap_tests HEAD~
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +
     +		# Write a MIDX and bitmap; remove the MIDX but leave the bitmap.
     +		stale_bitmap=$midx-$(midx_checksum $objdir).bitmap &&
    ++		stale_rev=$midx-$(midx_checksum $objdir).rev &&
     +		rm $midx &&
     +
     +		# Then write a new MIDX.
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +
     +		test_path_is_file $midx &&
     +		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    -+		test_path_is_missing $stale_bitmap
    ++		test_path_is_file $midx-$(midx_checksum $objdir).rev &&
    ++		test_path_is_missing $stale_bitmap &&
    ++		test_path_is_missing $stale_rev
     +	)
     +'
     +
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +		git multi-pack-index write --bitmap &&
     +		test_path_is_file $midx &&
     +		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
    ++		test_path_is_file $midx-$(midx_checksum $objdir).rev &&
     +
     +		test-tool bitmap list-commits | sort >bitmaps &&
     +		comm -13 bitmaps commits >before &&
19:  47c7e6bb9b = 19:  3e0da7e5ed t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
20:  6a708858b1 = 20:  4e0d49a2dd t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
21:  1eaa744b24 = 21:  47eba8ecf9 t5319: don't write MIDX bitmaps in t5319
22:  a4a899e31f = 22:  3d78afa2ad t7700: update to work with MIDX bitmap test knob
23:  50865e52a3 = 23:  c2f94e033d midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
24:  0f1fd6e7d4 = 24:  6b03016c99 p5310: extract full and partial bitmap tests
25:  82e8133bf4 = 25:  d98faa4c2c p5326: perf tests for MIDX bitmaps

Comments

Jeff King Aug. 25, 2021, 12:28 a.m. UTC | #1
On Tue, Aug 24, 2021 at 12:15:47PM -0400, Taylor Blau wrote:

> Range-diff against v3:
> [...]
>  9:  40cff5beb5 !  9:  c9fea31fa8 midx: avoid opening multiple MIDXs when writing
>     @@ Commit message
>          one and should invalidate the object store's memory of any MIDX that
>          might have existed beforehand.
>      
>     +    Note that this now forbids passing object directories that don't belong
>     +    to alternate repositories over `--object-dir`, since before we would
>     +    have happily opened a MIDX in any directory, but now restrict ourselves
>     +    to only those reachable by `r->objects->multi_pack_index` (and alternate
>     +    MIDXs that we can see by walking the `next` pointer).
>     +
>     +    As far as I can tell, supporting arbitrary directories with
>     +    `--object-dir` was a historical accident, since even the documentation
>     +    says `<alt>` when referring to the value passed to this option.
>     +
>     +    A future patch could clean this up and provide a warning() when a
>     +    non-alternate directory was given, since we'll still write a new MIDX
>     +    there, we just won't reuse any MIDX that might happen to already exist
>     +    in that directory.
>     +

So this is definitely fixed as we discussed. But since that discussion,
we've had the thread over in:

  https://lore.kernel.org/git/20210820195558.44275-1-johannes@sipsolutions.net/

and its siblings:

  https://lore.kernel.org/git/20210823094049.44136-1-johannes@sipsolutions.net/

  https://lore.kernel.org/git/20210823171011.80588-1-johannes@sipsolutions.net/

It's not clear to me that we have a resolution on whether calling "cd ..
&& git multi-pack-index write --object-dir repo.git" is supposed to
work.

It has traditionally worked (at least for trivial cases, AFAICT), but I
find the behavior surprising and unlike most of the rest of Git, and I'm
not at all certain that there aren't subtle bugs lurking (basically
anything that wants to do object lookup, like oh say, a bitmap
generator).

But if we do want to support it, then we have to find a different
solution here, don't we?  I think the least-painful version of that is
probably recording _whether_ we found ctx.m in the_repository's
object_store, and switching behavior based on that (e.g., calling
close_midx() versus close_object_store() depending).

-Peff
Taylor Blau Aug. 25, 2021, 2:10 a.m. UTC | #2
On Tue, Aug 24, 2021 at 08:28:36PM -0400, Jeff King wrote:
> On Tue, Aug 24, 2021 at 12:15:47PM -0400, Taylor Blau wrote:
>
> > Range-diff against v3:
> > [...]
> >  9:  40cff5beb5 !  9:  c9fea31fa8 midx: avoid opening multiple MIDXs when writing
> >     @@ Commit message
> >          one and should invalidate the object store's memory of any MIDX that
> >          might have existed beforehand.
> >
> >     +    Note that this now forbids passing object directories that don't belong
> >     +    to alternate repositories over `--object-dir`, since before we would
> >     +    have happily opened a MIDX in any directory, but now restrict ourselves
> >     +    to only those reachable by `r->objects->multi_pack_index` (and alternate
> >     +    MIDXs that we can see by walking the `next` pointer).
> >     +
> >     +    As far as I can tell, supporting arbitrary directories with
> >     +    `--object-dir` was a historical accident, since even the documentation
> >     +    says `<alt>` when referring to the value passed to this option.
> >     +
> >     +    A future patch could clean this up and provide a warning() when a
> >     +    non-alternate directory was given, since we'll still write a new MIDX
> >     +    there, we just won't reuse any MIDX that might happen to already exist
> >     +    in that directory.
> >     +
>
> So this is definitely fixed as we discussed. But since that discussion,
> we've had the thread over in:
>
>   https://lore.kernel.org/git/20210820195558.44275-1-johannes@sipsolutions.net/
>
> and its siblings:
>
>   https://lore.kernel.org/git/20210823094049.44136-1-johannes@sipsolutions.net/
>
>   https://lore.kernel.org/git/20210823171011.80588-1-johannes@sipsolutions.net/
>
> It's not clear to me that we have a resolution on whether calling "cd ..
> && git multi-pack-index write --object-dir repo.git" is supposed to
> work.

My recommendation would be to do the following things, all in a reroll
of this series:

  - Fix the bug by which we would delete a .rev or .bitmap file out of a
    different object store than we were working in (when the caller
    passes `--object-dir`).

  - Disallow running `git multi-pack-index` outside of a Git repository.

  - Restrict `--object-dir` to only work with alternates of the
    repository in the current working directory.

To me, that seems like both the least-surprising behavior, and what
would lend itself to the easiest implementation. I would probably argue
that the existing behavior (where `--object-dir` would work against
arbitrary repositories) is a bug, and shouldn't continue to be
supported.

So my plan would be to do that, which would generate something like the
following range-diff. If nobody has any objections, I'd like to send
what I currently have in ttaylorr/git on GitHub in the
tb/multi-pack-bitmaps branch as a reroll of this series, and then merge
that early in the cycle to give it a chance to be tested before we cut
2.34.

Thanks,
Taylor
Taylor Blau Aug. 25, 2021, 2:13 a.m. UTC | #3
On Tue, Aug 24, 2021 at 10:10:12PM -0400, Taylor Blau wrote:
> My recommendation would be to do the following things, all in a reroll
> of this series:

For what it's worth, the substantive changes (which I have not figured
out how to include in a range-diff since they are entirely new patches)
are these:

  - Replacing Johannes' patch with:
    https://github.com/ttaylorr/git/commit/2b1afbd516a75bb43a8aae6ff1cac6a83ed7f589,

  - and then adding another patch immediately after it:
    https://github.com/git/git/commit/0a2d4d8dbf3c50eb3e2b659d1dcdf432d3b4d223

...and otherwise keeping the remainder of the series unchanged.

Thanks,
Taylor
Jeff King Aug. 25, 2021, 7:36 a.m. UTC | #4
On Tue, Aug 24, 2021 at 10:10:12PM -0400, Taylor Blau wrote:

> > It's not clear to me that we have a resolution on whether calling "cd ..
> > && git multi-pack-index write --object-dir repo.git" is supposed to
> > work.
> 
> My recommendation would be to do the following things, all in a reroll
> of this series:
> 
>   - Fix the bug by which we would delete a .rev or .bitmap file out of a
>     different object store than we were working in (when the caller
>     passes `--object-dir`).
> 
>   - Disallow running `git multi-pack-index` outside of a Git repository.
> 
>   - Restrict `--object-dir` to only work with alternates of the
>     repository in the current working directory.
> 
> To me, that seems like both the least-surprising behavior, and what
> would lend itself to the easiest implementation. I would probably argue
> that the existing behavior (where `--object-dir` would work against
> arbitrary repositories) is a bug, and shouldn't continue to be
> supported.

All of those seem reasonable to me, and are what I would suggest if we
were starting from scratch. My only hesitation is whether people are
using the weird behavior of --object-dir in the wild (e.g., are bup
folks relying on it).

Johannes, is this something you're using _now_, and it works, or
something you hoped to use in the future?

In a sense, "hope to use" does not make you any less disappointed. ;)
But what I'm wondering is whether using --object-dir from outside a repo
entirely is actually something that even works. I.e., would we be
disabling a behavior that was not intended, but does happen to work? Or
are we closing off a possibly buggy and half-working part of the system?

-Peff
Johannes Berg Aug. 25, 2021, 7:48 a.m. UTC | #5
On Wed, 2021-08-25 at 03:36 -0400, Jeff King wrote:
> On Tue, Aug 24, 2021 at 10:10:12PM -0400, Taylor Blau wrote:
> 
> > > It's not clear to me that we have a resolution on whether calling "cd ..
> > > && git multi-pack-index write --object-dir repo.git" is supposed to
> > > work.
> > 
> > My recommendation would be to do the following things, all in a reroll
> > of this series:
> > 
> >   - Fix the bug by which we would delete a .rev or .bitmap file out of a
> >     different object store than we were working in (when the caller
> >     passes `--object-dir`).

That was what my patch did, afaict.

> >   - Disallow running `git multi-pack-index` outside of a Git repository.
> > 
> >   - Restrict `--object-dir` to only work with alternates of the
> >     repository in the current working directory.
> > 
> > To me, that seems like both the least-surprising behavior, and what
> > would lend itself to the easiest implementation. I would probably argue
> > that the existing behavior (where `--object-dir` would work against
> > arbitrary repositories) is a bug, and shouldn't continue to be
> > supported.
> 
> All of those seem reasonable to me, and are what I would suggest if we
> were starting from scratch. My only hesitation is whether people are
> using the weird behavior of --object-dir in the wild (e.g., are bup
> folks relying on it).
> 
> Johannes, is this something you're using _now_, and it works, or
> something you hoped to use in the future?

I was "hoping" to use

	git multi-pack-index --object-dir=... write

but never

	$ git multi-pack-index write --object-dir=...

which almost seems like it really is more like

	$ git -C ... multi-pack-index write

anyway, because you specify a repo? At least per the above example, I
never tried.


As I started playing with that again (I had done before, and it worked)
I noticed the segfault, hence my previous patch.


However, what I was thinking of doing is more outlined in this thread:
https://lore.kernel.org/git/20210820195558.44275-1-johannes@sipsolutions.net/


And essentially, as I described later in
https://lore.kernel.org/git/dbb24573efc3dd945acd8acdfd9fe627ad7cbcd2.camel@sipsolutions.net/

I have two only vaguely overlapping use cases.

One of them doesn't need "--object-dir", and the other requires that
[RFC PATCH] to be applied as well, which would basically let me use only
the small subset of git that is "git multi-pack-index" as machinery to
*just* do indexing, *without* really ever having a real "repository"
that git could otherwise operate on and worry about the actual objects
etc.

I might resend that with the code style issues fixed, but the objects
seemed more fundamental.

> But what I'm wondering is whether using --object-dir from outside a repo
> entirely is actually something that even works. I.e., would we be
> disabling a behavior that was not intended, but does happen to work? Or
> are we closing off a possibly buggy and half-working part of the system?

Well, it does work now, modulo the segfault, but that is actually a very
recent addition, I'd tried this before :)

johannes
Taylor Blau Aug. 26, 2021, 6:49 p.m. UTC | #6
On Wed, Aug 25, 2021 at 03:36:15AM -0400, Jeff King wrote:
> On Tue, Aug 24, 2021 at 10:10:12PM -0400, Taylor Blau wrote:
>
> > > It's not clear to me that we have a resolution on whether calling "cd ..
> > > && git multi-pack-index write --object-dir repo.git" is supposed to
> > > work.
> >
> > My recommendation would be to do the following things, all in a reroll
> > of this series:
> >
> >   - Fix the bug by which we would delete a .rev or .bitmap file out of a
> >     different object store than we were working in (when the caller
> >     passes `--object-dir`).
> >
> >   - Disallow running `git multi-pack-index` outside of a Git repository.
> >
> >   - Restrict `--object-dir` to only work with alternates of the
> >     repository in the current working directory.
> >
> > To me, that seems like both the least-surprising behavior, and what
> > would lend itself to the easiest implementation. I would probably argue
> > that the existing behavior (where `--object-dir` would work against
> > arbitrary repositories) is a bug, and shouldn't continue to be
> > supported.
>
> All of those seem reasonable to me, and are what I would suggest if we
> were starting from scratch. My only hesitation is whether people are
> using the weird behavior of --object-dir in the wild (e.g., are bup
> folks relying on it).
>
> Johannes, is this something you're using _now_, and it works, or
> something you hoped to use in the future?

I did some research[1] on what parts of `--object-dir` have worked (and not
worked) in the past, and came to the conclusion that although this
behavior is surprising, we do bear the responsibility of continuing to
maintain it.

And in that sense, I agree with your "only call close_object_store() if
the MIDX we are using came from the object store, or otherwise call
close_midx() if it didn't", so that's what I did in the
tb/multi-pack-bitmaps branch of my fork[2].

I think that this is the most reasonable path forward, since it resolves
Johannes' concerns while also not breaking any existing functionality in
the meantime as we add new features on top. It has the added benefit of
closing some holes that were open in the past, so I think that it's
worth doing.

Before I drop 27 patches onto the inboxes of list subscribers, would you
mind taking a look at [1] (and the rest of the patches in [2]) to make
sure that you're OK with the approach too?

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git/commit/a24290489c2b30f3caed7e33fe8f85226a12778f
[2]: https://github.com/ttaylorr/git/compare/tb/multi-pack-bitmaps
Taylor Blau Aug. 26, 2021, 9:22 p.m. UTC | #7
On Thu, Aug 26, 2021 at 02:49:10PM -0400, Taylor Blau wrote:
> On Wed, Aug 25, 2021 at 03:36:15AM -0400, Jeff King wrote:
> > On Tue, Aug 24, 2021 at 10:10:12PM -0400, Taylor Blau wrote:
> >
> > > > It's not clear to me that we have a resolution on whether calling "cd ..
> > > > && git multi-pack-index write --object-dir repo.git" is supposed to
> > > > work.
> > >
> > > My recommendation would be to do the following things, all in a reroll
> > > of this series:
> > >
> > >   - Fix the bug by which we would delete a .rev or .bitmap file out of a
> > >     different object store than we were working in (when the caller
> > >     passes `--object-dir`).
> > >
> > >   - Disallow running `git multi-pack-index` outside of a Git repository.
> > >
> > >   - Restrict `--object-dir` to only work with alternates of the
> > >     repository in the current working directory.
> > >
> > > To me, that seems like both the least-surprising behavior, and what
> > > would lend itself to the easiest implementation. I would probably argue
> > > that the existing behavior (where `--object-dir` would work against
> > > arbitrary repositories) is a bug, and shouldn't continue to be
> > > supported.
> >
> > All of those seem reasonable to me, and are what I would suggest if we
> > were starting from scratch. My only hesitation is whether people are
> > using the weird behavior of --object-dir in the wild (e.g., are bup
> > folks relying on it).
> >
> > Johannes, is this something you're using _now_, and it works, or
> > something you hoped to use in the future?
>
> I did some research[1] on what parts of `--object-dir` have worked (and not
> worked) in the past, and came to the conclusion that although this
> behavior is surprising, we do bear the responsibility of continuing to
> maintain it.

Hmm. Upon thinking on in more, here is some evidence to the contrary.
The new test, specifically this snippet:

    git init repo &&
    test_when_finished "rm -fr repo" &&
    (
      cd repo &&
      test_commit base &&
      git repack -d
    ) &&

    nongit git multi-pack-index --object-dir=$(pwd)/repo/.git/objects write

will fail with GIT_TEST_DEFAULT_HASH=sha256, since the MIDX internals
settle on the hash size via `the_hash_algo` which doesn't respect the
hash algorithm used by the target repository.

And that seems like it never could have worked. Try this at your shell
to observe the failure:

    git init --object-format=sha256 repo &&
    git -C repo commit --allow-empty -m initial &&
    git -C repo repack -d &&

    git multi-pack-index write --object-dir=$(pwd)/repo/.git/objects

and get:

    error: wrong index v2 file size in
    /home/ttaylorr/repo/.git/objects/pack/pack-9f08dc78ae6f37407a5acad69e3fdf5a1887eb7da5c043a1ddedc56ea7160814.idx
    warning: failed to open pack-index
    '/home/ttaylorr/repo/.git/objects/pack/pack-9f08dc78ae6f37407a5acad69e3fdf5a1887eb7da5c043a1ddedc56ea7160814.idx'

since we're trying to open a sha256 index with the_hash_algo in
sha1-mode.

The question is do we consider this to be a bug in the existing behavior
that we should patch, or an indication that the feature shouldn't exist
in the first place?

I think that I tend to agree more with the latter, so I'm inclined to
drop support for it (where "it" is running the midx command outside of a
repository) in this series (i.e., by making the midx builtin have the
RUN_SETUP flag instead of RUN_SETUP_GENTLY).

Thoughts?

Thanks,
Taylor
Jeff King Aug. 27, 2021, 9:30 p.m. UTC | #8
On Thu, Aug 26, 2021 at 05:22:15PM -0400, Taylor Blau wrote:

> > I did some research[1] on what parts of `--object-dir` have worked (and not
> > worked) in the past, and came to the conclusion that although this
> > behavior is surprising, we do bear the responsibility of continuing to
> > maintain it.
> 
> Hmm. Upon thinking on in more, here is some evidence to the contrary.
> The new test, specifically this snippet:
> 
>     git init repo &&
>     test_when_finished "rm -fr repo" &&
>     (
>       cd repo &&
>       test_commit base &&
>       git repack -d
>     ) &&
> 
>     nongit git multi-pack-index --object-dir=$(pwd)/repo/.git/objects write
> 
> will fail with GIT_TEST_DEFAULT_HASH=sha256, since the MIDX internals
> settle on the hash size via `the_hash_algo` which doesn't respect the
> hash algorithm used by the target repository.

Yeah, I think this is a good example of the class of things that might
fail: anything that requires the repo config to behave correctly.

I do think the hash format is somewhat unusual here. Most of the changes
to the on-disk files are reflected in the files themselves (e.g., pack
index v2 is chosen by config at _write_ time, but readers can interpret
the file stand-alone).

There may be other config that could influence the writing of the midx,
and we'd skip it in this kind of non-repo setup. An example here is
repack.usedeltabaseoffset, which midx_repack() tries to respect.
Ignoring that doesn't produce a nonsense result, but it doesn't follow
what would happen if run from inside the repo.

The other class of problems I'd expect is where part of the midx
operation needs to look at other parts of the repo. Bitmap generation is
an obvious one there, since we'd want to look at refs to find the
reachable tips. Now obviously that's a new feature we're trying to
introduce here, so it can't be an existing breakage. But it does make me
wonder what other problems might be lurking.

So I dunno. Even if it mostly works now, I'm not sure it's something
that I'm all that happy about supporting going forward. It seems like a
recipe for subtle bugs where the midx code calls into other library code
that assumes that it can look at the repository struct.

-Peff
Junio C Hamano Aug. 29, 2021, 10:42 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

>> The new test, specifically this snippet:
>> 
>>     git init repo &&
>>     test_when_finished "rm -fr repo" &&
>>     (
>>       cd repo &&
>>       test_commit base &&
>>       git repack -d
>>     ) &&
>> 
>>     nongit git multi-pack-index --object-dir=$(pwd)/repo/.git/objects write
>> 
>> will fail with GIT_TEST_DEFAULT_HASH=sha256, since the MIDX internals
>> settle on the hash size via `the_hash_algo` which doesn't respect the
>> hash algorithm used by the target repository.
>
> Yeah, I think this is a good example of the class of things that might
> fail: anything that requires the repo config to behave correctly.
>
> I do think the hash format is somewhat unusual here. Most of the changes
> to the on-disk files are reflected in the files themselves (e.g., pack
> index v2 is chosen by config at _write_ time, but readers can interpret
> the file stand-alone).
>
> There may be other config that could influence the writing of the midx,
> and we'd skip it in this kind of non-repo setup. An example here is
> repack.usedeltabaseoffset, which midx_repack() tries to respect.
> Ignoring that doesn't produce a nonsense result, but it doesn't follow
> what would happen if run from inside the repo.
>
> The other class of problems I'd expect is where part of the midx
> operation needs to look at other parts of the repo. Bitmap generation is
> an obvious one there, since we'd want to look at refs to find the
> reachable tips. Now obviously that's a new feature we're trying to
> introduce here, so it can't be an existing breakage. But it does make me
> wonder what other problems might be lurking.
>
> So I dunno. Even if it mostly works now, I'm not sure it's something
> that I'm all that happy about supporting going forward. It seems like a
> recipe for subtle bugs where the midx code calls into other library code
> that assumes that it can look at the repository struct.

I tend to agree that we should first disallow things that are not
what we know we definitely need, and the non-repo setup is something
we would want to punt on to make sure we have a solid support for
the mainstream usecase.

Thanks.