diff mbox series

[08/24] pack-objects: implement `--ignore-disjoint` mode

Message ID 0368f7ab37669163b50b82185725935bde5bc946.1701198172.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series pack-objects: multi-pack verbatim reuse | expand

Commit Message

Taylor Blau Nov. 28, 2023, 7:08 p.m. UTC
Before multi-pack reachability bitmaps learn how to perform pack reuse
over the set of disjoint packs, we will need a way to generate packs
that are known to be disjoint with respect to the currently marked set
of disjoint packs.

In other words, we want a way to make a pack which does not have any
objects contained in the union of the set of packs which are currently
marked as disjoint.

There are a various ways that we could go about this, for example:

  - passing `--unpacked`, which would exclude all packed objects (and
    thus would not contain any objects from the disjoint pack)

  - passing `--stdin-packs` with the set of packs currently marked as
    disjoint as "excluded", indicating that `pack-objects` should
    discard any objects present in any of the excluded packs (thus
    producing a disjoint pack)

  - marking each of the disjoint packs as kept in-core with the
    `--keep-pack` flag, and then passing `--honor-pack-keep` to
    similarly ignore any object(s) from kept packs (thus also producing
    a pack which is disjoint with respect to the current set)

`git repack` is the main entry-point to generating a new pack, by
invoking `pack-objects` and then adding the new pack to the set of
disjoint packs if generating a new MIDX. However, `repack` has a number
of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
repacks, incremental repacks, etc.), all of which would require careful
reasoning in order to prove that the resulting set of packs is disjoint.

The most appealing option of the above would be to pass the set of
disjoint packs as kept (via `--keep-pack`) and then ignore their
contents (with `--honor-pack-keep`), doing so for all kinds of
`pack-objects` invocations. But there may be more disjoint packs than we
can easily fit into the command-line arguments.

Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
moral equivalent of marking the set of disjoint packs as kept, and
ignoring their contents, even if it would have otherwise been packed. In
fact, this similarity extends down to the implementation, where each
disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.

With this in place, we can use the kept-pack cache from 20b031fede
(packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
which looks up objects first in a cache containing just the set of kept
(in this case, disjoint) packs. Assuming that the set of disjoint packs
is a relatively small portion of the entire repository (which should be
a safe assumption to make), each object lookup will be very inexpensive.

The only place we want to avoid using `--ignore-disjoint` is in
conjunction with `--cruft`, since doing so may cause us to omit an
object which would have been included in a new cruft pack in order to
freshen it. In other words, failing to do so might cause that object to
be pruned from the repository earlier than expected.

Otherwise, `--ignore-disjoint` is compatible with most other modes of
`pack-objects`. These various combinations are tested below. As a
result, `repack` will be able to unconditionally (except for the cruft
pack) pass `--ignore-disjoint` when trying to add a new pack to the
disjoint set, and the result will be usable, without having to carefully
consider and reason about each individual case.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |   8 ++
 builtin/pack-objects.c             |  31 +++++-
 t/lib-disjoint.sh                  |  11 ++
 t/t5331-pack-objects-stdin.sh      | 156 +++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 3 deletions(-)

Comments

Patrick Steinhardt Nov. 30, 2023, 10:18 a.m. UTC | #1
On Tue, Nov 28, 2023 at 02:08:16PM -0500, Taylor Blau wrote:
> Before multi-pack reachability bitmaps learn how to perform pack reuse
> over the set of disjoint packs, we will need a way to generate packs
> that are known to be disjoint with respect to the currently marked set
> of disjoint packs.
> 
> In other words, we want a way to make a pack which does not have any
> objects contained in the union of the set of packs which are currently
> marked as disjoint.
> 
> There are a various ways that we could go about this, for example:
> 
>   - passing `--unpacked`, which would exclude all packed objects (and
>     thus would not contain any objects from the disjoint pack)
> 
>   - passing `--stdin-packs` with the set of packs currently marked as
>     disjoint as "excluded", indicating that `pack-objects` should
>     discard any objects present in any of the excluded packs (thus
>     producing a disjoint pack)
> 
>   - marking each of the disjoint packs as kept in-core with the
>     `--keep-pack` flag, and then passing `--honor-pack-keep` to
>     similarly ignore any object(s) from kept packs (thus also producing
>     a pack which is disjoint with respect to the current set)
> 
> `git repack` is the main entry-point to generating a new pack, by
> invoking `pack-objects` and then adding the new pack to the set of
> disjoint packs if generating a new MIDX. However, `repack` has a number
> of ways to invoke `pack-objects` (e.g., all-into-one repacks, geometric
> repacks, incremental repacks, etc.), all of which would require careful
> reasoning in order to prove that the resulting set of packs is disjoint.
> 
> The most appealing option of the above would be to pass the set of
> disjoint packs as kept (via `--keep-pack`) and then ignore their
> contents (with `--honor-pack-keep`), doing so for all kinds of
> `pack-objects` invocations. But there may be more disjoint packs than we
> can easily fit into the command-line arguments.
> 
> Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> moral equivalent of marking the set of disjoint packs as kept, and
> ignoring their contents, even if it would have otherwise been packed. In
> fact, this similarity extends down to the implementation, where each
> disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> 
> With this in place, we can use the kept-pack cache from 20b031fede
> (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> which looks up objects first in a cache containing just the set of kept
> (in this case, disjoint) packs. Assuming that the set of disjoint packs
> is a relatively small portion of the entire repository (which should be
> a safe assumption to make), each object lookup will be very inexpensive.

This cought me by surprise a bit. I'd have expected that in the end,
most of the packfiles in a repository would be disjoint. Using for
example geometric repacks, my expectation was that all of the packs that
get written via geometric repacking would eventually become disjoint
whereas new packs added to the repository would initially not be.

Patrick

> The only place we want to avoid using `--ignore-disjoint` is in
> conjunction with `--cruft`, since doing so may cause us to omit an
> object which would have been included in a new cruft pack in order to
> freshen it. In other words, failing to do so might cause that object to
> be pruned from the repository earlier than expected.
> 
> Otherwise, `--ignore-disjoint` is compatible with most other modes of
> `pack-objects`. These various combinations are tested below. As a
> result, `repack` will be able to unconditionally (except for the cruft
> pack) pass `--ignore-disjoint` when trying to add a new pack to the
> disjoint set, and the result will be usable, without having to carefully
> consider and reason about each individual case.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-pack-objects.txt |   8 ++
>  builtin/pack-objects.c             |  31 +++++-
>  t/lib-disjoint.sh                  |  11 ++
>  t/t5331-pack-objects-stdin.sh      | 156 +++++++++++++++++++++++++++++
>  4 files changed, 203 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index e32404c6aa..592c4ce742 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -96,6 +96,14 @@ base-name::
>  Incompatible with `--revs`, or options that imply `--revs` (such as
>  `--all`), with the exception of `--unpacked`, which is compatible.
>  
> +--ignore-disjoint::
> +	This flag causes an object that appears in any pack marked as
> +	"disjoint" by the multi-pack index to be ignored, even if it
> +	would have otherwise been packed. When used with
> +	`--stdin-packs`, objects from disjoint packs may be included if
> +	and only if a disjoint pack is explicitly given as an input pack
> +	to `--stdin-packs`. Incompatible with `--cruft`.
> +
>  --cruft::
>  	Packs unreachable objects into a separate "cruft" pack, denoted
>  	by the existence of a `.mtimes` file. Typically used by `git
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index bfa60359d4..107154db34 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -207,6 +207,7 @@ static int have_non_local_packs;
>  static int incremental;
>  static int ignore_packed_keep_on_disk;
>  static int ignore_packed_keep_in_core;
> +static int ignore_midx_disjoint_packs;
>  static int allow_ofs_delta;
>  static struct pack_idx_option pack_idx_opts;
>  static const char *base_name;
> @@ -1403,7 +1404,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
>  	/*
>  	 * Then handle .keep first, as we have a fast(er) path there.
>  	 */
> -	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
> +	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
> +	    ignore_midx_disjoint_packs) {
>  		/*
>  		 * Set the flags for the kept-pack cache to be the ones we want
>  		 * to ignore.
> @@ -1415,7 +1417,7 @@ static int want_found_object(const struct object_id *oid, int exclude,
>  		unsigned flags = 0;
>  		if (ignore_packed_keep_on_disk)
>  			flags |= ON_DISK_KEEP_PACKS;
> -		if (ignore_packed_keep_in_core)
> +		if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
>  			flags |= IN_CORE_KEEP_PACKS;
>  
>  		if (ignore_packed_keep_on_disk && p->pack_keep)
> @@ -3389,6 +3391,7 @@ static void read_packs_list_from_stdin(void)
>  			die(_("could not find pack '%s'"), item->string);
>  		if (!is_pack_valid(p))
>  			die(_("packfile %s cannot be accessed"), p->pack_name);
> +		p->pack_keep_in_core = 0;
>  	}
>  
>  	/*
> @@ -4266,6 +4269,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  			 N_("create packs suitable for shallow fetches")),
>  		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
>  			 N_("ignore packs that have companion .keep file")),
> +		OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
> +			 N_("ignore packs that are marked disjoint in the MIDX")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>  				N_("ignore this pack")),
>  		OPT_INTEGER(0, "compression", &pack_compression_level,
> @@ -4412,7 +4417,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		if (use_internal_rev_list)
>  			die(_("cannot use internal rev list with --cruft"));
>  		if (stdin_packs)
> -			die(_("cannot use --stdin-packs with --cruft"));
> +			die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
> +		if (ignore_midx_disjoint_packs)
> +			die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
>  	}
>  
>  	/*
> @@ -4452,6 +4459,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		if (!p) /* no keep-able packs found */
>  			ignore_packed_keep_on_disk = 0;
>  	}
> +	if (ignore_midx_disjoint_packs) {
> +		struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +		struct bitmapped_pack pack;
> +		unsigned any_disjoint = 0;
> +		uint32_t i;
> +
> +		for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
> +			if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
> +				die(_("could not load bitmapped pack %i"), i);
> +			if (pack.disjoint) {
> +				pack.p->pack_keep_in_core = 1;
> +				any_disjoint = 1;
> +			}
> +		}
> +
> +		if (!any_disjoint) /* no disjoint packs to ignore */
> +			ignore_midx_disjoint_packs = 0;
> +	}
>  	if (local) {
>  		/*
>  		 * unlike ignore_packed_keep_on_disk above, we do not
> diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
> index c6c6e74aba..c802ca6940 100644
> --- a/t/lib-disjoint.sh
> +++ b/t/lib-disjoint.sh
> @@ -36,3 +36,14 @@ test_must_be_disjoint () {
>  test_must_not_be_disjoint () {
>  	test_disjoint_1 "$1" "no"
>  }
> +
> +# packed_contents </path/to/pack-$XYZ.idx [...]>
> +#
> +# Prints the set of objects packed in the given pack indexes.
> +packed_contents () {
> +	for idx in "$@"
> +	do
> +		git show-index <$idx || return 1
> +	done >tmp &&
> +	cut -d" " -f2 <tmp | sort -u
> +}
> diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
> index 2dcf1eecee..e522aa3f7d 100755
> --- a/t/t5331-pack-objects-stdin.sh
> +++ b/t/t5331-pack-objects-stdin.sh
> @@ -6,6 +6,7 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
>  
>  TEST_PASSES_SANITIZE_LEAK=true
>  . ./test-lib.sh
> +. "$TEST_DIRECTORY"/lib-disjoint.sh
>  
>  packed_objects () {
>  	git show-index <"$1" >tmp-object-list &&
> @@ -237,4 +238,159 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate
>  	test_cmp expected-objects actual-objects
>  '
>  
> +objdir=.git/objects
> +packdir=$objdir/pack
> +
> +test_expect_success 'loose objects also in disjoint packs are ignored' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		# create a pack containing the objects in each commit below, but
> +		# do not delete their loose copies
> +		test_commit base &&
> +		base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
> +
> +		test_commit other &&
> +		other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$base_pack.idx
> +		+pack-$other_pack.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_commit more &&
> +		out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
> +
> +		# gather all objects in "all", and objects from the disjoint
> +		# pack in "disjoint"
> +		git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
> +		packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
> +
> +		# make sure that the set of objects we just generated matches
> +		# "all \ disjoint"
> +		packed_contents "$packdir/pack-$out.idx" >got &&
> +		comm -23 all disjoint >want &&
> +		test_cmp want got
> +	)
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for c in A B
> +		do
> +			test_commit "$c" || return 1
> +		done &&
> +
> +		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> +		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$A.idx
> +		+pack-$B.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_not_be_disjoint "pack-$A.pack" &&
> +		test_must_be_disjoint "pack-$B.pack" &&
> +
> +		test_commit C &&
> +
> +		got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +
> +		git rev-list --objects --no-object-names B..C >expect.raw &&
> +		sort <expect.raw >expect &&
> +
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
> +	# Create objects in three separate packs:
> +	#
> +	#   - pack A (midx, non disjoint)
> +	#   - pack B (midx, disjoint)
> +	#   - pack C (non-midx)
> +	#
> +	# Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
> +	# including packs A, B, and C. The resulting pack should contain
> +	# only the objects from packs A, and C, excluding those from
> +	# pack B as it is marked as disjoint.
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for c in A B C
> +		do
> +			test_commit "$c" || return 1
> +		done &&
> +
> +		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
> +		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
> +		C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
> +
> +		cat >in <<-EOF &&
> +		pack-$A.idx
> +		+pack-$B.idx
> +		EOF
> +		git multi-pack-index write --stdin-packs --bitmap <in &&
> +
> +		test_must_not_be_disjoint "pack-$A.pack" &&
> +		test_must_be_disjoint "pack-$B.pack" &&
> +
> +		# Generate a pack with `--stdin-packs` using packs "A" and "C",
> +		# but excluding objects from "B". The objects from pack "B" are
> +		# expected to be omitted from the generated pack for two
> +		# reasons:
> +		#
> +		#   - because it was specified as a negated tip via
> +		#     `--stdin-packs`
> +		#   - because it is a disjoint pack.
> +		cat >in <<-EOF &&
> +		pack-$A.pack
> +		^pack-$B.pack
> +		pack-$C.pack
> +		EOF
> +		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +		packed_contents "$packdir/pack-$A.idx" \
> +				"$packdir/pack-$C.idx" >expect &&
> +		test_cmp expect actual &&
> +
> +		# Generate another pack with `--stdin-packs`, this time
> +		# using packs "B" and "C". The objects from pack "B" are
> +		# expected to be in the final pack, despite it being a
> +		# disjoint pack, because "B" was mentioned explicitly
> +		# via `stdin-packs`.
> +		cat >in <<-EOF &&
> +		pack-$B.pack
> +		pack-$C.pack
> +		EOF
> +		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
> +
> +		packed_contents "$packdir/pack-$got.idx" >actual &&
> +		packed_contents "$packdir/pack-$B.idx" \
> +				"$packdir/pack-$C.idx" >expect &&
> +		test_cmp expect actual
> +	)
> +'
> +
> +test_expect_success '--cruft is incompatible with --ignore-disjoint' '
> +	test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
> +		</dev/null >/dev/null 2>actual &&
> +	cat >expect <<-\EOF &&
> +	fatal: cannot use --ignore-disjoint with --cruft
> +	EOF
> +	test_cmp expect actual
> +'
> +
>  test_done
> -- 
> 2.43.0.24.g980b318f98
>
Taylor Blau Nov. 30, 2023, 7:32 p.m. UTC | #2
On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > moral equivalent of marking the set of disjoint packs as kept, and
> > ignoring their contents, even if it would have otherwise been packed. In
> > fact, this similarity extends down to the implementation, where each
> > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> >
> > With this in place, we can use the kept-pack cache from 20b031fede
> > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > which looks up objects first in a cache containing just the set of kept
> > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > is a relatively small portion of the entire repository (which should be
> > a safe assumption to make), each object lookup will be very inexpensive.
>
> This cought me by surprise a bit. I'd have expected that in the end,
> most of the packfiles in a repository would be disjoint. Using for
> example geometric repacks, my expectation was that all of the packs that
> get written via geometric repacking would eventually become disjoint
> whereas new packs added to the repository would initially not be.

Which part are you referring to here? If you're referring to the part
where I say that the set of disjoint packs is relatively small in
proposition to the rest of the packs, I think I know where the confusion
is.

I'm not saying that the set of disjoint packs is small in comparison to
the rest of the repository by object count, but rather by count of packs
overall. You're right that packs from pushes will not be guaranteed to
be disjoint upon entering the repository, but will become disjoint when
geometrically repacked (assuming that the caller uses --ignore-disjoint
when repacking).

Thanks,
Taylor
Patrick Steinhardt Dec. 1, 2023, 8:17 a.m. UTC | #3
On Thu, Nov 30, 2023 at 02:32:24PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > > moral equivalent of marking the set of disjoint packs as kept, and
> > > ignoring their contents, even if it would have otherwise been packed. In
> > > fact, this similarity extends down to the implementation, where each
> > > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> > >
> > > With this in place, we can use the kept-pack cache from 20b031fede
> > > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > > which looks up objects first in a cache containing just the set of kept
> > > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > > is a relatively small portion of the entire repository (which should be
> > > a safe assumption to make), each object lookup will be very inexpensive.
> >
> > This cought me by surprise a bit. I'd have expected that in the end,
> > most of the packfiles in a repository would be disjoint. Using for
> > example geometric repacks, my expectation was that all of the packs that
> > get written via geometric repacking would eventually become disjoint
> > whereas new packs added to the repository would initially not be.
> 
> Which part are you referring to here? If you're referring to the part
> where I say that the set of disjoint packs is relatively small in
> proposition to the rest of the packs, I think I know where the confusion
> is.

Yeah, that's what I was referring to.

> I'm not saying that the set of disjoint packs is small in comparison to
> the rest of the repository by object count, but rather by count of packs
> overall. You're right that packs from pushes will not be guaranteed to
> be disjoint upon entering the repository, but will become disjoint when
> geometrically repacked (assuming that the caller uses --ignore-disjoint
> when repacking).

I was actually thinking about it in the number of packfiles, not number
of objects. I'm mostly coming from the angle of geometric repacking
here, where it is totally expected that you have a comparatively large
number of packfiles when your repository is big. With a geometric factor
of 2, you'll have up to `log2($numobjects)` many packfiles in your repo
while keeping the geometric sequence intact.

In something like linux.git with almost 10M objects that boils down to
23 packfiles, and I'd assume that all of these would be disjoint in the
best case. So if you gain new packfiles by pushing into the repository
then I'd think that it's quite likely that the number of non-disjoint
packfiles is smaller than the number of disjoint ones.

I do realize though that in absolute numbers, this isn't all that many.
I was also thinking ahead though to a future where we have something
like geometric repacking with maximum packfile sizes working well
together so that we'll be able to merge packfiles together until they
reach a certain maximum size, and afterwards they are just left alone.
This would help to avoid those "surprise" repack cases where everything
is again packed into a single packfile for the biggest repositories out
there. But it would of course also lead to an increase in packfiles in
those huge repositories.

Anyway, I feel like I'm rambling. In the end it's probably going to be
fine, I was simply surprised by your assumption that the number of
disjoint packfiles should usually be much smaller than the number of
non-disjoint ones.

Patrick
Taylor Blau Dec. 1, 2023, 7:58 p.m. UTC | #4
On Fri, Dec 01, 2023 at 09:17:38AM +0100, Patrick Steinhardt wrote:
> In something like linux.git with almost 10M objects that boils down to
> 23 packfiles, and I'd assume that all of these would be disjoint in the
> best case. So if you gain new packfiles by pushing into the repository
> then I'd think that it's quite likely that the number of non-disjoint
> packfiles is smaller than the number of disjoint ones.

Right, although if you have 10M objects over 23 packs with a geometric
repacking factor of two, the last pack should have just around a single
object in it. In other words, as soon as you receive a push, your
geometric progression will collapse into a single pack.

So having a repository with 10M objects split across 23 packs is a
relatively short-lived state. And in general we should only be in that
state every time a repository doubles (again, assuming a factor of two).

In that sense, I'd expect relatively few packs to be disjoint, and for
each of those packs to have a relatively large number of objects,
accounting for most of the non-recent parts of the repository.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aa..592c4ce742 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -96,6 +96,14 @@  base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--ignore-disjoint::
+	This flag causes an object that appears in any pack marked as
+	"disjoint" by the multi-pack index to be ignored, even if it
+	would have otherwise been packed. When used with
+	`--stdin-packs`, objects from disjoint packs may be included if
+	and only if a disjoint pack is explicitly given as an input pack
+	to `--stdin-packs`. Incompatible with `--cruft`.
+
 --cruft::
 	Packs unreachable objects into a separate "cruft" pack, denoted
 	by the existence of a `.mtimes` file. Typically used by `git
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bfa60359d4..107154db34 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -207,6 +207,7 @@  static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep_on_disk;
 static int ignore_packed_keep_in_core;
+static int ignore_midx_disjoint_packs;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -1403,7 +1404,8 @@  static int want_found_object(const struct object_id *oid, int exclude,
 	/*
 	 * Then handle .keep first, as we have a fast(er) path there.
 	 */
-	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core ||
+	    ignore_midx_disjoint_packs) {
 		/*
 		 * Set the flags for the kept-pack cache to be the ones we want
 		 * to ignore.
@@ -1415,7 +1417,7 @@  static int want_found_object(const struct object_id *oid, int exclude,
 		unsigned flags = 0;
 		if (ignore_packed_keep_on_disk)
 			flags |= ON_DISK_KEEP_PACKS;
-		if (ignore_packed_keep_in_core)
+		if (ignore_packed_keep_in_core || ignore_midx_disjoint_packs)
 			flags |= IN_CORE_KEEP_PACKS;
 
 		if (ignore_packed_keep_on_disk && p->pack_keep)
@@ -3389,6 +3391,7 @@  static void read_packs_list_from_stdin(void)
 			die(_("could not find pack '%s'"), item->string);
 		if (!is_pack_valid(p))
 			die(_("packfile %s cannot be accessed"), p->pack_name);
+		p->pack_keep_in_core = 0;
 	}
 
 	/*
@@ -4266,6 +4269,8 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
 			 N_("ignore packs that have companion .keep file")),
+		OPT_BOOL(0, "ignore-disjoint", &ignore_midx_disjoint_packs,
+			 N_("ignore packs that are marked disjoint in the MIDX")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -4412,7 +4417,9 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (use_internal_rev_list)
 			die(_("cannot use internal rev list with --cruft"));
 		if (stdin_packs)
-			die(_("cannot use --stdin-packs with --cruft"));
+			die(_("cannot use %s with %s"), "--stdin-packs", "--cruft");
+		if (ignore_midx_disjoint_packs)
+			die(_("cannot use %s with %s"), "--ignore-disjoint", "--cruft");
 	}
 
 	/*
@@ -4452,6 +4459,24 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!p) /* no keep-able packs found */
 			ignore_packed_keep_on_disk = 0;
 	}
+	if (ignore_midx_disjoint_packs) {
+		struct multi_pack_index *m = get_multi_pack_index(the_repository);
+		struct bitmapped_pack pack;
+		unsigned any_disjoint = 0;
+		uint32_t i;
+
+		for (i = 0; m && m->chunk_disjoint_packs && i < m->num_packs; i++) {
+			if (nth_bitmapped_pack(the_repository, m, &pack, i) < 0)
+				die(_("could not load bitmapped pack %i"), i);
+			if (pack.disjoint) {
+				pack.p->pack_keep_in_core = 1;
+				any_disjoint = 1;
+			}
+		}
+
+		if (!any_disjoint) /* no disjoint packs to ignore */
+			ignore_midx_disjoint_packs = 0;
+	}
 	if (local) {
 		/*
 		 * unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/lib-disjoint.sh b/t/lib-disjoint.sh
index c6c6e74aba..c802ca6940 100644
--- a/t/lib-disjoint.sh
+++ b/t/lib-disjoint.sh
@@ -36,3 +36,14 @@  test_must_be_disjoint () {
 test_must_not_be_disjoint () {
 	test_disjoint_1 "$1" "no"
 }
+
+# packed_contents </path/to/pack-$XYZ.idx [...]>
+#
+# Prints the set of objects packed in the given pack indexes.
+packed_contents () {
+	for idx in "$@"
+	do
+		git show-index <$idx || return 1
+	done >tmp &&
+	cut -d" " -f2 <tmp | sort -u
+}
diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
index 2dcf1eecee..e522aa3f7d 100755
--- a/t/t5331-pack-objects-stdin.sh
+++ b/t/t5331-pack-objects-stdin.sh
@@ -6,6 +6,7 @@  export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
 TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-disjoint.sh
 
 packed_objects () {
 	git show-index <"$1" >tmp-object-list &&
@@ -237,4 +238,159 @@  test_expect_success 'pack-objects --stdin with packfiles from main and alternate
 	test_cmp expected-objects actual-objects
 '
 
+objdir=.git/objects
+packdir=$objdir/pack
+
+test_expect_success 'loose objects also in disjoint packs are ignored' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# create a pack containing the objects in each commit below, but
+		# do not delete their loose copies
+		test_commit base &&
+		base_pack="$(echo base | git pack-objects --revs $packdir/pack)" &&
+
+		test_commit other &&
+		other_pack="$(echo base..other | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$base_pack.idx
+		+pack-$other_pack.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_commit more &&
+		out="$(git pack-objects --all --ignore-disjoint $packdir/pack)" &&
+
+		# gather all objects in "all", and objects from the disjoint
+		# pack in "disjoint"
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >all &&
+		packed_contents "$packdir/pack-$other_pack.idx" >disjoint &&
+
+		# make sure that the set of objects we just generated matches
+		# "all \ disjoint"
+		packed_contents "$packdir/pack-$out.idx" >got &&
+		comm -23 all disjoint >want &&
+		test_cmp want got
+	)
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--unpacked)' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for c in A B
+		do
+			test_commit "$c" || return 1
+		done &&
+
+		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+
+		test_commit C &&
+
+		got="$(git pack-objects --all --unpacked --ignore-disjoint $packdir/pack)" &&
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+
+		git rev-list --objects --no-object-names B..C >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'objects in disjoint packs are ignored (--stdin-packs)' '
+	# Create objects in three separate packs:
+	#
+	#   - pack A (midx, non disjoint)
+	#   - pack B (midx, disjoint)
+	#   - pack C (non-midx)
+	#
+	# Then create a new pack with `--stdin-packs` and `--ignore-disjoint`
+	# including packs A, B, and C. The resulting pack should contain
+	# only the objects from packs A, and C, excluding those from
+	# pack B as it is marked as disjoint.
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		for c in A B C
+		do
+			test_commit "$c" || return 1
+		done &&
+
+		A="$(echo "A" | git pack-objects --revs $packdir/pack)" &&
+		B="$(echo "A..B" | git pack-objects --revs $packdir/pack)" &&
+		C="$(echo "B..C" | git pack-objects --revs $packdir/pack)" &&
+
+		cat >in <<-EOF &&
+		pack-$A.idx
+		+pack-$B.idx
+		EOF
+		git multi-pack-index write --stdin-packs --bitmap <in &&
+
+		test_must_not_be_disjoint "pack-$A.pack" &&
+		test_must_be_disjoint "pack-$B.pack" &&
+
+		# Generate a pack with `--stdin-packs` using packs "A" and "C",
+		# but excluding objects from "B". The objects from pack "B" are
+		# expected to be omitted from the generated pack for two
+		# reasons:
+		#
+		#   - because it was specified as a negated tip via
+		#     `--stdin-packs`
+		#   - because it is a disjoint pack.
+		cat >in <<-EOF &&
+		pack-$A.pack
+		^pack-$B.pack
+		pack-$C.pack
+		EOF
+		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+		packed_contents "$packdir/pack-$A.idx" \
+				"$packdir/pack-$C.idx" >expect &&
+		test_cmp expect actual &&
+
+		# Generate another pack with `--stdin-packs`, this time
+		# using packs "B" and "C". The objects from pack "B" are
+		# expected to be in the final pack, despite it being a
+		# disjoint pack, because "B" was mentioned explicitly
+		# via `stdin-packs`.
+		cat >in <<-EOF &&
+		pack-$B.pack
+		pack-$C.pack
+		EOF
+		got="$(git pack-objects --stdin-packs --ignore-disjoint $packdir/pack <in)" &&
+
+		packed_contents "$packdir/pack-$got.idx" >actual &&
+		packed_contents "$packdir/pack-$B.idx" \
+				"$packdir/pack-$C.idx" >expect &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--cruft is incompatible with --ignore-disjoint' '
+	test_must_fail git pack-objects --cruft --ignore-disjoint --stdout \
+		</dev/null >/dev/null 2>actual &&
+	cat >expect <<-\EOF &&
+	fatal: cannot use --ignore-disjoint with --cruft
+	EOF
+	test_cmp expect actual
+'
+
 test_done