diff mbox series

repack: respect --keep-pack with geometric repack

Message ID pull.1235.git.1653064572170.gitgitgadget@gmail.com (mailing list archive)
State New
Headers show
Series repack: respect --keep-pack with geometric repack | expand

Commit Message

Victoria Dye May 20, 2022, 4:36 p.m. UTC
From: Victoria Dye <vdye@github.com>

Update 'repack' to ignore packs named on the command line with the
'--keep-pack' option. Specifically, modify 'init_pack_geometry()' to treat
command line-kept packs the same way it treats packs with an on-disk '.keep'
file (that is, skip the pack and do not include it in the 'geometry'
structure).

Without this handling, a '--keep-pack' pack would be included in the
'geometry' structure. If the pack is *before* the geometry split line (with
at least one other pack and/or loose objects present), 'repack' assumes the
pack's contents are "rolled up" into another pack via 'pack-objects'.
However, because the internally-invoked 'pack-objects' properly excludes
'--keep-pack' objects, any new pack it creates will not contain the kept
objects. Finally, 'repack' deletes the '--keep-pack' as "redundant" (since
it assumes 'pack-objects' created a new pack with its contents), resulting
in possible object loss and repository corruption.

Add a test ensuring that '--keep-pack' packs are now appropriately handled.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Victoria Dye <vdye@github.com>
---
    repack: respect --keep-pack with geometric repack

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1235%2Fvdye%2Fbugfix%2Frepack-kept-pack-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1235/vdye/bugfix/repack-kept-pack-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1235

 builtin/repack.c            | 40 +++++++++++++++++++++++++++----------
 t/t7703-repack-geometric.sh | 34 +++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 11 deletions(-)


base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d

Comments

Taylor Blau May 20, 2022, 5 p.m. UTC | #1
Hi Victoria,

On Fri, May 20, 2022 at 04:36:12PM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
>
> Update 'repack' to ignore packs named on the command line with the
> '--keep-pack' option. Specifically, modify 'init_pack_geometry()' to treat
> command line-kept packs the same way it treats packs with an on-disk '.keep'
> file (that is, skip the pack and do not include it in the 'geometry'
> structure).
>
> Without this handling, a '--keep-pack' pack would be included in the
> 'geometry' structure. If the pack is *before* the geometry split line (with
> at least one other pack and/or loose objects present), 'repack' assumes the
> pack's contents are "rolled up" into another pack via 'pack-objects'.
> However, because the internally-invoked 'pack-objects' properly excludes
> '--keep-pack' objects, any new pack it creates will not contain the kept
> objects. Finally, 'repack' deletes the '--keep-pack' as "redundant" (since
> it assumes 'pack-objects' created a new pack with its contents), resulting
> in possible object loss and repository corruption.

Nicely found and explained. Having discussed this fix with you already
off-list, this approach (to treat kept packs as excluded from the list
of packs in the `geometry` structure regardless of whether they are kept
on disk or in-core) makes sense to me.

I left a couple of small notes on the patch below, but since I have some
patches that deal with a separate issue in the `git repack --geometric`
code coming, do you want to combine forces (and I can send a
lightly-reworked version of this patch as a part of my series)?

> @@ -332,17 +332,34 @@ static int geometry_cmp(const void *va, const void *vb)
>  	return 0;
>  }
>
> -static void init_pack_geometry(struct pack_geometry **geometry_p)
> +static void init_pack_geometry(struct pack_geometry **geometry_p,
> +			       struct string_list *existing_kept_packs)
>  {
>  	struct packed_git *p;
>  	struct pack_geometry *geometry;
> +	struct strbuf buf = STRBUF_INIT;
>
>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
>  	geometry = *geometry_p;
>
> +	string_list_sort(existing_kept_packs);

Would it be worth sorting this as early as in collect_pack_filenames()?
For our purposes in this patch, this works as-is, but it may be
defensive to try and minimize the time that list has unsorted contents.

>  	for (p = get_all_packs(the_repository); p; p = p->next) {
> -		if (!pack_kept_objects && p->pack_keep)
> -			continue;
> +		if (!pack_kept_objects) {
> +			if (p->pack_keep)
> +				continue;

(You mentioned this to me off-list, but I'll repeat it here since it
wasn't obvious to me on first read): this check for `p->pack_keep` isn't
strictly necessary, since any packs that have their `pack_keep` bit set
will appear in the `existing_kept_packs` list.

But it does give us a fast path to avoid having to check that list, so
it's worth checking that bit to avoid a slightly more expensive check
where possible.

> +			/*
> +			 * The pack may be kept via the --keep-pack option;
> +			 * check 'existing_kept_packs' to determine whether to
> +			 * ignore it.
> +			 */
> +			strbuf_reset(&buf);
> +			strbuf_addstr(&buf, pack_basename(p));
> +			strbuf_strip_suffix(&buf, ".pack");
> +
> +			if (string_list_has_string(existing_kept_packs, buf.buf))
> +				continue;

It's too bad that we have to do this check at all, and can't rely on the
`pack_keep_in_core` in the same way as we check `p->pack_keep`. But
lifting that restriction is a more invasive change, so I'm happy to
rely on the contents of existing_kept_packs here in the meantime.

> +		}
>
>  		ALLOC_GROW(geometry->pack,
>  			   geometry->pack_nr + 1,
> @@ -353,6 +370,7 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
>  	}
>
>  	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
> +	strbuf_release(&buf);
>  }
>
>  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> @@ -714,17 +732,20 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		strbuf_release(&path);
>  	}
>
> +	packdir = mkpathdup("%s/pack", get_object_directory());
> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
> +
> +	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
> +			       &keep_pack_list);
> +

Makes sense; we have to initialize existing_kept_packs before arranging
the list of packs for the `--geometric` split. And presumably
`collect_pack_filenames()` relies on `packdir`, `packtmp_name`, and
`packtmp` being setup ahead of time, too.

>  	if (geometric_factor) {
>  		if (pack_everything)
>  			die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
> -		init_pack_geometry(&geometry);
> +		init_pack_geometry(&geometry, &existing_kept_packs);
>  		split_pack_geometry(geometry, geometric_factor);
>  	}
>
> -	packdir = mkpathdup("%s/pack", get_object_directory());
> -	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> -	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
> -
>  	sigchain_push_common(remove_pack_on_signal);
>
>  	prepare_pack_objects(&cmd, &po_args);
> @@ -764,9 +785,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	if (use_delta_islands)
>  		strvec_push(&cmd.args, "--delta-islands");
>
> -	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
> -			       &keep_pack_list);
> -
>  	if (pack_everything & ALL_INTO_ONE) {
>  		repack_promisor_objects(&po_args, &names);
>
> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index bdbbcbf1eca..f5ac23413d5 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -180,6 +180,40 @@ test_expect_success '--geometric ignores kept packs' '
>  	)
>  '
>
> +test_expect_success '--geometric ignores --keep-pack packs' '
> +	git init geometric &&
> +	test_when_finished "rm -fr geometric" &&
> +	(
> +		cd geometric &&
> +
> +		# Create two equal-sized packs
> +		test_commit kept && # 3 objects
> +		test_commit pack && # 3 objects
> +
> +		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
> +		refs/tags/kept
> +		EOF
> +		) &&
> +		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
> +		refs/tags/pack
> +		^refs/tags/kept
> +		EOF
> +		) &&

Nit; we don't care about the name of $PACK, so it would probably be fine
to avoid storing the `PACK` variable. We could write these packs with
just `git repack -d` after each `test_commit` (which would avoid us
having to call `prune-packed`).

Does it matter which one is kept? I don't think so, since AFAICT the
critical bit is that we mark one of the packs being rolled up as a
`--keep-pack`.

> +		# Prune loose objects that are now packed into PACK and KEEP
> +		git prune-packed &&
> +
> +		git repack --geometric 2 -dm --keep-pack=pack-$KEPT.pack >out &&
> +
> +		# Packs should not have changed (only one non-kept pack, no
> +		# loose objects), but midx should now exist.
> +		test_i18ngrep "Nothing new to pack" out &&

Nit; test_i18ngrep here should just be "grep".

> +		test_path_is_file $midx &&
> +		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
> +		git fsck
> +	)


Thanks,
Taylor
Victoria Dye May 20, 2022, 5:27 p.m. UTC | #2
Taylor Blau wrote:
> Hi Victoria,
> 
> On Fri, May 20, 2022 at 04:36:12PM +0000, Victoria Dye via GitGitGadget wrote:
>> From: Victoria Dye <vdye@github.com>
>>
>> Update 'repack' to ignore packs named on the command line with the
>> '--keep-pack' option. Specifically, modify 'init_pack_geometry()' to treat
>> command line-kept packs the same way it treats packs with an on-disk '.keep'
>> file (that is, skip the pack and do not include it in the 'geometry'
>> structure).
>>
>> Without this handling, a '--keep-pack' pack would be included in the
>> 'geometry' structure. If the pack is *before* the geometry split line (with
>> at least one other pack and/or loose objects present), 'repack' assumes the
>> pack's contents are "rolled up" into another pack via 'pack-objects'.
>> However, because the internally-invoked 'pack-objects' properly excludes
>> '--keep-pack' objects, any new pack it creates will not contain the kept
>> objects. Finally, 'repack' deletes the '--keep-pack' as "redundant" (since
>> it assumes 'pack-objects' created a new pack with its contents), resulting
>> in possible object loss and repository corruption.
> 
> Nicely found and explained. Having discussed this fix with you already
> off-list, this approach (to treat kept packs as excluded from the list
> of packs in the `geometry` structure regardless of whether they are kept
> on disk or in-core) makes sense to me.
> 
> I left a couple of small notes on the patch below, but since I have some
> patches that deal with a separate issue in the `git repack --geometric`
> code coming, do you want to combine forces (and I can send a
> lightly-reworked version of this patch as a part of my series)?
> 

Works for me! I'm happy with all the suggested changes you noted below
(moving the 'string_list_sort' and cleaning up the test), so feel free to
include them in your series.

Thanks!

>> @@ -332,17 +332,34 @@ static int geometry_cmp(const void *va, const void *vb)
>>  	return 0;
>>  }
>>
>> -static void init_pack_geometry(struct pack_geometry **geometry_p)
>> +static void init_pack_geometry(struct pack_geometry **geometry_p,
>> +			       struct string_list *existing_kept_packs)
>>  {
>>  	struct packed_git *p;
>>  	struct pack_geometry *geometry;
>> +	struct strbuf buf = STRBUF_INIT;
>>
>>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
>>  	geometry = *geometry_p;
>>
>> +	string_list_sort(existing_kept_packs);
> 
> Would it be worth sorting this as early as in collect_pack_filenames()?
> For our purposes in this patch, this works as-is, but it may be
> defensive to try and minimize the time that list has unsorted contents.
> 

I went back and forth on this, eventually settling on this to keep the
'string_list_sort' as close as possible to where the sorted list is needed.
I'm still pretty indifferent, though, so moving it to the end of
'collect_pack_filenames()' is fine with me.

>>  	for (p = get_all_packs(the_repository); p; p = p->next) {
>> -		if (!pack_kept_objects && p->pack_keep)
>> -			continue;
>> +		if (!pack_kept_objects) {
>> +			if (p->pack_keep)
>> +				continue;
> 
> (You mentioned this to me off-list, but I'll repeat it here since it
> wasn't obvious to me on first read): this check for `p->pack_keep` isn't
> strictly necessary, since any packs that have their `pack_keep` bit set
> will appear in the `existing_kept_packs` list.
> 
> But it does give us a fast path to avoid having to check that list, so
> it's worth checking that bit to avoid a slightly more expensive check
> where possible.
> 
>> +			/*
>> +			 * The pack may be kept via the --keep-pack option;
>> +			 * check 'existing_kept_packs' to determine whether to
>> +			 * ignore it.
>> +			 */
>> +			strbuf_reset(&buf);
>> +			strbuf_addstr(&buf, pack_basename(p));
>> +			strbuf_strip_suffix(&buf, ".pack");
>> +
>> +			if (string_list_has_string(existing_kept_packs, buf.buf))
>> +				continue;
> 
> It's too bad that we have to do this check at all, and can't rely on the
> `pack_keep_in_core` in the same way as we check `p->pack_keep`. But
> lifting that restriction is a more invasive change, so I'm happy to
> rely on the contents of existing_kept_packs here in the meantime.
> 
>> +		}
>>
>>  		ALLOC_GROW(geometry->pack,
>>  			   geometry->pack_nr + 1,
>> @@ -353,6 +370,7 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
>>  	}
>>
>>  	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
>> +	strbuf_release(&buf);
>>  }
>>
>>  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
>> @@ -714,17 +732,20 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  		strbuf_release(&path);
>>  	}
>>
>> +	packdir = mkpathdup("%s/pack", get_object_directory());
>> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
>> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>> +
>> +	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
>> +			       &keep_pack_list);
>> +
> 
> Makes sense; we have to initialize existing_kept_packs before arranging
> the list of packs for the `--geometric` split. And presumably
> `collect_pack_filenames()` relies on `packdir`, `packtmp_name`, and
> `packtmp` being setup ahead of time, too.
> 
>>  	if (geometric_factor) {
>>  		if (pack_everything)
>>  			die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
>> -		init_pack_geometry(&geometry);
>> +		init_pack_geometry(&geometry, &existing_kept_packs);
>>  		split_pack_geometry(geometry, geometric_factor);
>>  	}
>>
>> -	packdir = mkpathdup("%s/pack", get_object_directory());
>> -	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
>> -	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>> -
>>  	sigchain_push_common(remove_pack_on_signal);
>>
>>  	prepare_pack_objects(&cmd, &po_args);
>> @@ -764,9 +785,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  	if (use_delta_islands)
>>  		strvec_push(&cmd.args, "--delta-islands");
>>
>> -	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
>> -			       &keep_pack_list);
>> -
>>  	if (pack_everything & ALL_INTO_ONE) {
>>  		repack_promisor_objects(&po_args, &names);
>>
>> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
>> index bdbbcbf1eca..f5ac23413d5 100755
>> --- a/t/t7703-repack-geometric.sh
>> +++ b/t/t7703-repack-geometric.sh
>> @@ -180,6 +180,40 @@ test_expect_success '--geometric ignores kept packs' '
>>  	)
>>  '
>>
>> +test_expect_success '--geometric ignores --keep-pack packs' '
>> +	git init geometric &&
>> +	test_when_finished "rm -fr geometric" &&
>> +	(
>> +		cd geometric &&
>> +
>> +		# Create two equal-sized packs
>> +		test_commit kept && # 3 objects
>> +		test_commit pack && # 3 objects
>> +
>> +		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
>> +		refs/tags/kept
>> +		EOF
>> +		) &&
>> +		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
>> +		refs/tags/pack
>> +		^refs/tags/kept
>> +		EOF
>> +		) &&
> 
> Nit; we don't care about the name of $PACK, so it would probably be fine
> to avoid storing the `PACK` variable. We could write these packs with
> just `git repack -d` after each `test_commit` (which would avoid us
> having to call `prune-packed`).
> 

Makes sense.

> Does it matter which one is kept? I don't think so, since AFAICT the
> critical bit is that we mark one of the packs being rolled up as a
> `--keep-pack`.
> 

Correct, the two packs in this test are just two same-sized (or, more
generally, non-geometrically progressing) packs with non-overlapping
content.

>> +		# Prune loose objects that are now packed into PACK and KEEP
>> +		git prune-packed &&
>> +
>> +		git repack --geometric 2 -dm --keep-pack=pack-$KEPT.pack >out &&
>> +
>> +		# Packs should not have changed (only one non-kept pack, no
>> +		# loose objects), but midx should now exist.
>> +		test_i18ngrep "Nothing new to pack" out &&
> 
> Nit; test_i18ngrep here should just be "grep".
> 

Thanks for pointing this out - I've been a bit unsure of the difference for
a while, but this pushed me to figure out the difference and I found the
note in 'test-lib-functions.sh' clarifying that 'test_i18ngrep' is
deprecated.

>> +		test_path_is_file $midx &&
>> +		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
>> +		git fsck
>> +	)
> 
> 
> Thanks,
> Taylor
Taylor Blau May 20, 2022, 7:05 p.m. UTC | #3
On Fri, May 20, 2022 at 10:27:21AM -0700, Victoria Dye wrote:
> > I left a couple of small notes on the patch below, but since I have some
> > patches that deal with a separate issue in the `git repack --geometric`
> > code coming, do you want to combine forces (and I can send a
> > lightly-reworked version of this patch as a part of my series)?
>
> Works for me! I'm happy with all the suggested changes you noted below
> (moving the 'string_list_sort' and cleaning up the test), so feel free to
> include them in your series.
>
> Thanks!

No problem; I (re-)sent this patch as 1/3 in my series which should be
available via the archive at:

    https://lore.kernel.org/git/cover.1653073280.git.me@ttaylorr.com/T/#t

It looks like we're on the same page with respect to my suggestions, but
feel free to take a look at the reworked version of this patch (and the
new ones on top, too) and let me know what you think.

> >> @@ -332,17 +332,34 @@ static int geometry_cmp(const void *va, const void *vb)
> >>  	return 0;
> >>  }
> >>
> >> -static void init_pack_geometry(struct pack_geometry **geometry_p)
> >> +static void init_pack_geometry(struct pack_geometry **geometry_p,
> >> +			       struct string_list *existing_kept_packs)
> >>  {
> >>  	struct packed_git *p;
> >>  	struct pack_geometry *geometry;
> >> +	struct strbuf buf = STRBUF_INIT;
> >>
> >>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
> >>  	geometry = *geometry_p;
> >>
> >> +	string_list_sort(existing_kept_packs);
> >
> > Would it be worth sorting this as early as in collect_pack_filenames()?
> > For our purposes in this patch, this works as-is, but it may be
> > defensive to try and minimize the time that list has unsorted contents.
>
> I went back and forth on this, eventually settling on this to keep the
> 'string_list_sort' as close as possible to where the sorted list is needed.
> I'm still pretty indifferent, though, so moving it to the end of
> 'collect_pack_filenames()' is fine with me.

I'm fine with it either way. I opted to sorting the list in
collect_pack_filenames() since I think it's slightly more fool-proof,
but I also don't have strong feelings about its placement.

Thanks,
Taylor
Junio C Hamano May 20, 2022, 8:41 p.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:

>>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
>>  	geometry = *geometry_p;
>>
>> +	string_list_sort(existing_kept_packs);
>
> Would it be worth sorting this as early as in collect_pack_filenames()?
> For our purposes in this patch, this works as-is, but it may be
> defensive to try and minimize the time that list has unsorted contents.

It does not matter too much but after reading the latest one before
coming back to this original thread, I actually thought that sorting
it too early was questionable.  Nobody other than this code cares
about the list being sorted, and if it turns out that the look-up in
the loop need to be optimized, it is plausible that we do not need
this list sorted when we populate a hashmap out of its contents, for
example.

>>  	for (p = get_all_packs(the_repository); p; p = p->next) {
>> -		if (!pack_kept_objects && p->pack_keep)
>> -			continue;
>> +		if (!pack_kept_objects) {
>> +			if (p->pack_keep)
>> +				continue;
>
> (You mentioned this to me off-list, but I'll repeat it here since it
> wasn't obvious to me on first read): this check for `p->pack_keep` isn't
> strictly necessary, since any packs that have their `pack_keep` bit set
> will appear in the `existing_kept_packs` list.
>
> But it does give us a fast path to avoid having to check that list, so
> it's worth checking that bit to avoid a slightly more expensive check
> where possible.

That invites a related but different question.  Doesn't p->pack_keep
and string_list_has_string(existing_kept_packs, name_of_pack(p))
mirror each other?  Can a pack appear on the existing_kept_packs
list without having .pack_keep bit set and under what condition?

It is answered by the comment below, I presume?

>> +			/*
>> +			 * The pack may be kept via the --keep-pack option;
>> +			 * check 'existing_kept_packs' to determine whether to
>> +			 * ignore it.
>> +			 */

OK.  So there are two classes of packs we want to exclude from the
geometry repacking.  Those that already have .pack_keep bit set, and
those that are _are_ newly making into kept packs that do not yet
have .pack_keep bit set.  Makes sense.
Junio C Hamano May 20, 2022, 10:12 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

>>> +			/*
>>> +			 * The pack may be kept via the --keep-pack option;
>>> +			 * check 'existing_kept_packs' to determine whether to
>>> +			 * ignore it.
>>> +			 */
>
> OK.  So there are two classes of packs we want to exclude from the
> geometry repacking.  Those that already have .pack_keep bit set, and
> those that are _are_ newly making into kept packs that do not yet
> have .pack_keep bit set.  Makes sense.

And with another topic in-flight combined, we have the third class
that we would want to exclude here, i.e. the ones that are "cruft".

(Noticed while rebuilding 'seen').
Taylor Blau May 20, 2022, 10:20 p.m. UTC | #6
On Fri, May 20, 2022 at 03:12:32PM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> >>> +			/*
> >>> +			 * The pack may be kept via the --keep-pack option;
> >>> +			 * check 'existing_kept_packs' to determine whether to
> >>> +			 * ignore it.
> >>> +			 */
> >
> > OK.  So there are two classes of packs we want to exclude from the
> > geometry repacking.  Those that already have .pack_keep bit set, and
> > those that are _are_ newly making into kept packs that do not yet
> > have .pack_keep bit set.  Makes sense.
>
> And with another topic in-flight combined, we have the third class
> that we would want to exclude here, i.e. the ones that are "cruft".

Indeed. tb/cruft-packs handles this by:

    if (p->is_cruft)
      continue;

since we would never want to roll up the objects in a cruft pack during
geometric repacking.

There is likely a small conflict you had to resolve, but the resolution
on 'seen' looks as-expected to me.

Thanks,
Taylor
Taylor Blau May 20, 2022, 11:26 p.m. UTC | #7
On Fri, May 20, 2022 at 01:41:49PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
> >>  	geometry = *geometry_p;
> >>
> >> +	string_list_sort(existing_kept_packs);
> >
> > Would it be worth sorting this as early as in collect_pack_filenames()?
> > For our purposes in this patch, this works as-is, but it may be
> > defensive to try and minimize the time that list has unsorted contents.
>
> It does not matter too much but after reading the latest one before
> coming back to this original thread, I actually thought that sorting
> it too early was questionable.  Nobody other than this code cares
> about the list being sorted, and if it turns out that the look-up in
> the loop need to be optimized, it is plausible that we do not need
> this list sorted when we populate a hashmap out of its contents, for
> example.

My thinking was simply that sorting it as often as possible (by doing so
immediately after appending all of our newly written packs into it) was
the most defensive approach that we could take. At least in the sense of
another caller which _does_ depend on `names` sorted, and makes the same
mistake of forgetting to call `string_list_sort()` itself.

I suspect that this is mostly academic anyway, and that either would be
fine, so if others feel strongly I'm happy to send an updated version
(though TBH I would be just as happy with what we already have in the
series I posted here).

> >>  	for (p = get_all_packs(the_repository); p; p = p->next) {
> >> -		if (!pack_kept_objects && p->pack_keep)
> >> -			continue;
> >> +		if (!pack_kept_objects) {
> >> +			if (p->pack_keep)
> >> +				continue;
> >
> > (You mentioned this to me off-list, but I'll repeat it here since it
> > wasn't obvious to me on first read): this check for `p->pack_keep` isn't
> > strictly necessary, since any packs that have their `pack_keep` bit set
> > will appear in the `existing_kept_packs` list.
> >
> > But it does give us a fast path to avoid having to check that list, so
> > it's worth checking that bit to avoid a slightly more expensive check
> > where possible.
>
> That invites a related but different question.  Doesn't p->pack_keep
> and string_list_has_string(existing_kept_packs, name_of_pack(p))
> mirror each other?  Can a pack appear on the existing_kept_packs
> list without having .pack_keep bit set and under what condition?

Unfortunately not. `repack` is much more keen to manipulate the results
of calling readdir() over $GIT_DIR/objects/pack than it is to, say, call
`get_all_packs()` and mark `p->pack_keep_in_core`.

In a future where `repack` does not interact with the packdir at such a
low level, I don't think we would have a use for a function like
`collect_pack_filenames()` at all. But that is a much larger change than
this one ;).

> It is answered by the comment below, I presume?
>
> >> +			/*
> >> +			 * The pack may be kept via the --keep-pack option;
> >> +			 * check 'existing_kept_packs' to determine whether to
> >> +			 * ignore it.
> >> +			 */
>
> OK.  So there are two classes of packs we want to exclude from the
> geometry repacking.  Those that already have .pack_keep bit set, and
> those that are _are_ newly making into kept packs that do not yet
> have .pack_keep bit set.  Makes sense.

Yep (and also "yep" to your note below about adding cruft packs into the
mix).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index d1a563d5b65..0b636676056 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -332,17 +332,34 @@  static int geometry_cmp(const void *va, const void *vb)
 	return 0;
 }
 
-static void init_pack_geometry(struct pack_geometry **geometry_p)
+static void init_pack_geometry(struct pack_geometry **geometry_p,
+			       struct string_list *existing_kept_packs)
 {
 	struct packed_git *p;
 	struct pack_geometry *geometry;
+	struct strbuf buf = STRBUF_INIT;
 
 	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
 	geometry = *geometry_p;
 
+	string_list_sort(existing_kept_packs);
 	for (p = get_all_packs(the_repository); p; p = p->next) {
-		if (!pack_kept_objects && p->pack_keep)
-			continue;
+		if (!pack_kept_objects) {
+			if (p->pack_keep)
+				continue;
+
+			/*
+			 * The pack may be kept via the --keep-pack option;
+			 * check 'existing_kept_packs' to determine whether to
+			 * ignore it.
+			 */
+			strbuf_reset(&buf);
+			strbuf_addstr(&buf, pack_basename(p));
+			strbuf_strip_suffix(&buf, ".pack");
+
+			if (string_list_has_string(existing_kept_packs, buf.buf))
+				continue;
+		}
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -353,6 +370,7 @@  static void init_pack_geometry(struct pack_geometry **geometry_p)
 	}
 
 	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
+	strbuf_release(&buf);
 }
 
 static void split_pack_geometry(struct pack_geometry *geometry, int factor)
@@ -714,17 +732,20 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		strbuf_release(&path);
 	}
 
+	packdir = mkpathdup("%s/pack", get_object_directory());
+	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
+	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
+
+	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
+			       &keep_pack_list);
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
-		init_pack_geometry(&geometry);
+		init_pack_geometry(&geometry, &existing_kept_packs);
 		split_pack_geometry(geometry, geometric_factor);
 	}
 
-	packdir = mkpathdup("%s/pack", get_object_directory());
-	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
-	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
-
 	sigchain_push_common(remove_pack_on_signal);
 
 	prepare_pack_objects(&cmd, &po_args);
@@ -764,9 +785,6 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
-	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
-			       &keep_pack_list);
-
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index bdbbcbf1eca..f5ac23413d5 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -180,6 +180,40 @@  test_expect_success '--geometric ignores kept packs' '
 	)
 '
 
+test_expect_success '--geometric ignores --keep-pack packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# Create two equal-sized packs
+		test_commit kept && # 3 objects
+		test_commit pack && # 3 objects
+
+		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/kept
+		EOF
+		) &&
+		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/pack
+		^refs/tags/kept
+		EOF
+		) &&
+
+		# Prune loose objects that are now packed into PACK and KEEP
+		git prune-packed &&
+
+		git repack --geometric 2 -dm --keep-pack=pack-$KEPT.pack >out &&
+
+		# Packs should not have changed (only one non-kept pack, no
+		# loose objects), but midx should now exist.
+		test_i18ngrep "Nothing new to pack" out &&
+		test_path_is_file $midx &&
+		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
+		git fsck
+	)
+'
+
 test_expect_success '--geometric chooses largest MIDX preferred pack' '
 	git init geometric &&
 	test_when_finished "rm -fr geometric" &&