[v2,11/18] maintenance: auto-size incremental-repack batch

Message ID 478c7f1d0b858755c2c4b98605405214910b6f4c.1595527000.git.gitgitgadget@gmail.com
State New, archived
Series Maintenance builtin, allowing 'gc --auto' customization

Commit Message

Derrick Stolee via GitGitGadget July 23, 2020, 5:56 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'incremental-repack' task, we use the
--batch-size option in 'git multi-pack-index repack'. The initial setting
used --batch-size=0 to repack everything into a single pack-file. This is
not sustaintable for a large repository. The amount of work required is
also likely to use too many system resources for a background job.

Update the 'incremental-repack' task by dynamically computing a
--batch-size option based on the current pack-file structure.

The dynamic default size is computed with this idea in mind for a client
repository that was cloned from a very large remote: there is likely one
"big" pack-file that was created at clone time. Thus, do not try
repacking it as it is likely packed efficiently by the server.

Instead, we select the second-largest pack-file, and create a batch size
that is one larger than that pack-file. If there are three or more
pack-files, then this guarantees that at least two will be combined into
a new pack-file.

Of course, this means that the second-largest pack-file size is likely
to grow over time and may eventually surpass the initially-cloned
pack-file. Recall that the pack-file batch is selected in a greedy
manner: the packs are considered from oldest to newest and are selected
if they have size smaller than the batch size until the total selected
size is larger than the batch size. Thus, that oldest "clone" pack will
be first to repack after the new data creates a pack larger than that.
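
To make the selection concrete, here is a minimal sketch of that greedy
scan (hypothetical names; the actual logic lives in the multi-pack-index
repack code):

    size_t i, total_size = 0;

    /* packs[] is ordered from oldest to newest */
    for (i = 0; total_size < batch_size && i < num_packs; i++) {
        if (packs[i].size < batch_size) {
            packs[i].selected = 1;
            total_size += packs[i].size;
        }
    }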

We also want to place some limits on how large these pack-files become,
in order to bound the amount of time spent repacking. A maximum
batch-size of two gigabytes means that large repositories will never be
packed into a single pack-file using this job, but also that repack is
rather expensive. This is a trade-off that is valuable to have if the
maintenance is being run automatically or in the background. Users who
truly want to optimize for space and performance (and are willing to pay
the upfront cost of a full repack) can use the 'gc' task to do so.

Reported-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c           | 48 +++++++++++++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh |  5 +++--
 2 files changed, 50 insertions(+), 3 deletions(-)

Comments

Junio C Hamano July 23, 2020, 10:15 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When repacking during the 'incremental-repack' task, we use the
> --batch-size option in 'git multi-pack-index repack'. The initial setting
> used --batch-size=0 to repack everything into a single pack-file. This is
> not sustaintable for a large repository. The amount of work required is

sustainable.

> also likely to use too many system resources for a background job.
>
> Update the 'incremental-repack' task by dynamically computing a
> --batch-size option based on the current pack-file structure.

OK.

> The dynamic default size is computed with this idea in mind for a client
> repository that was cloned from a very large remote: there is likely one
> "big" pack-file that was created at clone time. Thus, do not try
> repacking it as it is likely packed efficiently by the server.
>
> Instead, we select the second-largest pack-file, and create a batch size
> that is one larger than that pack-file. If there are three or more
> pack-files, then this guarantees that at least two will be combined into
> a new pack-file.
>
> Of course, this means that the second-largest pack-file size is likely
> to grow over time and may eventually surpass the initially-cloned
> pack-file. Recall that the pack-file batch is selected in a greedy
> manner: the packs are considered from oldest to newest and are selected
> if they have size smaller than the batch size until the total selected
> size is larger than the batch size. Thus, that oldest "clone" pack will
> be first to repack after the new data creates a pack larger than that.
>
> We also want to place some limits on how large these pack-files become,
> in order to bound the amount of time spent repacking. A maximum
> batch-size of two gigabytes means that large repositories will never be
> packed into a single pack-file using this job, but also that repack is
> rather expensive. This is a trade-off that is valuable to have if the
> maintenance is being run automatically or in the background. Users who
> truly want to optimize for space and performance (and are willing to pay
> the upfront cost of a full repack) can use the 'gc' task to do so.

It might be too late to ask this now, but how is the quality of
the resulting combined pack ensured, wrt locality and deltification?

> +#define TWO_GIGABYTES (2147483647)
> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
> +
> +static off_t get_auto_pack_size(void)
> +{
> +	/*
> +	 * The "auto" value is special: we optimize for
> +	 * one large pack-file (i.e. from a clone) and
> +	 * expect the rest to be small and they can be
> +	 * repacked quickly.
> +	 *
> +	 * The strategy we select here is to select a
> +	 * size that is one more than the second largest
> +	 * pack-file. This ensures that we will repack
> +	 * at least two packs if there are three or more
> +	 * packs.
> +	 */
> +	off_t max_size = 0;
> +	off_t second_largest_size = 0;
> +	off_t result_size;
> +	struct packed_git *p;
> +	struct repository *r = the_repository;
> +
> +	reprepare_packed_git(r);
> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;

We won't worry about this addition wrapping around; I guess we
cannot do anything intelligent when it happens.

> +	/* But limit ourselves to a batch size of 2g */
> +	if (result_size > TWO_GIGABYTES)
> +		result_size = TWO_GIGABYTES;

Well, when it happens, we'd cap to 2G, which must be a reasonable
fallback value, so it would be OK.

> +	return result_size;
> +}
> +
>  static int multi_pack_index_repack(void)
>  {
>  	int result;
>  	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf batch_arg = STRBUF_INIT;
> +
>  	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
>  
>  	if (opts.quiet)
>  		argv_array_push(&cmd, "--no-progress");
>  
> -	argv_array_push(&cmd, "--batch-size=0");
> +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> +		    (uintmax_t)get_auto_pack_size());
> +	argv_array_push(&cmd, batch_arg.buf);
>  
>  	close_object_store(the_repository->objects);
>  	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	strbuf_release(&batch_arg);

I think I saw a suggestion to use xstrfmt() with free()  instead of
the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
in a similar but different context.  Perhaps we should follow suit
here, too?


Thanks.  That's it for today from me.
Eric Sunshine July 23, 2020, 11:09 p.m. UTC | #2
On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > +     struct strbuf batch_arg = STRBUF_INIT;
> > +
> > -     argv_array_push(&cmd, "--batch-size=0");
> > +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> > +                 (uintmax_t)get_auto_pack_size());
> > +     argv_array_push(&cmd, batch_arg.buf);
> >
> > +     strbuf_release(&batch_arg);
>
> I think I saw a suggestion to use xstrfmt() with free()  instead of
> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
> in a similar but different context.  Perhaps we should follow suit
> here, too?

Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
be even simpler?

    argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
        (uintmax_t)get_auto_pack_size());
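
For completeness, a sketch of the full function with that change applied
(the verify-on-failure handling from the patch is elided):

    static int multi_pack_index_repack(void)
    {
        int result;
        struct argv_array cmd = ARGV_ARRAY_INIT;

        argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);

        if (opts.quiet)
            argv_array_push(&cmd, "--no-progress");

        /* formats straight into the argv_array; no strbuf to release */
        argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
                         (uintmax_t)get_auto_pack_size());

        close_object_store(the_repository->objects);
        result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);

        /* ... verify-on-failure handling as in the patch ... */
        return result;
    }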
Junio C Hamano July 23, 2020, 11:24 p.m. UTC | #3
Eric Sunshine <sunshine@sunshineco.com> writes:

> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> > +     struct strbuf batch_arg = STRBUF_INIT;
>> > +
>> > -     argv_array_push(&cmd, "--batch-size=0");
>> > +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>> > +                 (uintmax_t)get_auto_pack_size());
>> > +     argv_array_push(&cmd, batch_arg.buf);
>> >
>> > +     strbuf_release(&batch_arg);
>>
>> I think I saw a suggestion to use xstrfmt() with free()  instead of
>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>> in a similar but different context.  Perhaps we should follow suit
>> here, too?
>
> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
> be even simpler?
>
>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>         (uintmax_t)get_auto_pack_size());

No, it was me who was missing an obvious and better alternative.
Derrick Stolee July 24, 2020, 4:09 p.m. UTC | #4
On 7/23/2020 7:24 PM, Junio C Hamano wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> 
>> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>> +     struct strbuf batch_arg = STRBUF_INIT;
>>>> +
>>>> -     argv_array_push(&cmd, "--batch-size=0");
>>>> +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>>>> +                 (uintmax_t)get_auto_pack_size());
>>>> +     argv_array_push(&cmd, batch_arg.buf);
>>>>
>>>> +     strbuf_release(&batch_arg);
>>>
>>> I think I saw a suggestion to use xstrfmt() with free()  instead of
>>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>>> in a similar but different context.  Perhaps we should follow suit
>>> here, too?
>>
>> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
>> be even simpler?
>>
>>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>>         (uintmax_t)get_auto_pack_size());
> 
> No, it was me who was missing an obvious and better alternative.

Today I learned about argv_array_pushf(). Thanks!

-Stolee
Derrick Stolee July 24, 2020, 7:51 p.m. UTC | #5
On 7/23/2020 6:15 PM, Junio C Hamano wrote:
> It might be too late to ask this now, but how is the quality of
> the resulting combined pack ensured, wrt locality and deltification?

There are two questions here, really.

The first is: given the set of objects to pack, are we packing
them as efficiently as possible?

Since e11d86de139 (midx: teach "git multi-pack-index repack" honor
"git repack" configurations, 2020-05-10), the 'repack' subcommand
honors the configured recommendations for deltas. This includes:

 (requires updating the arguments to pack-objects)
 * repack.useDeltaBaseOffset
 * repack.useDeltaIslands

 (automatically respected by pack-objects)
 * repack.packKeptObjects
 * pack.threads
 * pack.depth
 * pack.window
 * pack.windowMemory
 * pack.deltaCacheSize
 * pack.deltaCacheLimit

All of these config settings allow the user to specify how hard
to try for delta compression. If they know something about their
data or their tolerance for extra CPU time during pack-objects,
then they can get better deltas by changing these values.
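
For instance, the first two are forwarded to pack-objects along these
lines (an illustrative excerpt; the actual plumbing is in midx.c):

    int delta_base_offset = 1;
    int use_delta_islands = 0;

    git_config_get_bool("repack.usedeltabaseoffset", &delta_base_offset);
    git_config_get_bool("repack.usedeltaislands", &use_delta_islands);

    if (delta_base_offset)
        argv_array_push(&cmd, "--delta-base-offset");
    if (use_delta_islands)
        argv_array_push(&cmd, "--delta-islands");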

The second question is "how well do the deltas compress when
only packing incrementally versus packing the entire repo?"

One important way to consider these things is how the pack-
files are created. If we expect most pack-files coming from
'git fetch' calls, then there are some interesting patterns
that arise.

I started measuring by creating a local clone of the Linux
kernel repo starting at v5.0 and then fetching an increment
of ten commits from the first-parent history of later tags.
Each fetch created a pack-file of ~300 MB relative to the
base pack-file of ~1.6 GB. Collecting ten of these in a row
leads to almost 2 GB of "fetched" packs.

However, keep in mind that we didn't fetch 2 GB of data
"across the wire" but instead expanded the thin pack into
a full pack by copying the base objects. After running the
incremental-repack step, that ~2 GB of data re-compresses
back down to one pack-file of size ~300 MB.

_Why_ did 10 pack-files all around 300 MB get repacked at
once? It's because there were duplicate objects across those
pack-files! Recall that the multi-pack-index repack selects its
batch using an "estimated pack size": it counts how many objects
in each pack-file are referenced by the multi-pack-index, then
computes

  expected size = actual size * num referenced objects
                              / num objects

In this case, the "base" objects that are copied between the
fetches already exist in these smaller pack-files. Thus, when
the batch-size is ~300 MB it still repacks all 10 "small"
packs into a new pack that is still ~300 MB.
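
In code, the estimate amounts to something like this (illustrative
names; the real computation in midx.c also guards against overflow):

    /* scale on-disk size by the fraction of objects still referenced */
    off_t estimate_pack_size(off_t pack_size, uint32_t num_objects,
                             uint32_t referenced_objects)
    {
        if (!num_objects)
            return 0;
        return pack_size * referenced_objects / num_objects;
    }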

Now, this is still a little wasteful. That second pack carries
a significant "extra space" cost. However, it comes with the
bonus of writing much less data.

Perhaps the Linux kernel repository is just too small to care
about this version of maintenance? In such a case, I can work
to introduce a 'full-repack' task that is more aggressive with
repacking all pack-files. This could use the multi-pack-index
repack with --batch-size=0 to still benefit from the
repack/expire trick for safe concurrency.

Other ideas are to try repacking in other ways, such as by
object type, to maximize easy wins. For example, perhaps we
repack all of the commits and trees every time, but leave the
blobs to be repacked when we are ready to spend time on
removing deltas?

I think the incremental-repack has value, though perhaps that
value is limited to super-huge repositories. That can be
controlled by restricting its use to the cases where an expert
user explicitly configures Git to run it.

I remain open to recommendations from others with more
experience with delta compression to recommend alternatives.

tl;dr: the incremental-repack isn't the most space-efficient
thing we can do, and that's by design.

Thanks,
-Stolee
Junio C Hamano July 24, 2020, 8:17 p.m. UTC | #6
Derrick Stolee <stolee@gmail.com> writes:

> tl;dr: the incremental-repack isn't the most space-efficient
> thing we can do, and that's by design.

I am not very surprised by the fact that many packfiles that were
obtained by thin pack transfer have many duplicate objects (due to
having to include the delta bases), and it is natural to expect that
deduplication would save many bytes.  It's not all that interesting.

I am more interested in making sure that in the combined single
pack, (1) objects are ordered for good locality of access and
(2) objects are getting good delta compression.
Emily Shaffer July 29, 2020, 10:23 p.m. UTC | #7
On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/gc.c b/builtin/gc.c
> index eb4b01c104..889d97afe7 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>  	return result;
>  }
>  
> +#define TWO_GIGABYTES (2147483647)

[jonathan tan] This would be easier to understand if it was expressed
with bitshift, e.g. 1 << 31

> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
[jonathan tan] This looks like it's never used... and it is vulnerable
to cross-platform size changes because it refers to an implicitly
sized integer type; it could behave differently if it were assigned
into a larger type, e.g. you wouldn't get 0xFFFF... if you assigned
this into a long long.

> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;
[jonathan tan] What happens when there's only one packfile, and when
there are two packfiles? Can we write tests to illustrate the behavior?
The edge case here (result_size=1) is hard to understand by reading the
code.
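
Tracing the loop with hypothetical sizes (no substitute for actual
tests):

    /*
     * One pack (size S):    max_size = S, second_largest_size = 0,
     *                       so result_size = 1 and no pack can be
     *                       smaller than the batch size.
     * Two packs (S1 > S2):  result_size = S2 + 1, so only the smaller
     *                       pack fits under the batch size.
     *
     * Either way, fewer than two packs qualify for the batch, so no
     * new pack should be written.
     */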

> +
> +	/* But limit ourselves to a batch size of 2g */
[emily shaffer] nit: can you please capitalize G, GB, whatever :)
Derrick Stolee July 30, 2020, 4:57 p.m. UTC | #8
On 7/29/2020 6:23 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index eb4b01c104..889d97afe7 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>  	return result;
>>  }
>>  
>> +#define TWO_GIGABYTES (2147483647)
> 
> [jonathan tan] This would be easier to understand if it was expressed
> with bitshift, e.g. 1 << 31

This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
on Windows. But it's better to not use magic numbers and instead use
operations like this.

>> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
> [jonathan tan] This looks like it's never used... and vulnerable to
> cross-platform size changes because it's referring to an implicitly
> sized int, and could behave differently if it was put into a larger
> size, e.g. you wouldn't get 0xFFFF.. if you assigned this into a long
> long.

Thanks for catching this cruft.

>> +	for (p = get_all_packs(r); p; p = p->next) {
>> +		if (p->pack_size > max_size) {
>> +			second_largest_size = max_size;
>> +			max_size = p->pack_size;
>> +		} else if (p->pack_size > second_largest_size)
>> +			second_largest_size = p->pack_size;
>> +	}
>> +
>> +	result_size = second_largest_size + 1;
> [jonathan tan] What happens when there's only one packfile, and when
> there are two packfiles? Can we write tests to illustrate the behavior?
> The edge case here (result_size=1) is hard to understand by reading the
> code.
> 
>> +
>> +	/* But limit ourselves to a batch size of 2g */
> [emily shaffer] nit: can you please capitalize G, GB, whatever :)

I suppose I could (get_unit_factor() performs case-insensitive matches
on the size suffixes), except that would be inconsistent with the
following error message in parse-options.c:

	return error(_("%s expects a non-negative integer value"
		       " with an optional k/m/g suffix"),

or these options in Documentation/git-pack-objects.txt:

--window-memory=<n>::
	This option provides an additional limit on top of `--window`;
	the window size will dynamically scale down so as to not take
	up more than '<n>' bytes in memory.  This is useful in
	repositories with a mix of large and small objects to not run
	out of memory with a large window, but still be able to take
	advantage of the large window for the smaller objects.  The
	size can be suffixed with "k", "m", or "g".
	`--window-memory=0` makes memory usage unlimited.  The default
	is taken from the `pack.windowMemory` configuration variable.

--max-pack-size=<n>::
	In unusual scenarios, you may not be able to create files
	larger than a certain size on your filesystem, and this option
	can be used to tell the command to split the output packfile
	into multiple independent packfiles, each not larger than the
	given size. The size can be suffixed with
	"k", "m", or "g". The minimum size allowed is limited to 1 MiB.
	This option
	prevents the creation of a bitmap index.
	The default is unlimited, unless the config variable
	`pack.packSizeLimit` is set.

Perhaps that's not really a good reason, but upper vs lower case
seems to be arbitrary. Any tie breaker seems valid.

Thanks,
-Stolee
Derrick Stolee July 30, 2020, 7:02 p.m. UTC | #9
On 7/30/2020 12:57 PM, Derrick Stolee wrote:
> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>> index eb4b01c104..889d97afe7 100644
>>> --- a/builtin/gc.c
>>> +++ b/builtin/gc.c
>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>  	return result;
>>>  }
>>>  
>>> +#define TWO_GIGABYTES (2147483647)
>>
>> [jonathan tan] This would be easier to understand if it was expressed
>> with bitshift, e.g. 1 << 31
> 
> This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
> on Windows. But it's better to not use magic numbers and instead use
> operations like this.

Nevermind. This breaks the build on 32-bit machines (even adding a few
"L" characters). I'll replace my magic decimal number with a magic
hexadecimal number.

Thanks,
-Stolee
Chris Torek July 30, 2020, 7:24 p.m. UTC | #10
On Thu, Jul 30, 2020 at 12:04 PM Derrick Stolee <stolee@gmail.com> wrote:
> > This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
> > on Windows. But it's better to not use magic numbers and instead use
> > operations like this.
>
> Nevermind. This breaks the build on 32-bit machines (even adding a few
> "L" characters). I'll replace my magic decimal number with a magic
> hexadecimal number.

You would need something along these lines:

    #define WHATEVER ((long)((1UL << 31) - 1))

i.e., make the constant 1 be "unsigned long" so that the shift is
well defined, then subtract 1, then convert back to signed long.

I say "something along" because there are several more ways to write
it.  I'm not really a big fan of most of them, and 0x7FFFFFFF (or in
lowercase) is my own preference here too.

Chris
Đoàn Trần Công Danh Aug. 5, 2020, 12:37 p.m. UTC | #11
On 2020-07-30 15:02:32-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 7/30/2020 12:57 PM, Derrick Stolee wrote:
> > On 7/29/2020 6:23 PM, Emily Shaffer wrote:
> >> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
> >>> diff --git a/builtin/gc.c b/builtin/gc.c
> >>> index eb4b01c104..889d97afe7 100644
> >>> --- a/builtin/gc.c
> >>> +++ b/builtin/gc.c
> >>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
> >>>  	return result;
> >>>  }
> >>>  
> >>> +#define TWO_GIGABYTES (2147483647)
> >>
> >> [jonathan tan] This would be easier to understand if it was expressed
> >> with bitshift, e.g. 1 << 31
> > 
> > This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
> > on Windows. But it's better to not use magic numbers and instead use
> > operations like this.
> 
> Nevermind. This breaks the build on 32-bit machines (even adding a few
> "L" characters). I'll replace my magic decimal number with a magic
> hexadecimal number.

Would it be better if we use this C99 feature instead:

	#define TWO_GIGABYTES INT32_MAX

But it seems like it's available everywhere, except (maybe) Windows.
(judging from UINT32_MAX's usage)
Derrick Stolee Aug. 6, 2020, 1:54 p.m. UTC | #12
On 8/5/2020 8:37 AM, Đoàn Trần Công Danh wrote:
> On 2020-07-30 15:02:32-0400, Derrick Stolee <stolee@gmail.com> wrote:
>> On 7/30/2020 12:57 PM, Derrick Stolee wrote:
>>> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>>>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>>>> index eb4b01c104..889d97afe7 100644
>>>>> --- a/builtin/gc.c
>>>>> +++ b/builtin/gc.c
>>>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>>>  	return result;
>>>>>  }
>>>>>  
>>>>> +#define TWO_GIGABYTES (2147483647)
>>>>
>>>> [jonathan tan] This would be easier to understand if it was expressed
>>>> with bitshift, e.g. 1 << 31
>>>
>>> This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
>>> on Windows. But it's better to not use magic numbers and instead use
>>> operations like this.
>>
>> Nevermind. This breaks the build on 32-bit machines (even adding a few
>> "L" characters). I'll replace my magic decimal number with a magic
>> hexadecimal number.
> 
> Would it be better if we use this C99 feature instead:
> 
> 	#define TWO_GIGABYTES INT32_MAX
> 
> But it seems like it's available everywhere, except (maybe) Windows.
> (judging from UINT32_MAX's usage)

Thanks for the recommendation. This appears to work in all of the
CI builds, including Windows. I'll use that constant in my next
version. Perhaps we will discover a platform that doesn't have it
by including it.
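
Presumably the macro would then read:

    /* INT32_MAX comes via C99's <stdint.h> */
    #define TWO_GIGABYTES INT32_MAX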

-Stolee

Patch

diff --git a/builtin/gc.c b/builtin/gc.c
index eb4b01c104..889d97afe7 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1021,19 +1021,65 @@  static int multi_pack_index_expire(void)
 	return result;
 }
 
+#define TWO_GIGABYTES (2147483647)
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+
+static off_t get_auto_pack_size(void)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly.
+	 *
+	 * The strategy we select here is to select a
+	 * size that is one more than the second largest
+	 * pack-file. This ensures that we will repack
+	 * at least two packs if there are three or more
+	 * packs.
+	 */
+	off_t max_size = 0;
+	off_t second_largest_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+	struct repository *r = the_repository;
+
+	reprepare_packed_git(r);
+	for (p = get_all_packs(r); p; p = p->next) {
+		if (p->pack_size > max_size) {
+			second_largest_size = max_size;
+			max_size = p->pack_size;
+		} else if (p->pack_size > second_largest_size)
+			second_largest_size = p->pack_size;
+	}
+
+	result_size = second_largest_size + 1;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
 static int multi_pack_index_repack(void)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
 
 	if (opts.quiet)
 		argv_array_push(&cmd, "--no-progress");
 
-	argv_array_push(&cmd, "--batch-size=0");
+	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
+		    (uintmax_t)get_auto_pack_size());
+	argv_array_push(&cmd, batch_arg.buf);
 
 	close_object_store(the_repository->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	strbuf_release(&batch_arg);
 
 	if (result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 3ec813979a..ab5c961eb9 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -134,10 +134,11 @@  test_expect_success 'incremental-repack task' '
 	test_line_count = 4 packs-between &&
 
 	# the job deletes the two old packs, and does not write
-	# a new one because only one pack remains.
+	# a new one because the batch size is not high enough to
+	# pack the largest pack-file.
 	git maintenance run --task=incremental-repack &&
 	ls .git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	test_line_count = 2 packs-after
 '
 
 test_done