Message ID | 478c7f1d0b858755c2c4b98605405214910b6f4c.1595527000.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | Maintenance builtin, allowing 'gc --auto' customization |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When repacking during the 'incremental-repack' task, we use the
> --batch-size option in 'git multi-pack-index repack'. The initial setting
> used --batch-size=0 to repack everything into a single pack-file. This is
> not sustaintable for a large repository. The amount of work required is

sustainable.

> also likely to use too many system resources for a background job.
>
> Update the 'incremental-repack' task by dynamically computing a
> --batch-size option based on the current pack-file structure.

OK.

> The dynamic default size is computed with this idea in mind for a client
> repository that was cloned from a very large remote: there is likely one
> "big" pack-file that was created at clone time. Thus, do not try
> repacking it as it is likely packed efficiently by the server.
>
> Instead, we select the second-largest pack-file, and create a batch size
> that is one larger than that pack-file. If there are three or more
> pack-files, then this guarantees that at least two will be combined into
> a new pack-file.
>
> Of course, this means that the second-largest pack-file size is likely
> to grow over time and may eventually surpass the initially-cloned
> pack-file. Recall that the pack-file batch is selected in a greedy
> manner: the packs are considered from oldest to newest and are selected
> if they have size smaller than the batch size until the total selected
> size is larger than the batch size. Thus, that oldest "clone" pack will
> be first to repack after the new data creates a pack larger than that.
>
> We also want to place some limits on how large these pack-files become,
> in order to bound the amount of time spent repacking. A maximum
> batch-size of two gigabytes means that large repositories will never be
> packed into a single pack-file using this job, but also that repack is
> rather expensive.
> This is a trade-off that is valuable to have if the
> maintenance is being run automatically or in the background. Users who
> truly want to optimize for space and performance (and are willing to pay
> the upfront cost of a full repack) can use the 'gc' task to do so.

It might be too late to ask this now, but how is the quality of the
resulting combined pack ensured, wrt locality and deltification?

> +#define TWO_GIGABYTES (2147483647)
> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
> +
> +static off_t get_auto_pack_size(void)
> +{
> +	/*
> +	 * The "auto" value is special: we optimize for
> +	 * one large pack-file (i.e. from a clone) and
> +	 * expect the rest to be small and they can be
> +	 * repacked quickly.
> +	 *
> +	 * The strategy we select here is to select a
> +	 * size that is one more than the second largest
> +	 * pack-file. This ensures that we will repack
> +	 * at least two packs if there are three or more
> +	 * packs.
> +	 */
> +	off_t max_size = 0;
> +	off_t second_largest_size = 0;
> +	off_t result_size;
> +	struct packed_git *p;
> +	struct repository *r = the_repository;
> +
> +	reprepare_packed_git(r);
> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;

We won't worry about this addition wrapping around; I guess we cannot
do anything intelligent when it happens.

> +	/* But limit ourselves to a batch size of 2g */
> +	if (result_size > TWO_GIGABYTES)
> +		result_size = TWO_GIGABYTES;

Well, when it happens, we'd cap to 2G, which must be a reasonable
fallback value, so it would be OK.
> +	return result_size;
> +}
> +
>  static int multi_pack_index_repack(void)
>  {
>  	int result;
>  	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf batch_arg = STRBUF_INIT;
> +
>  	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
>
>  	if (opts.quiet)
>  		argv_array_push(&cmd, "--no-progress");
>
> -	argv_array_push(&cmd, "--batch-size=0");
> +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> +		    (uintmax_t)get_auto_pack_size());
> +	argv_array_push(&cmd, batch_arg.buf);
>
>  	close_object_store(the_repository->objects);
>  	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	strbuf_release(&batch_arg);

I think I saw a suggestion to use xstrfmt() with free() instead of
the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
in a similar but different context. Perhaps we should follow suit
here, too?

Thanks. That's it for today from me.
On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > +	struct strbuf batch_arg = STRBUF_INIT;
> > +
> > -	argv_array_push(&cmd, "--batch-size=0");
> > +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> > +		    (uintmax_t)get_auto_pack_size());
> > +	argv_array_push(&cmd, batch_arg.buf);
> >
> > +	strbuf_release(&batch_arg);
>
> I think I saw a suggestion to use xstrfmt() with free() instead of
> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
> in a similar but different context. Perhaps we should follow suit
> here, too?

Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
be even simpler?

    argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
                     (uintmax_t)get_auto_pack_size());
Eric Sunshine <sunshine@sunshineco.com> writes:

> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> > +	struct strbuf batch_arg = STRBUF_INIT;
>> > +
>> > -	argv_array_push(&cmd, "--batch-size=0");
>> > +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>> > +		    (uintmax_t)get_auto_pack_size());
>> > +	argv_array_push(&cmd, batch_arg.buf);
>> >
>> > +	strbuf_release(&batch_arg);
>>
>> I think I saw a suggestion to use xstrfmt() with free() instead of
>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>> in a similar but different context. Perhaps we should follow suit
>> here, too?
>
> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
> be even simpler?
>
>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>                      (uintmax_t)get_auto_pack_size());

No, it was me who was missing an obvious and better alternative.
On 7/23/2020 7:24 PM, Junio C Hamano wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
>
>> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>> +	struct strbuf batch_arg = STRBUF_INIT;
>>>> +
>>>> -	argv_array_push(&cmd, "--batch-size=0");
>>>> +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>>>> +		    (uintmax_t)get_auto_pack_size());
>>>> +	argv_array_push(&cmd, batch_arg.buf);
>>>>
>>>> +	strbuf_release(&batch_arg);
>>>
>>> I think I saw a suggestion to use xstrfmt() with free() instead of
>>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>>> in a similar but different context. Perhaps we should follow suit
>>> here, too?
>>
>> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
>> be even simpler?
>>
>>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>>                      (uintmax_t)get_auto_pack_size());
>
> No, it was me who was missing an obvious and better alternative.

Today I learned about argv_array_pushf(). Thanks!

-Stolee
On 7/23/2020 6:15 PM, Junio C Hamano wrote:
> It might be too late to ask this now, but how is the quality of
> the resulting combined pack ensured, wrt locality and deltification?

There are two questions here, really.

The first is: given the set of objects to pack, are we packing them
as efficiently as possible? Since e11d86de139 (midx: teach "git
multi-pack-index repack" honor "git repack" configurations,
2020-05-10), the 'repack' subcommand honors the configured
recommendations for deltas. This includes:

(requires updating the arguments to pack-objects)

 * repack.useDeltaBaseOffset
 * repack.useDeltaIslands

(automatically respected by pack-objects)

 * repack.packKeptObjects
 * pack.threads
 * pack.depth
 * pack.window
 * pack.windowMemory
 * pack.deltaCacheSize
 * pack.deltaCacheLimit

All of these config settings allow the user to specify how hard to
try for delta compression. If they know something about their data or
their tolerance for extra CPU time during pack-objects, then they can
get better deltas by changing these values.

The second question is "how well do the deltas compress when only
packing incrementally versus packing the entire repo?"

One important way to consider these things is how the pack-files are
created. If we expect most pack-files to come from 'git fetch' calls,
then there are some interesting patterns that arise.

I started measuring by creating a local clone of the Linux kernel
repo starting at v5.0 and then fetching an increment of ten commits
from the first-parent history of later tags. Each fetch created a
pack-file of ~300 MB relative to the base pack-file of ~1.6 GB.
Collecting ten of these in a row leads to almost 2 GB of "fetched"
packs. However, keep in mind that we didn't fetch 2 GB of data
"across the wire" but instead expanded the thin pack into a full pack
by copying the base objects.

After running the incremental-repack step, that ~2 GB of data
re-compresses back down to one pack-file of size ~300 MB.
_Why_ did 10 pack-files all around 300 MB get repacked at once? It's
because there were duplicate objects across those pack-files!

Recall that the multi-pack-index repack computes batch sizes using an
"estimated pack size": it counts how many objects in each pack-file
are still referenced by the multi-pack-index, then computes

	expected size = actual size * num referenced objects / num objects

In this case, the "base" objects that are copied between the fetches
already exist in these smaller pack-files. Thus, when the batch-size
is ~300 MB it still repacks all 10 "small" packs into a new pack that
is still ~300 MB.

Now, this is still a little wasteful. That second pack has a
significant "extra space" cost. However, it came at the bonus of
writing much less data.

Perhaps the Linux kernel repository is just too small to care about
this version of maintenance? In such a case, I can work to introduce
a 'full-repack' task that is more aggressive about repacking all
pack-files. This could use the multi-pack-index repack with
--batch-size=0 to still benefit from the repack/expire trick for safe
concurrency.

Other ideas are to try repacking in other ways, such as by object
type, to maximize easy wins. For example, perhaps we repack all of
the commits and trees every time, but leave the blobs to be repacked
when we are ready to spend time on removing deltas?

I think the incremental-repack has value, but perhaps its value is
isolated to super-huge repositories. That can be controlled by
limiting its use to those cases when an expert user configures Git to
use it. I remain open to recommendations from others with more
experience with delta compression for alternatives.

tl;dr: the incremental-repack isn't the most space-efficient thing we
can do, and that's by design.

Thanks,
-Stolee
Derrick Stolee <stolee@gmail.com> writes:

> tl;dr: the incremental-repack isn't the most space-efficient
> thing we can do, and that's by design.

I am not very surprised by the fact that many packfiles obtained by
thin pack transfer have many duplicate objects (due to having to
include the delta bases), and it is natural to expect that
deduplication would save many bytes. It's not all that interesting.

I am more interested in making sure that, in the combined single
pack, (1) objects are ordered for good locality of access and (2)
objects are getting good delta compression.
On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/gc.c b/builtin/gc.c
> index eb4b01c104..889d97afe7 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>  	return result;
>  }
>
> +#define TWO_GIGABYTES (2147483647)

[jonathan tan] This would be easier to understand if it was expressed
with a bitshift, e.g. 1 << 31.

> +#define UNSET_BATCH_SIZE ((unsigned long)-1)

[jonathan tan] This looks like it's never used... and it is vulnerable
to cross-platform size changes because it refers to an implicitly
sized int, and could behave differently if it was put into a larger
size; e.g. you wouldn't get 0xFFFF.. if you assigned this into a long
long.

> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;

[jonathan tan] What happens when there's only one packfile, and when
there are two packfiles? Can we write tests to illustrate the
behavior? The edge case here (result_size=1) is hard to understand by
reading the code.

> +
> +	/* But limit ourselves to a batch size of 2g */

[emily shaffer] nit: can you please capitalize G, GB, whatever :)
On 7/29/2020 6:23 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index eb4b01c104..889d97afe7 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>  	return result;
>>  }
>>
>> +#define TWO_GIGABYTES (2147483647)
>
> [jonathan tan] This would be easier to understand if it was expressed
> with a bitshift, e.g. 1 << 31.

This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
on Windows. But it's better to not use magic numbers and instead use
operations like this.

>> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
>
> [jonathan tan] This looks like it's never used... and it is vulnerable
> to cross-platform size changes because it refers to an implicitly
> sized int, and could behave differently if it was put into a larger
> size; e.g. you wouldn't get 0xFFFF.. if you assigned this into a long
> long.

Thanks for catching this cruft.

>> +	for (p = get_all_packs(r); p; p = p->next) {
>> +		if (p->pack_size > max_size) {
>> +			second_largest_size = max_size;
>> +			max_size = p->pack_size;
>> +		} else if (p->pack_size > second_largest_size)
>> +			second_largest_size = p->pack_size;
>> +	}
>> +
>> +	result_size = second_largest_size + 1;
>
> [jonathan tan] What happens when there's only one packfile, and when
> there are two packfiles? Can we write tests to illustrate the
> behavior? The edge case here (result_size=1) is hard to understand by
> reading the code.
>> +
>> +	/* But limit ourselves to a batch size of 2g */
>
> [emily shaffer] nit: can you please capitalize G, GB, whatever :)

I suppose I could (get_unit_factor() performs case-insensitive
matches on the size suffixes), except that would be inconsistent with
the following error message in parse-options.c:

	return error(_("%s expects a non-negative integer value"
		       " with an optional k/m/g suffix"),

or these options in Documentation/git-pack-objects.txt:

	--window-memory=<n>::
		This option provides an additional limit on top of `--window`;
		the window size will dynamically scale down so as to not take
		up more than '<n>' bytes in memory. This is useful in
		repositories with a mix of large and small objects to not run
		out of memory with a large window, but still be able to take
		advantage of the large window for the smaller objects. The
		size can be suffixed with "k", "m", or "g".
		`--window-memory=0` makes memory usage unlimited. The default
		is taken from the `pack.windowMemory` configuration variable.

	--max-pack-size=<n>::
		In unusual scenarios, you may not be able to create files
		larger than a certain size on your filesystem, and this option
		can be used to tell the command to split the output packfile
		into multiple independent packfiles, each not larger than the
		given size. The size can be suffixed with "k", "m", or "g".
		The minimum size allowed is limited to 1 MiB. This option
		prevents the creation of a bitmap index.
		The default is unlimited, unless the config variable
		`pack.packSizeLimit` is set.

Perhaps that's not really a good reason, but upper vs lower case
seems to be arbitrary. Any tie breaker seems valid.

Thanks,
-Stolee
On 7/30/2020 12:57 PM, Derrick Stolee wrote:
> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>> index eb4b01c104..889d97afe7 100644
>>> --- a/builtin/gc.c
>>> +++ b/builtin/gc.c
>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>  	return result;
>>>  }
>>>
>>> +#define TWO_GIGABYTES (2147483647)
>>
>> [jonathan tan] This would be easier to understand if it was expressed
>> with a bitshift, e.g. 1 << 31.
>
> This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
> on Windows. But it's better to not use magic numbers and instead use
> operations like this.

Never mind. This breaks the build on 32-bit machines (even adding a
few "L" characters). I'll replace my magic decimal number with a
magic hexadecimal number.

Thanks,
-Stolee
On Thu, Jul 30, 2020 at 12:04 PM Derrick Stolee <stolee@gmail.com> wrote:
>> This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
>> on Windows. But it's better to not use magic numbers and instead use
>> operations like this.
>
> Nevermind. This breaks the build on 32-bit machines (even adding a few
> "L" characters). I'll replace my magic decimal number with a magic
> hexadecimal number.

You would need something along these lines:

    #define WHATEVER ((long)((1UL << 31) - 1))

i.e., make the constant 1 be "unsigned long" so that the shift is
well defined, then subtract 1, then convert back to signed long.

I say "something along" because there are several more ways to write
it. I'm not really a big fan of most of them, and 0x7FFFFFFF (or in
lowercase) is my own preference here too.

Chris
On 2020-07-30 15:02:32-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 7/30/2020 12:57 PM, Derrick Stolee wrote:
>> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>>> index eb4b01c104..889d97afe7 100644
>>>> --- a/builtin/gc.c
>>>> +++ b/builtin/gc.c
>>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>>  	return result;
>>>>  }
>>>>
>>>> +#define TWO_GIGABYTES (2147483647)
>>>
>>> [jonathan tan] This would be easier to understand if it was expressed
>>> with a bitshift, e.g. 1 << 31.
>>
>> This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
>> on Windows. But it's better to not use magic numbers and instead use
>> operations like this.
>
> Nevermind. This breaks the build on 32-bit machines (even adding a few
> "L" characters). I'll replace my magic decimal number with a magic
> hexadecimal number.

Would it be better if we use this C99 feature instead:

	#define TWO_GIGABYTES INT32_MAX

But it seems like it's available everywhere, except (maybe) Windows
(judging from UINT32_MAX's usage).
On 8/5/2020 8:37 AM, Đoàn Trần Công Danh wrote:
> On 2020-07-30 15:02:32-0400, Derrick Stolee <stolee@gmail.com> wrote:
>> On 7/30/2020 12:57 PM, Derrick Stolee wrote:
>>> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>>>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>>>> index eb4b01c104..889d97afe7 100644
>>>>> --- a/builtin/gc.c
>>>>> +++ b/builtin/gc.c
>>>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>>>  	return result;
>>>>>  }
>>>>>
>>>>> +#define TWO_GIGABYTES (2147483647)
>>>>
>>>> [jonathan tan] This would be easier to understand if it was expressed
>>>> with a bitshift, e.g. 1 << 31.
>>>
>>> This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
>>> on Windows. But it's better to not use magic numbers and instead use
>>> operations like this.
>>
>> Nevermind. This breaks the build on 32-bit machines (even adding a few
>> "L" characters). I'll replace my magic decimal number with a magic
>> hexadecimal number.
>
> Would it be better if we use this C99 feature instead:
>
>	#define TWO_GIGABYTES INT32_MAX
>
> But it seems like it's available everywhere, except (maybe) Windows
> (judging from UINT32_MAX's usage).

Thanks for the recommendation. This appears to work in all of the CI
builds, including Windows. I'll use that constant in my next version.
Perhaps we will discover a platform that doesn't have it by including
it.

-Stolee
diff --git a/builtin/gc.c b/builtin/gc.c
index eb4b01c104..889d97afe7 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
 	return result;
 }
 
+#define TWO_GIGABYTES (2147483647)
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+
+static off_t get_auto_pack_size(void)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly.
+	 *
+	 * The strategy we select here is to select a
+	 * size that is one more than the second largest
+	 * pack-file. This ensures that we will repack
+	 * at least two packs if there are three or more
+	 * packs.
+	 */
+	off_t max_size = 0;
+	off_t second_largest_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+	struct repository *r = the_repository;
+
+	reprepare_packed_git(r);
+	for (p = get_all_packs(r); p; p = p->next) {
+		if (p->pack_size > max_size) {
+			second_largest_size = max_size;
+			max_size = p->pack_size;
+		} else if (p->pack_size > second_largest_size)
+			second_largest_size = p->pack_size;
+	}
+
+	result_size = second_largest_size + 1;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
 static int multi_pack_index_repack(void)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
 
 	if (opts.quiet)
 		argv_array_push(&cmd, "--no-progress");
 
-	argv_array_push(&cmd, "--batch-size=0");
+	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
+		    (uintmax_t)get_auto_pack_size());
+	argv_array_push(&cmd, batch_arg.buf);
 
 	close_object_store(the_repository->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	strbuf_release(&batch_arg);
 
 	if (result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 3ec813979a..ab5c961eb9 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -134,10 +134,11 @@ test_expect_success 'incremental-repack task' '
 	test_line_count = 4 packs-between &&
 
 	# the job deletes the two old packs, and does not write
-	# a new one because only one pack remains.
+	# a new one because the batch size is not high enough to
+	# pack the largest pack-file.
 	git maintenance run --task=incremental-repack &&
 	ls .git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	test_line_count = 2 packs-after
 '
 
 test_done