Message ID | b6328c210625e1ba98e2065208a2a478c2c64f94.1595527000.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Maintenance builtin, allowing 'gc --auto' customization | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > 1. 'git multi-pack-index write' creates a multi-pack-index file if > one did not exist, and otherwise will update the multi-pack-index > with any new pack-files that appeared since the last write. This > is particularly relevant with the background fetch job. > > When the multi-pack-index sees two copies of the same object, it > stores the offset data into the newer pack-file. This means that > some old pack-files could become "unreferenced" which I will use > to mean "a pack-file that is in the pack-file list of the > multi-pack-index but none of the objects in the multi-pack-index > reference a location inside that pack-file." An obvious alternative is to favor the copy in the older pack, right? Is the expectation that over time, most of the objects that are relevant would reappear in newer packs, so that eventually by favoring the copies in the newer packs, we can retire and remove the old pack, keeping only the newer ones? But would that assumption hold? The old packs hold objects that are necessary for the older parts of the history, so unless you are cauterizing away the old history, these objects in the older packs are likely to stay with us longer than those used by the newer parts of the history, some of which may not even have been pushed out yet and can be rebased away? > 2. 'git multi-pack-index expire' deletes any unreferenced pack-files > and updaes the multi-pack-index to drop those pack-files from the > list. This is safe to do as concurrent Git processes will see the > multi-pack-index and not open those packs when looking for object > contents. (Similar to the 'loose-objects' job, there are some Git > commands that open pack-files regardless of the multi-pack-index, > but they are rarely used. Further, a user that self-selects to > use background operations would likely refrain from using those > commands.) OK. > 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set > of pack-files that are listed in the multi-pack-index and creates > a new pack-file containing the objects whose offsets are listed > by the multi-pack-index to be in those objects. The set of pack- > files is selected greedily by sorting the pack-files by modified > time and adding a pack-file to the set if its "expected size" is > smaller than the batch size until the total expected size of the > selected pack-files is at least the batch size. The "expected > size" is calculated by taking the size of the pack-file divided > by the number of objects in the pack-file and multiplied by the > number of objects from the multi-pack-index with offset in that > pack-file. The expected size approximats how much data from that approximates. > pack-file will contribute to the resulting pack-file size. The > intention is that the resulting pack-file will be close in size > to the provided batch size. > +static int maintenance_task_incremental_repack(void) > +{ > + if (multi_pack_index_write()) { > + error(_("failed to write multi-pack-index")); > + return 1; > + } > + > + if (multi_pack_index_verify()) { > + warning(_("multi-pack-index verify failed after initial write")); > + return rewrite_multi_pack_index(); > + } > + > + if (multi_pack_index_expire()) { > + error(_("multi-pack-index expire failed")); > + return 1; > + } > + > + if (multi_pack_index_verify()) { > + warning(_("multi-pack-index verify failed after expire")); > + return rewrite_multi_pack_index(); > + } > + if (multi_pack_index_repack()) { > + error(_("multi-pack-index repack failed")); > + return 1; > + } Hmph, I wonder if these warning should come from each helper functions that are static to this function anyway. It also makes it easier to reason about this function by eliminating the need for having a different pattern only for the verify helper. Instead, verify could call rewrite internally when it notices a breakage. I.e. if (multi_pack_index_write()) return 1; if (multi_pack_index_verify("after initial write")) return 1; if (multi_pack_index_exire()) return 1; ... Also, it feels odd, compared to our internal API convention, that positive non-zero is used as an error here. > + return 0; > +} > + > typedef int maintenance_task_fn(void); > > struct maintenance_task { > @@ -1037,6 +1152,10 @@ static void initialize_tasks(void) > tasks[num_tasks]->fn = maintenance_task_loose_objects; > num_tasks++; > > + tasks[num_tasks]->name = "incremental-repack"; > + tasks[num_tasks]->fn = maintenance_task_incremental_repack; > + num_tasks++; > + > tasks[num_tasks]->name = "gc"; > tasks[num_tasks]->fn = maintenance_task_gc; > tasks[num_tasks]->enabled = 1; Exactly the same comment as 08/18 about natural/inherent ordering applies here as well. Thanks.
On 7/23/2020 6:00 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> 1. 'git multi-pack-index write' creates a multi-pack-index file if >> one did not exist, and otherwise will update the multi-pack-index >> with any new pack-files that appeared since the last write. This >> is particularly relevant with the background fetch job. >> >> When the multi-pack-index sees two copies of the same object, it >> stores the offset data into the newer pack-file. This means that >> some old pack-files could become "unreferenced" which I will use >> to mean "a pack-file that is in the pack-file list of the >> multi-pack-index but none of the objects in the multi-pack-index >> reference a location inside that pack-file." > > An obvious alternative is to favor the copy in the older pack, > right? Is the expectation that over time, most of the objects that > are relevant would reappear in newer packs, so that eventually by > favoring the copies in the newer packs, we can retire and remove the > old pack, keeping only the newer ones? > > But would that assumption hold? The old packs hold objects that are > necessary for the older parts of the history, so unless you are > cauterizing away the old history, these objects in the older packs > are likely to stay with us longer than those used by the newer parts > of the history, some of which may not even have been pushed out yet > and can be rebased away? If we created a new pack-file containing an already-packed object, then shouldn't we assume that the new pack-file does a _better_ job of compressing that object? Or at least, doesn't make it worse? For example, if we use 'git multi-pack-index repack --batch-size=0', then this creates a new pack-file containing every previously-packed object. This new pack-file should have better delta compression than the previous setup across multiple pack-files. We want this new pack-file to be used, not the old ones. This "pick the newest pack" strategy is also what allows us to safely use the 'expire' option to drop old pack-files. If we always keep the old copies, then when we try to switch the new pack-files, we cannot delete the old packs safely because a concurrent Git process could be trying to reference it. One race is as follows: * Process A opens the multi-pack-index referring to old pack P. It doesn't open the pack-file as it hasn't tried to parse objects yet. * Process B is 'git multi-pack-index expire'. It sees that pack P can be dropped because all objects appear in newer pack-files. It deletes P. * Process A tries to read from P, and this fails. A must reprepare its representation of the pack-files. This is the less disruptive race since A can recover with a small cost to its performance. The worse race (on Windows) is this: * Process A loads the multi-pack-index and tries to parse an object by loading "old" pack P. * Process B tries to delete P. However, on Windows the handle to P by A prevents the deletion. At this point, there could be two resolutions. The first is to have the 'expire' fail because we can't delete A. This means we might never delete A in a busy repository. The second is that the 'expire' command continues and drops A from the list in the multi-pack-index. However, now all Git processes add A to the packed_git list because it isn't referenced by the multi-pack-index. In summary: the decision to pick the newer copies is a fundamental part of how the write->expire->repack loop was designed. >> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files >> and updaes the multi-pack-index to drop those pack-files from the >> list. This is safe to do as concurrent Git processes will see the >> multi-pack-index and not open those packs when looking for object >> contents. (Similar to the 'loose-objects' job, there are some Git >> commands that open pack-files regardless of the multi-pack-index, >> but they are rarely used. Further, a user that self-selects to >> use background operations would likely refrain from using those >> commands.) > > OK. > >> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set >> of pack-files that are listed in the multi-pack-index and creates >> a new pack-file containing the objects whose offsets are listed >> by the multi-pack-index to be in those objects. The set of pack- >> files is selected greedily by sorting the pack-files by modified >> time and adding a pack-file to the set if its "expected size" is >> smaller than the batch size until the total expected size of the >> selected pack-files is at least the batch size. The "expected >> size" is calculated by taking the size of the pack-file divided >> by the number of objects in the pack-file and multiplied by the >> number of objects from the multi-pack-index with offset in that >> pack-file. The expected size approximats how much data from that > > approximates. > >> pack-file will contribute to the resulting pack-file size. The >> intention is that the resulting pack-file will be close in size >> to the provided batch size. > >> +static int maintenance_task_incremental_repack(void) >> +{ >> + if (multi_pack_index_write()) { >> + error(_("failed to write multi-pack-index")); >> + return 1; >> + } >> + >> + if (multi_pack_index_verify()) { >> + warning(_("multi-pack-index verify failed after initial write")); >> + return rewrite_multi_pack_index(); >> + } >> + >> + if (multi_pack_index_expire()) { >> + error(_("multi-pack-index expire failed")); >> + return 1; >> + } >> + >> + if (multi_pack_index_verify()) { >> + warning(_("multi-pack-index verify failed after expire")); >> + return rewrite_multi_pack_index(); >> + } >> + if (multi_pack_index_repack()) { >> + error(_("multi-pack-index repack failed")); >> + return 1; >> + } > > Hmph, I wonder if these warning should come from each helper > functions that are static to this function anyway. > > It also makes it easier to reason about this function by eliminating > the need for having a different pattern only for the verify helper. > Instead, verify could call rewrite internally when it notices a > breakage. I.e. > > if (multi_pack_index_write()) > return 1; > if (multi_pack_index_verify("after initial write")) > return 1; > if (multi_pack_index_exire()) > return 1; > ... This is a cleaner model. I'll work on that. > Also, it feels odd, compared to our internal API convention, that > positive non-zero is used as an error here. ... > Exactly the same comment as 08/18 about natural/inherent ordering > applies here as well. I'll leave these to be resolved in the earlier messages. Thanks, -Stolee
On Thu, Jul 23, 2020 at 05:56:32PM +0000, Derrick Stolee via GitGitGadget wrote: > 1. 'git multi-pack-index write' creates a multi-pack-index file if > one did not exist, and otherwise will update the multi-pack-index > with any new pack-files that appeared since the last write. This > is particularly relevant with the background fetch job. [emily shaffer] Will this use midx even if the user has disabled it in their config? > diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh > index 94bb493733..3ec813979a 100755 > --- a/t/t7900-maintenance.sh > +++ b/t/t7900-maintenance.sh > @@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' ' > test_cmp packs-between packs-after > ' [emily shaffer] Can we include a test to prove that this task is or is not run if core.multipackindex is set or unset? That behavior is hard to deduce from the code... we might want to be cautious. > > +test_expect_success 'incremental-repack task' ' > + packDir=.git/objects/pack && > + for i in $(test_seq 1 5) > + do > + test_commit $i || return 1 > + done && > + > + # Create three disjoint pack-files with size BIG, small, small. > + echo HEAD~2 | git pack-objects --revs $packDir/test-1 && > + test_tick && > + git pack-objects --revs $packDir/test-2 <<-\EOF && > + HEAD~1 > + ^HEAD~2 > + EOF > + test_tick && > + git pack-objects --revs $packDir/test-3 <<-\EOF && > + HEAD > + ^HEAD~1 > + EOF > + rm -f $packDir/pack-* && > + rm -f $packDir/loose-* && > + ls $packDir/*.pack >packs-before && > + test_line_count = 3 packs-before && > + > + # the job repacks the two into a new pack, but does not > + # delete the old ones. > + git maintenance run --task=incremental-repack && > + ls $packDir/*.pack >packs-between && > + test_line_count = 4 packs-between && > + > + # the job deletes the two old packs, and does not write > + # a new one because only one pack remains. > + git maintenance run --task=incremental-repack && > + ls .git/objects/pack/*.pack >packs-after && > + test_line_count = 1 packs-after > +' > + > test_done > -- > gitgitgadget >
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt index 557915a653..bda8df4aaa 100644 --- a/Documentation/git-maintenance.txt +++ b/Documentation/git-maintenance.txt @@ -84,6 +84,21 @@ loose-objects:: thousand objects to prevent the job from taking too long on a repository with many loose objects. +incremental-repack:: + The `incremental-repack` job repacks the object directory + using the `multi-pack-index` feature. In order to prevent race + conditions with concurrent Git commands, it follows a two-step + process. First, it deletes any pack-files included in the + `multi-pack-index` where none of the objects in the + `multi-pack-index` reference those pack-files; this only happens + if all objects in the pack-file are also stored in a newer + pack-file. Second, it selects a group of pack-files whose "expected + size" is below the batch size until the group has total expected + size at least the batch size; see the `--batch-size` option for + the `repack` subcommand in linkgit:git-multi-pack-index[1]. The + default batch-size is zero, which is a special case that attempts + to repack all pack-files into a single pack-file. + OPTIONS ------- --auto:: diff --git a/builtin/gc.c b/builtin/gc.c index fa65304580..eb4b01c104 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -29,6 +29,7 @@ #include "tree.h" #include "promisor-remote.h" #include "remote.h" +#include "midx.h" #define FAILED_RUN "failed to run %s" @@ -701,7 +702,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) return 0; } -#define MAX_NUM_TASKS 4 +#define MAX_NUM_TASKS 5 static const char * const builtin_maintenance_usage[] = { N_("git maintenance run [<options>]"), @@ -958,6 +959,120 @@ static int maintenance_task_loose_objects(void) return prune_packed() || pack_loose(); } +static int multi_pack_index_write(void) +{ + int result; + struct argv_array cmd = ARGV_ARRAY_INIT; + argv_array_pushl(&cmd, "multi-pack-index", "write", NULL); + + if (opts.quiet) + argv_array_push(&cmd, "--no-progress"); + + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD); + argv_array_clear(&cmd); + + return result; +} + +static int rewrite_multi_pack_index(void) +{ + struct repository *r = the_repository; + char *midx_name = get_midx_filename(r->objects->odb->path); + + unlink(midx_name); + free(midx_name); + + if (multi_pack_index_write()) { + error(_("failed to rewrite multi-pack-index")); + return 1; + } + + return 0; +} + +static int multi_pack_index_verify(void) +{ + int result; + struct argv_array cmd = ARGV_ARRAY_INIT; + argv_array_pushl(&cmd, "multi-pack-index", "verify", NULL); + + if (opts.quiet) + argv_array_push(&cmd, "--no-progress"); + + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD); + argv_array_clear(&cmd); + + return result; +} + +static int multi_pack_index_expire(void) +{ + int result; + struct argv_array cmd = ARGV_ARRAY_INIT; + argv_array_pushl(&cmd, "multi-pack-index", "expire", NULL); + + if (opts.quiet) + argv_array_push(&cmd, "--no-progress"); + + close_object_store(the_repository->objects); + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD); + argv_array_clear(&cmd); + + return result; +} + +static int multi_pack_index_repack(void) +{ + int result; + struct argv_array cmd = ARGV_ARRAY_INIT; + argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL); + + if (opts.quiet) + argv_array_push(&cmd, "--no-progress"); + + argv_array_push(&cmd, "--batch-size=0"); + + close_object_store(the_repository->objects); + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD); + + if (result && multi_pack_index_verify()) { + warning(_("multi-pack-index verify failed after repack")); + result = rewrite_multi_pack_index(); + } + + return result; +} + +static int maintenance_task_incremental_repack(void) +{ + if (multi_pack_index_write()) { + error(_("failed to write multi-pack-index")); + return 1; + } + + if (multi_pack_index_verify()) { + warning(_("multi-pack-index verify failed after initial write")); + return rewrite_multi_pack_index(); + } + + if (multi_pack_index_expire()) { + error(_("multi-pack-index expire failed")); + return 1; + } + + if (multi_pack_index_verify()) { + warning(_("multi-pack-index verify failed after expire")); + return rewrite_multi_pack_index(); + } + + if (multi_pack_index_repack()) { + error(_("multi-pack-index repack failed")); + return 1; + } + + return 0; +} + typedef int maintenance_task_fn(void); struct maintenance_task { @@ -1037,6 +1152,10 @@ static void initialize_tasks(void) tasks[num_tasks]->fn = maintenance_task_loose_objects; num_tasks++; + tasks[num_tasks]->name = "incremental-repack"; + tasks[num_tasks]->fn = maintenance_task_incremental_repack; + num_tasks++; + tasks[num_tasks]->name = "gc"; tasks[num_tasks]->fn = maintenance_task_gc; tasks[num_tasks]->enabled = 1; diff --git a/midx.c b/midx.c index 6d1584ca51..57a8a00082 100644 --- a/midx.c +++ b/midx.c @@ -36,7 +36,7 @@ #define PACK_EXPIRED UINT_MAX -static char *get_midx_filename(const char *object_dir) +char *get_midx_filename(const char *object_dir) { return xstrfmt("%s/pack/multi-pack-index", object_dir); } diff --git a/midx.h b/midx.h index b18cf53bc4..baeecc70c9 100644 --- a/midx.h +++ b/midx.h @@ -37,6 +37,7 @@ struct multi_pack_index { #define MIDX_PROGRESS (1 << 0) +char *get_midx_filename(const char *object_dir); struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local); int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result); diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh index 94bb493733..3ec813979a 100755 --- a/t/t7900-maintenance.sh +++ b/t/t7900-maintenance.sh @@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' ' test_cmp packs-between packs-after ' +test_expect_success 'incremental-repack task' ' + packDir=.git/objects/pack && + for i in $(test_seq 1 5) + do + test_commit $i || return 1 + done && + + # Create three disjoint pack-files with size BIG, small, small. + echo HEAD~2 | git pack-objects --revs $packDir/test-1 && + test_tick && + git pack-objects --revs $packDir/test-2 <<-\EOF && + HEAD~1 + ^HEAD~2 + EOF + test_tick && + git pack-objects --revs $packDir/test-3 <<-\EOF && + HEAD + ^HEAD~1 + EOF + rm -f $packDir/pack-* && + rm -f $packDir/loose-* && + ls $packDir/*.pack >packs-before && + test_line_count = 3 packs-before && + + # the job repacks the two into a new pack, but does not + # delete the old ones. + git maintenance run --task=incremental-repack && + ls $packDir/*.pack >packs-between && + test_line_count = 4 packs-between && + + # the job deletes the two old packs, and does not write + # a new one because only one pack remains. + git maintenance run --task=incremental-repack && + ls .git/objects/pack/*.pack >packs-after && + test_line_count = 1 packs-after +' + test_done