
[v2,09/18] maintenance: add loose-objects task

Message ID 83648f48655ba68126110018d81c1d2e2bcc7a6f.1595527000.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series Maintenance builtin, allowing 'gc --auto' customization

Commit Message

Derrick Stolee via GitGitGadget July 23, 2020, 5:56 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. The same can happen in
   partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs
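
As an illustration only (the task drives 'git pack-objects' directly
from C and applies the 50,000 cut-off internally, not via 'head'), the
sequence is roughly equivalent to:

    # Step 1: drop loose objects that already exist in a pack-file.
    git prune-packed --quiet

    # Step 2: feed up to 50,000 loose object names to pack-objects,
    # which writes a pack whose name starts with "loose-".
    find .git/objects/[0-9a-f][0-9a-f] -type f |
        sed -e 's|.*objects/\(..\)/\(.*\)|\1\2|' |
        head -n 50000 |
        git pack-objects --quiet .git/objects/pack/loose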

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  11 ++++
 builtin/gc.c                      | 106 +++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            |  35 ++++++++++
 3 files changed, 151 insertions(+), 1 deletion(-)

Comments

Junio C Hamano July 23, 2020, 8:59 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Create a 'loose-objects' task for the 'git maintenance run' command.
> This helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:
>
> 1. Run 'git prune-packed' to delete any loose objects that exist
>    in a pack-file. Concurrent commands will prefer the packed
>    version of the object to the loose version. (Of course, there
>    are exceptions for commands that specifically care about the
>    location of an object. These are rare for a user to run on
>    purpose, and we hope a user that has selected background
>    maintenance will not be trying to do foreground maintenance.)

OK.  That would make sense.

> 2. Run 'git pack-objects' on a batch of loose objects. These
>    objects are grouped by scanning the loose object directories in
>    lexicographic order until listing all loose objects -or-
>    reaching 50,000 objects. This is more than enough if the loose
>    objects are created only by a user doing normal development.

I haven't seen this in action, but my gut feeling is that this would
result in horrible locality and deltification in the resulting
packfile.  The order you feed the objects to pack-objects and the
path hint you attach to each object matters quite a lot.

I do agree that it would be useful to have a task to deal with only
loose objects without touching existing packfiles.  I just am not
sure if 2. is a worthwhile thing to do.  A poorly constructed pack
will also contaminate later packfiles made without "-f" option to
"git repack".
Derrick Stolee July 24, 2020, 2:50 p.m. UTC | #2
On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>> 1. Run 'git prune-packed' to delete any loose objects that exist
>>    in a pack-file. Concurrent commands will prefer the packed
>>    version of the object to the loose version. (Of course, there
>>    are exceptions for commands that specifically care about the
>>    location of an object. These are rare for a user to run on
>>    purpose, and we hope a user that has selected background
>>    maintenance will not be trying to do foreground maintenance.)
> 
> OK.  That would make sense.
> 
>> 2. Run 'git pack-objects' on a batch of loose objects. These
>>    objects are grouped by scanning the loose object directories in
>>    lexicographic order until listing all loose objects -or-
>>    reaching 50,000 objects. This is more than enough if the loose
>>    objects are created only by a user doing normal development.
> 
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile.  The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
> 
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles.  I just am not
> sure if 2. is a worthwhile thing to do.  A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".

There are several factors going on here:

 * In a partial clone, it is likely that we get loose objects only
   due to a command like "git log -p" that downloads blobs
   one-by-one. In such a case, this step coming in later and picking
   up those blobs _will_ find good deltas because they are present
   in the same batch.

 * (I know this case isn't important to core Git, but please indulge
   me) In a VFS for Git repo, the loose objects correspond to blobs
   that were faulted in by a virtual filesystem read. In this case,
   the blobs are usually from a single commit in history, so good
   deltas between the blobs don't actually exist!

 * My experience indicates that the packs created by the
   loose-objects task are rather small (when created daily). This
   means that they get selected by the incremental-repack task to
   repack into a new pack-file where deltas are recomputed with modest
   success. As mentioned in that task, we saw a significant compression
   factor using that step for users of the Windows OS repo, mostly due
   to recomputing tree deltas.

 * Some amount of "extra" space is expected with this incremental
   repacking scheme. The most space-efficient thing to do is a full
   repack along with a tree walk that detects the paths used for each
   blob, allowing better hints for delta compression. However, that
   operation is very _time_ consuming. The trade-off here is something
   I should make more explicit (see the sketch after this list). In my
   experience, disk space is cheap but CPU time is expensive. Most
   repositories could probably do a daily full repack without disrupting
   the user. These steps enable maintenance for repositories where a
   full repack is too disruptive.
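
A sketch of that schedule (the exact cadence is illustrative, and the
full repack is the "-f" style command discussed above):

    # frequent and cheap: batch new loose objects without touching
    # existing pack-files
    git maintenance run --task=loose-objects

    # occasional and expensive: a full repack with recomputed deltas,
    # the most space-efficient but most disruptive option
    git repack -a -d -f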

I hope this adds some context. I would love if someone who knows more
about delta compression could challenge my assumptions. Sharing that
expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.

Thanks,
-Stolee
Junio C Hamano July 24, 2020, 7:57 p.m. UTC | #3
Derrick Stolee <stolee@gmail.com> writes:

>  * In a partial clone, it is likely that we get loose objects only
>    due to a command like "git log -p" that downloads blobs
>    one-by-one. In such a case, this step coming in later and picking
>    up those blobs _will_ find good deltas because they are present
>    in the same batch.
>
>  * (I know this case isn't important to core Git, but please indulge
>    me) In a VFS for Git repo, the loose objects correspond to blobs
>    that were faulted in by a virtual filesystem read. In this case,
>    the blobs are usually from a single commit in history, so good
>    deltas between the blobs don't actually exist!

Let me stop here by saying that I am now starting to worry about
overfitting the repacking strategy to lazy clone repositories.  I am
perfectly fine with the plan to start with just one strategy overfit
for partially cloned repositories, as long as we make sure that it
can be extended to suit other access/object acquisition patterns.
Emily Shaffer July 29, 2020, 10:21 p.m. UTC | #4
On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
> +loose-objects::
> +	The `loose-objects` job cleans up loose objects and places them into
> +	pack-files. In order to prevent race conditions with concurrent Git
> +	commands, it follows a two-step process. First, it deletes any loose
> +	objects that already exist in a pack-file; concurrent Git processes
> +	will examine the pack-file for the object data instead of the loose
> +	object. Second, it creates a new pack-file (starting with "loose-")

[jonathan tan + jonathan nieder] If you are going to document this,
probably it should also be tested, so the documentation does not become
stale. Or, just don't document it.

> +static int pack_loose(void)
> +{
> +	struct repository *r = the_repository;
> +	int result = 0;
> +	struct write_loose_object_data data;
> +	struct strbuf prefix = STRBUF_INIT;
> +	struct child_process *pack_proc;
> +
> +	/*
> +	 * Do not start pack-objects process
> +	 * if there are no loose objects.
> +	 */
> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
> +					   loose_object_exists,
> +					   NULL, NULL, NULL))

[emily] To me, this is unintuitive - but upon inspection, it's exiting
the foreach early if any loose object is found, so this is cheaper than
actually counting. Maybe a comment would help to understand? Or we could
name the function differently, like "bail_if_loose()" or something?

> +test_expect_success 'loose-objects task' '
> +	# Repack everything so we know the state of the object dir
> +	git repack -adk &&
> +
> +	# Hack to stop maintenance from running during "git commit"
> +	echo in use >.git/objects/maintenance.lock &&
> +	test_commit create-loose-object &&

[jonathan nieder] Does it make sense to use a different git command
which is guaranteed to make a loose object? Is 'git commit' futureproof,
if we decide commits should directly create packs in the future? 'git
unpack-objects' is guaranteed to make a loose object, although it is
clumsy because it needs a packfile to begin with...

[jonathan tan] But, using 'git commit' is easier to understand in this
context. Maybe commenting to say that we assume 'git commit' makes 1 or
more loose objects will be enough to futureproof - then we have a signal
to whoever made a change to make this fail, and that person knows how to
fix this test.
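
For example, a comment along these lines in the test (wording is a
sketch, not from the patch as posted):

    # "git commit" is assumed to create at least one loose object.
    # If commits ever start writing packs directly, replace this with
    # a command that is guaranteed to write loose objects.
    test_commit create-loose-object &&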
Derrick Stolee July 30, 2020, 3:38 p.m. UTC | #5
On 7/29/2020 6:21 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +loose-objects::
>> +	The `loose-objects` job cleans up loose objects and places them into
>> +	pack-files. In order to prevent race conditions with concurrent Git
>> +	commands, it follows a two-step process. First, it deletes any loose
>> +	objects that already exist in a pack-file; concurrent Git processes
>> +	will examine the pack-file for the object data instead of the loose
>> +	object. Second, it creates a new pack-file (starting with "loose-")
> 
> [jonathan tan + jonathan nieder] If you are going to document this,
> probably it should also be tested, so the documentation does not become
> stale. Or, just don't document it.

Adding a condition to the test is a good idea.

>> +static int pack_loose(void)
>> +{
>> +	struct repository *r = the_repository;
>> +	int result = 0;
>> +	struct write_loose_object_data data;
>> +	struct strbuf prefix = STRBUF_INIT;
>> +	struct child_process *pack_proc;
>> +
>> +	/*
>> +	 * Do not start pack-objects process
>> +	 * if there are no loose objects.
>> +	 */
>> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
>> +					   loose_object_exists,
>> +					   NULL, NULL, NULL))
> 
> [emily] To me, this is unintuitive - but upon inspection, it's exiting
> the foreach early if any loose object is found, so this is cheaper than
> actually counting. Maybe a comment would help to understand? Or we could
> name the function differently, like "bail_if_loose()" or something?

Sure. "bail_on_loose()" makes more sense to me.

>> +test_expect_success 'loose-objects task' '
>> +	# Repack everything so we know the state of the object dir
>> +	git repack -adk &&
>> +
>> +	# Hack to stop maintenance from running during "git commit"
>> +	echo in use >.git/objects/maintenance.lock &&
>> +	test_commit create-loose-object &&
> 
> [jonathan nieder] Does it make sense to use a different git command
> which is guaranteed to make a loose object? Is 'git commit' futureproof,
> if we decide commits should directly create packs in the future? 'git
> unpack-objects' is guaranteed to make a loose object, although it is
> clumsy because it needs a packfile to begin with...

"unpack-objects" also has the problem that it won't write the loose
object if it exists as a packed object, so it requires moving the
pack-file out of the object directory first. (That's also required
for testing that the loose objects get packed by the maintenance
task, but I'd rather not need to remove the multi-pack-index, too.)

> [jonathan tan] But, using 'git commit' is easier to understand in this
> context. Maybe commenting to say that we assume 'git commit' makes 1 or
> more loose objects will be enough to futureproof - then we have a signal
> to whoever made a change to make this fail, and that person knows how to
> fix this test.

I'll add the comment for now. Thanks.

-Stolee

Patch

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 0927643247..557915a653 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,17 @@  gc::
 	It can also be disruptive in some situations, as it deletes stale
 	data.
 
+loose-objects::
+	The `loose-objects` job cleans up loose objects and places them into
+	pack-files. In order to prevent race conditions with concurrent Git
+	commands, it follows a two-step process. First, it deletes any loose
+	objects that already exist in a pack-file; concurrent Git processes
+	will examine the pack-file for the object data instead of the loose
+	object. Second, it creates a new pack-file (starting with "loose-")
+	containing a batch of loose objects. The batch size is limited to 50
+	thousand objects to prevent the job from taking too long on a
+	repository with many loose objects.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 969c127877..fa65304580 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -701,7 +701,7 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 3
+#define MAX_NUM_TASKS 4
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -858,6 +858,106 @@  static int maintenance_task_gc(void)
 	return result;
 }
 
+static int prune_packed(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "prune-packed", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--quiet");
+
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+struct write_loose_object_data {
+	FILE *in;
+	int count;
+	int batch_size;
+};
+
+static int loose_object_exists(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+				       const char *path,
+				       void *data)
+{
+	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+	fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+	return ++(d->count) > d->batch_size;
+}
+
+static int pack_loose(void)
+{
+	struct repository *r = the_repository;
+	int result = 0;
+	struct write_loose_object_data data;
+	struct strbuf prefix = STRBUF_INIT;
+	struct child_process *pack_proc;
+
+	/*
+	 * Do not start pack-objects process
+	 * if there are no loose objects.
+	 */
+	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
+					   loose_object_exists,
+					   NULL, NULL, NULL))
+		return 0;
+
+	pack_proc = xmalloc(sizeof(*pack_proc));
+
+	child_process_init(pack_proc);
+
+	strbuf_addstr(&prefix, r->objects->odb->path);
+	strbuf_addstr(&prefix, "/pack/loose");
+
+	argv_array_pushl(&pack_proc->args, "git", "pack-objects", NULL);
+	if (opts.quiet)
+		argv_array_push(&pack_proc->args, "--quiet");
+	argv_array_push(&pack_proc->args, prefix.buf);
+
+	pack_proc->in = -1;
+
+	if (start_command(pack_proc)) {
+		error(_("failed to start 'git pack-objects' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	data.in = xfdopen(pack_proc->in, "w");
+	data.count = 0;
+	data.batch_size = 50000;
+
+	for_each_loose_file_in_objdir(r->objects->odb->path,
+				      write_loose_object_to_stdin,
+				      NULL,
+				      NULL,
+				      &data);
+
+	fclose(data.in);
+
+	if (finish_command(pack_proc)) {
+		error(_("failed to finish 'git pack-objects' process"));
+		result = 1;
+	}
+
+cleanup:
+	strbuf_release(&prefix);
+	free(pack_proc);
+	return result;
+}
+
+static int maintenance_task_loose_objects(void)
+{
+	return prune_packed() || pack_loose();
+}
+
 typedef int maintenance_task_fn(void);
 
 struct maintenance_task {
@@ -933,6 +1033,10 @@  static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_prefetch;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "loose-objects";
+	tasks[num_tasks]->fn = maintenance_task_loose_objects;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 8b04a04c79..94bb493733 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -68,4 +68,39 @@  test_expect_success 'prefetch multiple remotes' '
 	git log prefetch/remote2/two
 '
 
+test_expect_success 'loose-objects task' '
+	# Repack everything so we know the state of the object dir
+	git repack -adk &&
+
+	# Hack to stop maintenance from running during "git commit"
+	echo in use >.git/objects/maintenance.lock &&
+	test_commit create-loose-object &&
+	rm .git/objects/maintenance.lock &&
+
+	ls .git/objects >obj-dir-before &&
+	test_file_not_empty obj-dir-before &&
+	ls .git/objects/pack/*.pack >packs-before &&
+	test_line_count = 1 packs-before &&
+
+	# The first run creates a pack-file
+	# but does not delete loose objects.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-between &&
+	test_cmp obj-dir-before obj-dir-between &&
+	ls .git/objects/pack/*.pack >packs-between &&
+	test_line_count = 2 packs-between &&
+
+	# The second run deletes loose objects
+	# but does not create a pack-file.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-after &&
+	cat >expect <<-\EOF &&
+	info
+	pack
+	EOF
+	test_cmp expect obj-dir-after &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_cmp packs-between packs-after
+'
+
 test_done
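
To observe the documented behavior interactively, a rough sequence (a
sketch; the pack name and counts follow the documentation and test
above):

    git repack -adk                            # start from a single pack
    git commit --allow-empty -m tmp            # write at least one loose object
    git maintenance run --task=loose-objects   # creates pack/loose-*.pack
    git count-objects                          # loose objects still present
    git maintenance run --task=loose-objects   # second run prunes packed loose objects
    git count-objects                          # loose count drops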