diff mbox series

[3/9] maintenance: add loose-objects task

Message ID 621375a3c99014c48568660bde062b7330d5a662.1596731425.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Maintenance II: prefetch, loose-objects, incremental-repack tasks | expand

Commit Message

Jean-Noël Avila via GitGitGadget Aug. 6, 2020, 4:30 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. This has potential of
   happening in partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 11 ++++
 builtin/gc.c                      | 97 +++++++++++++++++++++++++++++++
 t/t7900-maintenance.sh            | 39 +++++++++++++
 3 files changed, 147 insertions(+)

Comments

Emily Shaffer Aug. 12, 2020, 11:10 p.m. UTC | #1
On Thu, Aug 06, 2020 at 04:30:18PM +0000, Derrick Stolee via GitGitGadget wrote:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index bb0d5eded4..898aff4726 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -80,6 +80,17 @@ gc::
>  	It can also be disruptive in some situations, as it deletes stale
>  	data.
>  
> +loose-objects::
> +	The `loose-objects` job cleans up loose objects and places them into
> +	pack-files. In order to prevent race conditions with concurrent Git
> +	commands, it follows a two-step process. First, it deletes any loose
> +	objects that already exist in a pack-file; concurrent Git processes
> +	will examine the pack-file for the object data instead of the loose
> +	object. Second, it creates a new pack-file (starting with "loose-")
> +	containing a batch of loose objects. The batch size is limited to 50
> +	thousand objects to prevent the job from taking too long on a
> +	repository with many loose objects.

[emily] What's not said is what happens to loose-* packfile. Does this
get repacked and disappear, is it treated differently, etc?
[jonathantan] This is treated the same as any other pack. Is there a
reason to call it loose-*?
[jrnieder] It seems like unreachable objects will get stuck going in and
out of this pack, right? unreachable loose obj -> loose-*.pack ->
repack, unreachable becomes loose -> repeat

> +static int prune_packed(struct maintenance_opts *opts)
> +{
> +	struct child_process child = CHILD_PROCESS_INIT;
> +
> +	child.git_cmd = 1;
> +	strvec_push(&child.args, "prune-packed");
> +
> +	if (opts->quiet)
> +		strvec_push(&child.args, "--quiet");
> +
> +	return !!run_command(&child);
[emily] Why not report the error code here to the caller? Is there a
path to notify user of errors that require user intervention?
[jrnieder] Some errors might be expected and some might not, so should
we handle them differently?
[emily] Probably makes more sense to revisit this in the far future when
we teach 'git maintenance' to send us emails and flashing lights when
our jobs fail, instead of worrying about it now :)
[jrnieder] Imagine if we got exit 127 because the usage we are using is
wrong - in that case, for example, we would want to BUG()

> +static int write_loose_object_to_stdin(const struct object_id *oid,
> +				       const char *path,
> +				       void *data)
> +{
> +	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
[jrnieder] Since we are enlightened C developers, not C++ developers,
you can skip the explicit cast.

> +++ b/t/t7900-maintenance.sh
> @@ -83,4 +83,43 @@ test_expect_success 'prefetch multiple remotes' '
>  	git log prefetch/remote2/two
>  '
>  
> +test_expect_success 'loose-objects task' '
[jrnieder] Unrelated to this change, this test makes me sad that we
don't have better low-level test helpers to enable this kind of testing.
Makes me wish for something better :)
> +	# Repack everything so we know the state of the object dir
> +	git repack -adk &&
> +
> +	# Hack to stop maintenance from running during "git commit"
> +	echo in use >.git/objects/maintenance.lock &&
> +
> +	# Assuming that "git commit" creates at least one loose object
> +	test_commit create-loose-object &&
> +	rm .git/objects/maintenance.lock &&
> +
> +	ls .git/objects >obj-dir-before &&
> +	test_file_not_empty obj-dir-before &&
> +	ls .git/objects/pack/*.pack >packs-before &&
> +	test_line_count = 1 packs-before &&
> +
> +	# The first run creates a pack-file
> +	# but does not delete loose objects.
> +	git maintenance run --task=loose-objects &&
> +	ls .git/objects >obj-dir-between &&
> +	test_cmp obj-dir-before obj-dir-between &&
> +	ls .git/objects/pack/*.pack >packs-between &&
> +	test_line_count = 2 packs-between &&
> +	ls .git/objects/pack/loose-*.pack >loose-packs &&
> +	test_line_count = 1 loose-packs &&
> +
> +	# The second run deletes loose objects
> +	# but does not create a pack-file.
> +	git maintenance run --task=loose-objects &&
> +	ls .git/objects >obj-dir-after &&
> +	cat >expect <<-\EOF &&
> +	info
> +	pack
> +	EOF
> +	test_cmp expect obj-dir-after &&
> +	ls .git/objects/pack/*.pack >packs-after &&
> +	test_cmp packs-between packs-after
> +'
> +
Derrick Stolee Aug. 14, 2020, 1:46 a.m. UTC | #2
On 8/12/2020 7:10 PM, Emily Shaffer wrote:
> On Thu, Aug 06, 2020 at 04:30:18PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
>> index bb0d5eded4..898aff4726 100644
>> --- a/Documentation/git-maintenance.txt
>> +++ b/Documentation/git-maintenance.txt
>> @@ -80,6 +80,17 @@ gc::
>>  	It can also be disruptive in some situations, as it deletes stale
>>  	data.
>>  
>> +loose-objects::
>> +	The `loose-objects` job cleans up loose objects and places them into
>> +	pack-files. In order to prevent race conditions with concurrent Git
>> +	commands, it follows a two-step process. First, it deletes any loose
>> +	objects that already exist in a pack-file; concurrent Git processes
>> +	will examine the pack-file for the object data instead of the loose
>> +	object. Second, it creates a new pack-file (starting with "loose-")
>> +	containing a batch of loose objects. The batch size is limited to 50
>> +	thousand objects to prevent the job from taking too long on a
>> +	repository with many loose objects.
> 
> [emily] What's not said is what happens to loose-* packfile. Does this
> get repacked and disappear, is it treated differently, etc?
> [jonathantan] This is treated the same as any other pack. Is there a
> reason to call it loose-*?

Auditing, mostly. When working with repositories that have
multiple pack-files as a rule, it is helpful to know what "kind"
of pack-file they are. I almost added a "repacked-" prefix to
the packs produced by 'git multi-pack-index repack'. It helps
to distinguish "reprocessed" data from "I got this from a 'clone'
or 'fetch'."

Of course, as these pack-files are repacked later, the prefixes
will get dropped. The name does give us some help when performing
diagnostics on a misbehaving repository. A pack directory filled
with many "loose-*.pack" files indicates that the 'loose-objects'
step was run many times without either 'gc' or 'incremental-repack'
coalescing those packs with delta compression.

> [jrnieder] It seems like unreachable objects will get stuck going in and
> out of this pack, right? unreachable loose obj -> loose-*.pack ->
> repack, unreachable becomes loose -> repeat

This is a good point. This task would probably interact poorly with
the 'gc' task in that regard. They are not intended to both be
enabled at the same time, but it might be good to point that out
directly in the documentation.

How does this sound?

 The `gc` task writes unreachable objects as loose objects to
 be cleaned up by a later step only if they are not re-added to
 a pack-file; for this reason it is not advisable to enable both
 the `loose-objects` and `gc` tasks at the same time.

>> +static int prune_packed(struct maintenance_opts *opts)
>> +{
>> +	struct child_process child = CHILD_PROCESS_INIT;
>> +
>> +	child.git_cmd = 1;
>> +	strvec_push(&child.args, "prune-packed");
>> +
>> +	if (opts->quiet)
>> +		strvec_push(&child.args, "--quiet");
>> +
>> +	return !!run_command(&child);

> [emily] Why not report the error code here to the caller? Is there a
> path to notify user of errors that require user intervention?
> [jrnieder] Some errors might be expected and some might not, so should
> we handle them differently?
> [emily] Probably makes more sense to revisit this in the far future when
> we teach 'git maintenance' to send us emails and flashing lights when
> our jobs fail, instead of worrying about it now :)
> [jrnieder] Imagine if we got exit 127 because the usage we are using is
> wrong - in that case, for example, we would want to BUG()

I was previously returning the exact error code, but received
feedback that these internal methods should have simpler return
values [1].

[1] https://lore.kernel.org/git/xmqq4kpyq8wh.fsf@gitster.c.googlers.com/

>> +static int write_loose_object_to_stdin(const struct object_id *oid,
>> +				       const char *path,
>> +				       void *data)
>> +{
>> +	struct write_loose_object_data *d = (struct write_loose_object_data *)data;

> [jrnieder] Since we are enlightened C developers, not C++ developers,
> you can skip the explicit cast.

Hm. I definitely _could_ but don't really want to. Perhaps I'm just
being pedantic, but I was just talking to someone today off-list that
they were disappointed that they couldn't compile Git in a C++ compiler
mainly because of these kinds of casting issues.

I was surprised the word "cast" does not appear anywhere in the
CodingGuidelines, because if this is a convention we want to follow,
then I have definitely been violating it for a *while*.
 
>> +++ b/t/t7900-maintenance.sh
>> @@ -83,4 +83,43 @@ test_expect_success 'prefetch multiple remotes' '
>>  	git log prefetch/remote2/two
>>  '
>>  
>> +test_expect_success 'loose-objects task' '

> [jrnieder] Unrelated to this change, this test makes me sad that we
> don't have better low-level test helpers to enable this kind of testing.> Makes me wish for something better :)

Do you mean you wish we had helpers for getting a repo
into a certain state of its object database? Or somehow
mocking the object database in-memory to provide more
unit-level testing? I don't understand what "something
better" looks like.

>> +	# Repack everything so we know the state of the object dir
>> +	git repack -adk &&
>> +
>> +	# Hack to stop maintenance from running during "git commit"
>> +	echo in use >.git/objects/maintenance.lock &&
>> +
>> +	# Assuming that "git commit" creates at least one loose object
>> +	test_commit create-loose-object &&
>> +	rm .git/objects/maintenance.lock &&
>> +
>> +	ls .git/objects >obj-dir-before &&
>> +	test_file_not_empty obj-dir-before &&
>> +	ls .git/objects/pack/*.pack >packs-before &&
>> +	test_line_count = 1 packs-before &&
>> +
>> +	# The first run creates a pack-file
>> +	# but does not delete loose objects.
>> +	git maintenance run --task=loose-objects &&
>> +	ls .git/objects >obj-dir-between &&
>> +	test_cmp obj-dir-before obj-dir-between &&
>> +	ls .git/objects/pack/*.pack >packs-between &&
>> +	test_line_count = 2 packs-between &&
>> +	ls .git/objects/pack/loose-*.pack >loose-packs &&
>> +	test_line_count = 1 loose-packs &&
>> +
>> +	# The second run deletes loose objects
>> +	# but does not create a pack-file.
>> +	git maintenance run --task=loose-objects &&
>> +	ls .git/objects >obj-dir-after &&
>> +	cat >expect <<-\EOF &&
>> +	info
>> +	pack
>> +	EOF
>> +	test_cmp expect obj-dir-after &&
>> +	ls .git/objects/pack/*.pack >packs-after &&
>> +	test_cmp packs-between packs-after
>> +'
>> +
>
diff mbox series

Patch

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index bb0d5eded4..898aff4726 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -80,6 +80,17 @@  gc::
 	It can also be disruptive in some situations, as it deletes stale
 	data.
 
+loose-objects::
+	The `loose-objects` job cleans up loose objects and places them into
+	pack-files. In order to prevent race conditions with concurrent Git
+	commands, it follows a two-step process. First, it deletes any loose
+	objects that already exist in a pack-file; concurrent Git processes
+	will examine the pack-file for the object data instead of the loose
+	object. Second, it creates a new pack-file (starting with "loose-")
+	containing a batch of loose objects. The batch size is limited to 50
+	thousand objects to prevent the job from taking too long on a
+	repository with many loose objects.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 942b7ea535..60261d2647 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -908,6 +908,98 @@  static int maintenance_task_gc(struct maintenance_opts *opts)
 	return run_command(&child);
 }
 
+static int prune_packed(struct maintenance_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_push(&child.args, "prune-packed");
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--quiet");
+
+	return !!run_command(&child);
+}
+
+struct write_loose_object_data {
+	FILE *in;
+	int count;
+	int batch_size;
+};
+
+static int bail_on_loose(const struct object_id *oid,
+			 const char *path,
+			 void *data)
+{
+	return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+				       const char *path,
+				       void *data)
+{
+	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+	fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+	return ++(d->count) > d->batch_size;
+}
+
+static int pack_loose(struct maintenance_opts *opts)
+{
+	struct repository *r = the_repository;
+	int result = 0;
+	struct write_loose_object_data data;
+	struct child_process pack_proc = CHILD_PROCESS_INIT;
+
+	/*
+	 * Do not start pack-objects process
+	 * if there are no loose objects.
+	 */
+	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
+					   bail_on_loose,
+					   NULL, NULL, NULL))
+		return 0;
+
+	pack_proc.git_cmd = 1;
+
+	strvec_push(&pack_proc.args, "pack-objects");
+	if (opts->quiet)
+		strvec_push(&pack_proc.args, "--quiet");
+	strvec_pushf(&pack_proc.args, "%s/pack/loose", r->objects->odb->path);
+
+	pack_proc.in = -1;
+
+	if (start_command(&pack_proc)) {
+		error(_("failed to start 'git pack-objects' process"));
+		return 1;
+	}
+
+	data.in = xfdopen(pack_proc.in, "w");
+	data.count = 0;
+	data.batch_size = 50000;
+
+	for_each_loose_file_in_objdir(r->objects->odb->path,
+				      write_loose_object_to_stdin,
+				      NULL,
+				      NULL,
+				      &data);
+
+	fclose(data.in);
+
+	if (finish_command(&pack_proc)) {
+		error(_("failed to finish 'git pack-objects' process"));
+		result = 1;
+	}
+
+	return result;
+}
+
+static int maintenance_task_loose_objects(struct maintenance_opts *opts)
+{
+	return prune_packed(opts) || pack_loose(opts);
+}
+
 typedef int maintenance_task_fn(struct maintenance_opts *opts);
 
 /*
@@ -928,6 +1020,7 @@  struct maintenance_task {
 
 enum maintenance_task_label {
 	TASK_PREFETCH,
+	TASK_LOOSE_OBJECTS,
 	TASK_GC,
 	TASK_COMMIT_GRAPH,
 
@@ -940,6 +1033,10 @@  static struct maintenance_task tasks[] = {
 		"prefetch",
 		maintenance_task_prefetch,
 	},
+	[TASK_LOOSE_OBJECTS] = {
+		"loose-objects",
+		maintenance_task_loose_objects,
+	},
 	[TASK_GC] = {
 		"gc",
 		maintenance_task_gc,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 3dd99ef660..8d54f93a10 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -83,4 +83,43 @@  test_expect_success 'prefetch multiple remotes' '
 	git log prefetch/remote2/two
 '
 
+test_expect_success 'loose-objects task' '
+	# Repack everything so we know the state of the object dir
+	git repack -adk &&
+
+	# Hack to stop maintenance from running during "git commit"
+	echo in use >.git/objects/maintenance.lock &&
+
+	# Assuming that "git commit" creates at least one loose object
+	test_commit create-loose-object &&
+	rm .git/objects/maintenance.lock &&
+
+	ls .git/objects >obj-dir-before &&
+	test_file_not_empty obj-dir-before &&
+	ls .git/objects/pack/*.pack >packs-before &&
+	test_line_count = 1 packs-before &&
+
+	# The first run creates a pack-file
+	# but does not delete loose objects.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-between &&
+	test_cmp obj-dir-before obj-dir-between &&
+	ls .git/objects/pack/*.pack >packs-between &&
+	test_line_count = 2 packs-between &&
+	ls .git/objects/pack/loose-*.pack >loose-packs &&
+	test_line_count = 1 loose-packs &&
+
+	# The second run deletes loose objects
+	# but does not create a pack-file.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-after &&
+	cat >expect <<-\EOF &&
+	info
+	pack
+	EOF
+	test_cmp expect obj-dir-after &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_cmp packs-between packs-after
+'
+
 test_done