[4/5] multi-pack-index: prepare 'repack' verb
diff mbox series

Message ID 72b213959171af3bfe4d849b925920ddbfb3d4b7.1544465177.git.gitgitgadget@gmail.com
State New
Headers show
Series
  • Create 'expire' and 'repack' verbs for git-multi-pack-index
Related show

Commit Message

Elijah Newren via GitGitGadget Dec. 10, 2018, 6:06 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' verb to 'git multi-pack-index' that takes a
'--batch-size' option. The verb will inspect the multi-pack-index
for referenced pack-files whose size is smaller than the batch
size, until collecting a list of pack-files whose sizes sum to
larger than the batch size. Then, a new pack-file will be created
containing the objects from those pack-files that are referenced
by the multi-pack-index. The resulting pack is likely to actually
be smaller than the batch size due to compression and the fact
that there may be objects in the pack-files that have duplicate
copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 12 ++++++++++++
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 11 +++++++++++
 5 files changed, 38 insertions(+), 1 deletion(-)

Comments

Stefan Beller Dec. 11, 2018, 1:54 a.m. UTC | #1
On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> In an environment where the multi-pack-index is useful, it is due
> to many pack-files and an inability to repack the object store
> into a single pack-file. However, it is likely that many of these
> pack-files are rather small, and could be repacked into a slightly
> larger pack-file without too much effort. It may also be important
> to ensure the object store is highly available and the repack
> operation does not interrupt concurrent git commands.
>
> Introduce a 'repack' verb to 'git multi-pack-index' that takes a
> '--batch-size' option. The verb will inspect the multi-pack-index
> for referenced pack-files whose size is smaller than the batch
> size, until collecting a list of pack-files whose sizes sum to
> larger than the batch size. Then, a new pack-file will be created
> containing the objects from those pack-files that are referenced
> by the multi-pack-index. The resulting pack is likely to actually
> be smaller than the batch size due to compression and the fact
> that there may be objects in the pack-files that have duplicate
> copies in other pack-files.

This highlights an optimization problem: How do we pick the
batches optimally?
Ideally we'd pick packs that have an overlap of many
same objects to dedup them completely, next best would
be to have objects that are very similar, such that they delta
very well.
(Assuming that the sum of the resulting pack sizes is a metric
we'd want to optimize for eventually)

For now it seems we just take a random cut of "small" packs.

>
> The current change introduces the command-line arguments, and we
> add a test that ensures we parse these options properly. Since
> we specify a small batch size, we will guarantee that future
> implementations do not change the list of pack-files.

This is another clever trick that makes the test correct
despite no implementation yet. :-)

> +repack::
> +       When given as the verb, collect a batch of pack-files whose
> +       size are all at most the size given by --batch-size,

okay.

>  but
> +       whose sizes sum to larger than --batch-size.

or less if there are not enough packs.

Now it would be interesting if we can specify an upper bound
(e.g. my laptop has 8GB of ram, can I use this incremental
repacking optimally by telling git to make batches of at most
7.5G), the answer seems to follow...

>   The batch is
> +       selected by greedily adding small pack-files starting with
> +       the oldest pack-files that fit the size. Create a new pack-
> +       file containing the objects the multi-pack-index indexes
> +       into thos pack-files, and rewrite the multi-pack-index to

those

> +       contain that pack-file. A later run of 'git multi-pack-index
> +       expire' will delete the pack-files that were part of this
> +       batch.

... but the optimization seems to be rather about getting rid
of the oldest packs first instead of getting as close to the batch
size. (e.g. another way to look at this is to "find the permutation
of all packs that (each are smaller than batch size), but in sum
are the smallest threshold above the batch size).

I guess that the strategy of picking the oldest is just easiest
to implement and should be sufficient for now, but memory
bounds might be interesting to keep in mind, just as the
optimal packing from above.
Derrick Stolee Dec. 11, 2018, 12:45 p.m. UTC | #2
On 12/10/2018 8:54 PM, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> In an environment where the multi-pack-index is useful, it is due
>> to many pack-files and an inability to repack the object store
>> into a single pack-file. However, it is likely that many of these
>> pack-files are rather small, and could be repacked into a slightly
>> larger pack-file without too much effort. It may also be important
>> to ensure the object store is highly available and the repack
>> operation does not interrupt concurrent git commands.
>>
>> Introduce a 'repack' verb to 'git multi-pack-index' that takes a
>> '--batch-size' option. The verb will inspect the multi-pack-index
>> for referenced pack-files whose size is smaller than the batch
>> size, until collecting a list of pack-files whose sizes sum to
>> larger than the batch size. Then, a new pack-file will be created
>> containing the objects from those pack-files that are referenced
>> by the multi-pack-index. The resulting pack is likely to actually
>> be smaller than the batch size due to compression and the fact
>> that there may be objects in the pack-files that have duplicate
>> copies in other pack-files.
> This highlights an optimization problem: How do we pick the
> batches optimally?
> Ideally we'd pick packs that have an overlap of many
> same objects to dedup them completely, next best would
> be to have objects that are very similar, such that they delta
> very well.
> (Assuming that the sum of the resulting pack sizes is a metric
> we'd want to optimize for eventually)
>
> For now it seems we just take a random cut of "small" packs.
>
>> The current change introduces the command-line arguments, and we
>> add a test that ensures we parse these options properly. Since
>> we specify a small batch size, we will guarantee that future
>> implementations do not change the list of pack-files.
> This is another clever trick that makes the test correct
> despite no implementation yet. :-)
>
>> +repack::
>> +       When given as the verb, collect a batch of pack-files whose
>> +       size are all at most the size given by --batch-size,
> okay.
>
>>   but
>> +       whose sizes sum to larger than --batch-size.
> or less if there are not enough packs.

If there are not enough packs to reach the batch size, then we do nothing.

> Now it would be interesting if we can specify an upper bound
> (e.g. my laptop has 8GB of ram, can I use this incremental
> repacking optimally by telling git to make batches of at most
> 7.5G), the answer seems to follow...

Well, this gets us to the "unsigned long" problem and how it pervades 
the OPT_MAGNITUDE() code and things that use that kind of parameter. 
This means that Windows users can specify a maximum of (1 << 32) - 1. 
This size is intended to be large enough to make a reasonable change in 
the pack organization without rewriting the entire repo (e.g. 1GB). If 
there is another word than "repack" that means "collect objects from a 
subset of pack-files and place them into a new pack-file, doing a small 
amount of redeltification in the process" then I am open to suggestions.

I tried (briefly) to fix this dependence, but it got too large for me to 
handle at the same time as this change. I'll consider revisiting it 
again later.

>>    The batch is
>> +       selected by greedily adding small pack-files starting with
>> +       the oldest pack-files that fit the size. Create a new pack-
>> +       file containing the objects the multi-pack-index indexes
>> +       into thos pack-files, and rewrite the multi-pack-index to
> those
>
>> +       contain that pack-file. A later run of 'git multi-pack-index
>> +       expire' will delete the pack-files that were part of this
>> +       batch.
> ... but the optimization seems to be rather about getting rid
> of the oldest packs first instead of getting as close to the batch
> size. (e.g. another way to look at this is to "find the permutation
> of all packs that (each are smaller than batch size), but in sum
> are the smallest threshold above the batch size).

You are describing the subset-sum problem, with an additional 
optimization component. While there are dynamic programming approaches 
that are usually effective (if the sum is small), this problem is 
NP-complete, and hence could lead to complications.

> I guess that the strategy of picking the oldest is just easiest
> to implement and should be sufficient for now, but memory
> bounds might be interesting to keep in mind, just as the
> optimal packing from above.

It is easy to implement, and fast. Further, we do have a heuristic that 
the pack modified time correlates with the time the objects were 
introduced to the repository and hence may compress well when placed 
together.

Thanks,

-Stolee

Patch
diff mbox series

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 822d83c845..e43e7da71e 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -39,6 +39,18 @@  expire::
 	Rewrite the MIDX file afterward to remove all references to
 	these pack-files.
 
+repack::
+	When given as the verb, collect a batch of pack-files whose
+	size are all at most the size given by --batch-size, but
+	whose sizes sum to larger than --batch-size. The batch is
+	selected by greedily adding small pack-files starting with
+	the oldest pack-files that fit the size. Create a new pack-
+	file containing the objects the multi-pack-index indexes
+	into thos pack-files, and rewrite the multi-pack-index to
+	contain that pack-file. A later run of 'git multi-pack-index
+	expire' will delete the pack-files that were part of this
+	batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@ 
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@  int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@  int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' verb"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 50e4cd7270..4caf148464 100644
--- a/midx.c
+++ b/midx.c
@@ -1115,3 +1115,8 @@  int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@  int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 210279a3cf..c23e930a5d 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -410,4 +410,15 @@  test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack does not create any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done