diff mbox series

[v3,7/9] multi-pack-index: prepare 'repack' subcommand

Message ID b39f90ad09265614efd466fee5089354a193ae8c.1547047269.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Create 'expire' and 'repack' verbs for git-multi-pack-index | expand

Commit Message

ZheNing Hu via GitGitGadget Jan. 9, 2019, 3:21 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 11 +++++++++++
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 11 +++++++++++
 5 files changed, 37 insertions(+), 1 deletion(-)

Comments

SZEDER Gábor Jan. 9, 2019, 3:56 p.m. UTC | #1
On Wed, Jan 09, 2019 at 07:21:17AM -0800, Derrick Stolee via GitGitGadget wrote:
> Introduce a 'repack' subcommand to 'git multi-pack-index' that
> takes a '--batch-size' option. The verb will inspect the

s/verb/subcommand/

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 145de3a46c..d87a2235e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c

> @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
>  		return 1;
>  	}
>  
> +	if (!strcmp(argv[0], "repack"))
> +		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
> +	if (opts.batch_size)
> +		die(_("--batch-size option is only for 'repack' verb"));

Likewise.
Jonathan Tan Jan. 23, 2019, 10:38 p.m. UTC | #2
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 6186c4c936..cc63531cc0 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -36,6 +36,17 @@ expire::
>  	have no objects referenced by the MIDX. Rewrite the MIDX file
>  	afterward to remove all references to these pack-files.
>  
> +repack::
> +	Collect a batch of pack-files whose size are all at most the
> +	size given by --batch-size, but whose sizes sum to larger
> +	than --batch-size. The batch is selected by greedily adding
> +	small pack-files starting with the oldest pack-files that fit
> +	the size. Create a new pack-file containing the objects the
> +	multi-pack-index indexes into those pack-files, and rewrite
> +	the multi-pack-index to contain that pack-file. A later run
> +	of 'git multi-pack-index expire' will delete the pack-files
> +	that were part of this batch.

I see in the subsequent patch that you stop once the batch size is
matched or exceeded - I see that you mention "whose sizes sum to larger
than --batch-size", but this leads me to think that if the total so
happens to not exceed the batch size, don't do anything, but otherwise
repack *all* the small packs together.

I would write this as:

  Create a new packfile containing the objects in the N least-sized
  packfiles referenced by the multi-pack-index, where N is the smallest
  number such that the total size of the packfiles equals or exceeds the
  given batch size. Rewrite the multi-pack-index to reference the new
  packfile instead of the N packfiles. A later run of ...
Derrick Stolee Jan. 24, 2019, 7:36 p.m. UTC | #3
On 1/23/2019 5:38 PM, Jonathan Tan wrote:
>> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
>> index 6186c4c936..cc63531cc0 100644
>> --- a/Documentation/git-multi-pack-index.txt
>> +++ b/Documentation/git-multi-pack-index.txt
>> @@ -36,6 +36,17 @@ expire::
>>   	have no objects referenced by the MIDX. Rewrite the MIDX file
>>   	afterward to remove all references to these pack-files.
>>   
>> +repack::
>> +	Collect a batch of pack-files whose size are all at most the
>> +	size given by --batch-size, but whose sizes sum to larger
>> +	than --batch-size. The batch is selected by greedily adding
>> +	small pack-files starting with the oldest pack-files that fit
>> +	the size. Create a new pack-file containing the objects the
>> +	multi-pack-index indexes into those pack-files, and rewrite
>> +	the multi-pack-index to contain that pack-file. A later run
>> +	of 'git multi-pack-index expire' will delete the pack-files
>> +	that were part of this batch.
> I see in the subsequent patch that you stop once the batch size is
> matched or exceeded - I see that you mention "whose sizes sum to larger
> than --batch-size", but this leads me to think that if the total so
> happens to not exceed the batch size, don't do anything, but otherwise
> repack *all* the small packs together.
>
> I would write this as:
>
>    Create a new packfile containing the objects in the N least-sized
>    packfiles referenced by the multi-pack-index, where N is the smallest
>    number such that the total size of the packfiles equals or exceeds the
>    given batch size. Rewrite the multi-pack-index to reference the new
>    packfile instead of the N packfiles. A later run of ...

Thanks for the suggestion.

It is slightly wrong, in that we don't sort by size. Instead we sort by 
modified time. That makes is a little complicated, but I'll give it 
another shot using your framing:

         Create a new pack-file containing objects in small pack-files
         referenced by the multi-pack-index. Select the pack-files by
         examining packs from oldest-to-newest, adding a pack if its
         size is below the batch size. Stop adding packs when the sum
         of sizes of the added packs is above the batch size. If the
         total size does not reach the batch size, then do nothing.
         Rewrite the multi-pack-index to reference the new pack-file.
         A later run of 'git multi-pack-index expire' will delete the
         pack-files that were part of this batch.

-Stolee
Jonathan Tan Jan. 24, 2019, 9:38 p.m. UTC | #4
> Thanks for the suggestion.
> 
> It is slightly wrong, in that we don't sort by size. Instead we sort by 
> modified time. That makes is a little complicated, but I'll give it 
> another shot using your framing:
> 
>          Create a new pack-file containing objects in small pack-files
>          referenced by the multi-pack-index. Select the pack-files by
>          examining packs from oldest-to-newest, adding a pack if its
>          size is below the batch size. Stop adding packs when the sum
>          of sizes of the added packs is above the batch size. If the
>          total size does not reach the batch size, then do nothing.
>          Rewrite the multi-pack-index to reference the new pack-file.
>          A later run of 'git multi-pack-index expire' will delete the
>          pack-files that were part of this batch.

Thanks, this looks good.
diff mbox series

Patch

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..cc63531cc0 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@  expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Collect a batch of pack-files whose size are all at most the
+	size given by --batch-size, but whose sizes sum to larger
+	than --batch-size. The batch is selected by greedily adding
+	small pack-files starting with the oldest pack-files that fit
+	the size. Create a new pack-file containing the objects the
+	multi-pack-index indexes into those pack-files, and rewrite
+	the multi-pack-index to contain that pack-file. A later run
+	of 'git multi-pack-index expire' will delete the pack-files
+	that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@ 
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@  int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@  int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' verb"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 6ccbec3f19..30ff4430ab 100644
--- a/midx.c
+++ b/midx.c
@@ -1106,3 +1106,8 @@  int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@  int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index f55a60a89c..d64767600a 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -413,4 +413,15 @@  test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done