diff mbox series

[5/5] midx: implement midx_repack()

Message ID 41ef671ec8361a9635dc78c078d2d84e9d985236.1544465177.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Create 'expire' and 'repack' verbs for git-multi-pack-index | expand

Commit Message

Linus Arver via GitGitGadget Dec. 10, 2018, 6:06 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referencd objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 109 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  25 +++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)

Comments

Stefan Beller Dec. 11, 2018, 2:32 a.m. UTC | #1
On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> To repack using a multi-pack-index, first sort all pack-files by
> their modified time. Second, walk those pack-files from oldest
> to newest, adding the packs to a list if they are smaller than the
> given pack-size. Finally, collect the objects from the multi-pack-
> index that are in those packs and send them to 'git pack-objects'.

Makes sense.

With this operation we only coalesce some packfiles into a new
pack file. So to perform the "complete" repack this command
has to be run repeatedly until there is at most one packfile
left that is smaller than batch size.

Imagine the following scenario:

  There are 5 packfiles A, B, C, D, E,
  created last Monday thru Friday (A is oldest, E youngest).
  The sizes are [A=4, B=6, C=5, D=5, E=4]

  You'd issue a repack with batch size=10, such that
  A and B would be repacked into F, which is
  created today, size is less or equal than 10.

  You issue another repack tomorrow, which then would
  coalesce C and D to G, which is
  dated tomorrow, size is less or equal to 10 as well.

  You issue a third repack, which then takes E
  (as it is the oldest) and would probably find F as the
  next oldest (assuming it is less than 10), to repack
  into H.

  H is then compromised of A, B and E, and G is C+D.

In a way these repacks, always picking up the oldest,
sound like you "roll forward" objects into new packs.
As the new packs are newest (we have no packs from
the future), we'd cycle through different packs to look at
for packing on each repacking.

It is however more likely that content is more similar
on a temporal basis. (e.g. I am boldly claiming that
[ABC, DE] would take less space than [ABE, CD]
as produced above).

(The obvious solution to this hypothetical would be
to backdate the resulting pack to the youngest pack
that is input to the new pack, but I dislike fudging with
the time a file is created/touched, so let's not go there)

Would the object count make sense as input instead of
the pack date?


> While first designing a 'git multi-pack-index repack' operation, I
> started by collecting the batches based on the size of the objects
> instead of the size of the pack-files. This allows repacking a
> large pack-file that has very few referencd objects. However, this

referenced

> came at a significant cost of parsing pack-files instead of simply
> reading the multi-pack-index and getting the file information for
> the pack-files. This object-size idea could be a direction for
> future expansion in this area.

Ah, that also explains why the above idea is toast.

Would it make sense to extend or annotate the midx file
to give hints at which packs are easy to combine?

I guess such an "annotation worker" could run in a separate
thread / pool with the lowest priority as this seems like a
decent fallback for the lack of any better information how
to pick the packfiles.

Thanks,
Stefan
Derrick Stolee Dec. 11, 2018, 1 p.m. UTC | #2
On 12/10/2018 9:32 PM, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> To repack using a multi-pack-index, first sort all pack-files by
>> their modified time. Second, walk those pack-files from oldest
>> to newest, adding the packs to a list if they are smaller than the
>> given pack-size. Finally, collect the objects from the multi-pack-
>> index that are in those packs and send them to 'git pack-objects'.
> Makes sense.
>
> With this operation we only coalesce some packfiles into a new
> pack file. So to perform the "complete" repack this command
> has to be run repeatedly until there is at most one packfile
> left that is smaller than batch size.

Well, the batch size essentially means "If a pack-file is larger than 
<size>, then leave it be. I'm happy with packs that large." This assumes 
that the reason the pack is that large is because it was already 
combined with other packs or contains a lot of objects.

>
> Imagine the following scenario:
>
>    There are 5 packfiles A, B, C, D, E,
>    created last Monday thru Friday (A is oldest, E youngest).
>    The sizes are [A=4, B=6, C=5, D=5, E=4]
>
>    You'd issue a repack with batch size=10, such that
>    A and B would be repacked into F, which is
>    created today, size is less or equal than 10.
>
>    You issue another repack tomorrow, which then would
>    coalesce C and D to G, which is
>    dated tomorrow, size is less or equal to 10 as well.
>
>    You issue a third repack, which then takes E
>    (as it is the oldest) and would probably find F as the
>    next oldest (assuming it is less than 10), to repack
>    into H.
>
>    H is then compromised of A, B and E, and G is C+D.
>
> In a way these repacks, always picking up the oldest,
> sound like you "roll forward" objects into new packs.
> As the new packs are newest (we have no packs from
> the future), we'd cycle through different packs to look at
> for packing on each repacking.
>
> It is however more likely that content is more similar
> on a temporal basis. (e.g. I am boldly claiming that
> [ABC, DE] would take less space than [ABE, CD]
> as produced above).
>
> (The obvious solution to this hypothetical would be
> to backdate the resulting pack to the youngest pack
> that is input to the new pack, but I dislike fudging with
> the time a file is created/touched, so let's not go there)

This raises a good point about what happens when we "roll over" into the 
"repacked" packs.

I'm not claiming that this is an optimal way to save space, but is a way 
to incrementally collect small packs into slightly larger packs, all 
while not interrupting concurrent Git commands. Reducing pack count 
improves data locality, which is my goal here. In our environment, we do 
see reduced space as a benefit, even if it is not optimal.

>
> Would the object count make sense as input instead of
> the pack date?
>
>
>> While first designing a 'git multi-pack-index repack' operation, I
>> started by collecting the batches based on the size of the objects
>> instead of the size of the pack-files. This allows repacking a
>> large pack-file that has very few referencd objects. However, this
> referenced
>
>> came at a significant cost of parsing pack-files instead of simply
>> reading the multi-pack-index and getting the file information for
>> the pack-files. This object-size idea could be a direction for
>> future expansion in this area.
> Ah, that also explains why the above idea is toast.
>
> Would it make sense to extend or annotate the midx file
> to give hints at which packs are easy to combine?
>
> I guess such an "annotation worker" could run in a separate
> thread / pool with the lowest priority as this seems like a
> decent fallback for the lack of any better information how
> to pick the packfiles.

One idea I had earlier (and is in 
Documentation/technical/multi-pack-index.txt) is to have the midx track 
metadata about pack-files. We could avoid this "rollover" problem by 
tracking which packs were repacked using this mechanism. This could 
create a "pack generation" value, and we could collect a batch of packs 
that have the same generation. This does seem a bit overcomplicated for 
the potential benefit, and could waste better use of that metadata 
concept. For instance, we could use the metadata to track the 
information given by ".keep" and ".promisor" files.

Thanks,

-Stolee
Junio C Hamano Dec. 12, 2018, 7:40 a.m. UTC | #3
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&

awk is capable of remembering $5 from each line of input, sorting
them and picking the second smallest element from it, isn't it?

> +		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&

... or incrementing the number by one, before reporting, for that
matter.
Junio C Hamano Dec. 13, 2018, 4:23 a.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
>
> awk is capable of remembering $5 from each line of input, sorting
> them and picking the second smallest element from it, isn't it?
>
>> +		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
>
> ... or incrementing the number by one, before reporting, for that
> matter.


Oops, please disregard.  My awk is rusty.  Unless we are willing to
rely on gawk, which we are not, it is not practical to sort inside
awk.
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index 4caf148464..3718e78132 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@ 
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1116,7 +1117,113 @@  int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+	timestamp_t mtime;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct time_and_id *a, *b;
+
+	a = (const struct time_and_id *)a_;
+	b = (const struct time_and_id *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
 	return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct time_and_id *pack_ti;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_ti[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_ti[i].mtime = m->packs[i]->mtime;
+	}
+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_ti[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+
+		if (!p)
+			continue;
+		if (p->pack_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += p->pack_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	free(pack_ti);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c23e930a5d..3cc9c918d5 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -421,4 +421,29 @@  test_expect_success 'repack does not create any packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 5 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list
+	)
+'
+
 test_done