diff mbox series

[v2,12/17] repack: add --path-walk option

Message ID 834c9ea270932e3d25e2018cca4e74831345f592.1729431810.git.gitgitgadget@gmail.com (mailing list archive)
State New
Headers show
Series pack-objects: add --path-walk option for better deltas | expand

Commit Message

Derrick Stolee Oct. 20, 2024, 1:43 p.m. UTC
From: Derrick Stolee <stolee@gmail.com>

Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

In my copy of the Git repository, the new tests in p5313 show these
results:

Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           27.88(150.23+2.70)
5313.11: repack size                               228.2M
5313.12: repack with --path-walk          134.59(148.77+0.81)
5313.13: repack size with --path-walk              209.7M

Note that the 'git pack-objects --path-walk' feature is not integrated
with threads. Look forward to a future change that will introduce
threading to improve the time performance of this feature with
equivalent space performance.

For the microsoft/fluentui repo [1] had some interesting aspects for the
previous tests in p5313, so here are the repack results:

Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           91.76(680.94+2.48)
5313.11: repack size                               439.1M
5313.12: repack with --path-walk          110.35(130.46+0.74)
5313.13: repack size with --path-walk              155.3M

[1] https://github.com/microsoft/fluentui

Here, we see the significant improvement of a full repack using this
strategy. The name-hash collisions in this repo cause the space
problems. Those collisions also cause the repack command to spend a lot
of cycles trying to find delta bases among files that are not actually
very similar, so the lack of threading with the --path-walk feature is
less pronounced in the process time.

For the Linux kernel repository, we have these stats:

Test                                      this tree
---------------------------------------------------------------
5313.10: repack                           553.61(1929.41+30.31)
5313.11: repack size                                 2.5G
5313.12: repack with --path-walk          1777.63(2044.16+7.47)
5313.13: repack size with --path-walk                2.5G

This demonstrates that the --path-walk feature does not always present
measurable improvements, especially in cases where the name-hash has
very few collisions.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-repack.txt | 17 ++++++++++++++++-
 builtin/repack.c             |  9 ++++++++-
 t/perf/p5313-pack-objects.sh | 18 ++++++++++++++++++
 3 files changed, 42 insertions(+), 2 deletions(-)
diff mbox series

Patch

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..4ec59cd27b1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@  git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+	[--write-midx] [--path-walk]
 
 DESCRIPTION
 -----------
@@ -249,6 +251,19 @@  linkgit:git-multi-pack-index[1]).
 	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
 	containing the non-redundant packs.
 
+--path-walk::
+	This option passes the `--path-walk` option to the underlying
+	`git pack-options` process (see linkgit:git-pack-objects[1]).
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`
+	or `--filter`.
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f0856..af3f218ced7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -39,7 +39,9 @@  static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
-	N_("git repack [<options>]"),
+	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+	   "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+	   "[--write-midx] [--full-path-walk]"),
 	NULL
 };
 
@@ -58,6 +60,7 @@  struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int path_walk;
 	struct list_objects_filter_options filter_options;
 };
 
@@ -289,6 +292,8 @@  static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
 		strvec_pushf(&cmd->args, "--no-reuse-object");
+	if (args->path_walk)
+		strvec_pushf(&cmd->args, "--path-walk");
 	if (args->local)
 		strvec_push(&cmd->args,  "--local");
 	if (args->quiet)
@@ -1182,6 +1187,8 @@  int cmd_repack(int argc,
 				N_("pass --no-reuse-delta to git-pack-objects")),
 		OPT_BOOL('F', NULL, &po_args.no_reuse_object,
 				N_("pass --no-reuse-object to git-pack-objects")),
+		OPT_BOOL(0, "path-walk", &po_args.path_walk,
+				N_("pass --path-walk to git-pack-objects")),
 		OPT_NEGBIT('n', NULL, &run_update_server_info,
 				N_("do not run git-update-server-info"), 1),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 840075f5691..b588066ddb0 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -56,4 +56,22 @@  test_size 'big pack size with --path-walk' '
 	test_file_size out
 '
 
+test_perf 'repack' '
+	git repack -adf
+'
+
+test_size 'repack size' '
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
+'
+
+test_perf 'repack with --path-walk' '
+	git repack -adf --path-walk
+'
+
+test_size 'repack size with --path-walk' '
+	pack=$(ls .git/objects/pack/pack-*.pack) &&
+	test_file_size "$pack"
+'
+
 test_done