pack-objects: introduce --exclude-delta=<pattern> option

Message ID pull.1392.git.1666453564661.gitgitgadget@gmail.com (mailing list archive)
State New, archived

Commit Message

ZheNing Hu Oct. 22, 2022, 3:46 p.m. UTC
From: ZheNing Hu <adlternative@gmail.com>

The server uses delta compression during git clone to reduce
the amount of data transferred over the network, but delta
compression for large binary blobs often does not reduce
storage size significantly and wastes a lot of CPU. Git now
disables delta compression for objects that meet these conditions:

1. files that have -delta set in .gitattributes
2. files whose size exceeds big_file_threshold

However, for 1, .gitattributes must be set manually by the user,
which most users never do, and it is not something that can be
adjusted on the server side. For 2, big_file_threshold defaults
to 512MB, so many binary files smaller than that are uselessly
delta-compressed, and this gets worse if the server raises
big_file_threshold.

Therefore, we need a way to skip delta compression of some files
on the server. Introduce the `--exclude-delta=<pattern>` option,
which disables delta compression for blobs whose paths match the
pattern.

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
    pack-objects: introduce --exclude-delta=<pattern> option
    
    While analyzing some repositories using git filter-repo --analyze, I
    noticed that many huge binaries in the repositories were
    delta-compressed without much reduction in size.
    
    $ cat .git/filter-repo/analysis/path-all-sizes.txt | more
    === All paths by reverse accumulated size ===
    Format: unpacked size, packed size, date deleted, path name
        23816778   23765921 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-linux-x86_64.tar.gz
        22504398   22445676 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-el8-aarch64.tar.gz
        11726471    6424233 2022-08-09 managed/yba-installer/yba-installer_linux_amd64
       294644800    5794201            src/yb/master/catalog_manager.cc
         2912780    2872186            docs/static/images/yp/tables-view-ycql.png
         2992192    2634232            docs/static/images/yb-cloud/cloud-clusters-backups.png
         2757095    2501915            docs/static/images/deploy/aws/aws-cf-configure-options.png
    ...
    
    The current ways to avoid delta compression are not well suited to
    git servers. First, files that exceed big_file_threshold are not
    delta-compressed, but the analysis above shows that many big binary
    files do not exceed big_file_threshold (512MB by default).
    Second, these repositories have no .gitattributes that disables delta
    compression for such files, and we cannot really ask repository
    administrators to add one manually.
    
    But we can also see that the large files in these repositories often
    share some characteristics: they end in ".tar.gz" or ".png". So
    perhaps we can take advantage of this and disable delta compression
    on the server for some common binary file types.
    
    This is currently implemented as the command-line option
    --exclude-delta=<pattern>. But maybe we can also try passing it
    through git config.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1392%2Fadlternative%2Fadl%2Fpack-object-no-try-delta-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1392/adlternative/adl/pack-object-no-try-delta-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1392

 Documentation/git-pack-objects.txt |  6 +++++-
 builtin/pack-objects.c             | 28 +++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 2 deletions(-)


base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886

Comments

Junio C Hamano Oct. 22, 2022, 5 p.m. UTC | #1
"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> The server uses delta compression during git clone to reduce
> the amount of data transferred over the network, but delta
> compression for large binary blobs often does not reduce
> storage size significantly and wastes a lot of CPU. Git now
> disables delta compression for objects that meet these conditions:
>
> 1. files that have -delta set in .gitattributes
> 2. files whose size exceeds big_file_threshold
>
> However, for 1, .gitattributes must be set manually by the user,
> which most users never do, and it is not something that can be
> adjusted on the server side. For 2, big_file_threshold defaults
> to 512MB, so many binary files smaller than that are uselessly
> delta-compressed, and this gets worse if the server raises
> big_file_threshold.

Who are you trying to help, though?  The user has to somehow
manually cause the --exclude-delta=<pattern> to be added at the
server side, so I do not see the new feature as correcting the
weakness you perceive in the approach to mark the undeltifiable
blobs in the attributes system at all.  Does this feature assume
that the server operator knows better than the project that can
maintain their own .gitattributes in tree?  If so, the server
operator can already use the <bare-repository>/info/attributes
file to achieve that, no?

Now hosting sites may not give hosted projects flexibility to
configure their own server side repositories, like their "config"
and "info/attributes", but that limitation would equally apply to
what command line options pack-objects runs with.

So, again, it is not clear who this patch is trying to help and how.
If we assume that the hosting operators can give project owners more
control of how the server side is configured, then we can do that
already without a new option, no?

IOW

> Therefore, we need a way to be able to actively skip the delta
> compression of some files on the server.

I do not quite follow the "Therefore" here.
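
For reference, the info/attributes route mentioned above could look
like the following minimal sketch. It relies on the existing "delta"
attribute; the two patterns are only illustrative, taken from the file
types named in the cover letter, and a real server would choose its own.

```
# <bare-repository>/info/attributes
*.tar.gz -delta
*.png    -delta
```

With such a file in place, no_try_delta() already returns 1 for the
matching paths through its existing attribute check, without any new
command line option.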

Patch

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932ca..92cfee83df5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,7 +13,7 @@  SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-	[--cruft] [--cruft-expiration=<time>]
+	[--cruft] [--cruft-expiration=<time>] [--exclude-delta=<pattern>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -221,6 +221,10 @@  depth is 4095.
 	This flag tells the command not to reuse existing deltas
 	but compute them from scratch.
 
+--exclude-delta=<pattern>::
+	Do not attempt delta compression for blobs whose paths match
+	`<pattern>`. See linkgit:gitignore[5] for the pattern syntax.
+
 --no-reuse-object::
 	This flag tells the command not to reuse existing object data at all,
 	including non deltified object, forcing recompression of everything.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3658c05cafc..ab9cff98e3a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -272,6 +272,8 @@  static struct commit **indexed_commits;
 static unsigned int indexed_commits_nr;
 static unsigned int indexed_commits_alloc;
 
+static struct pattern_list *exclude_delta_patterns;
+
 static void index_commit_for_bitmap(struct commit *commit)
 {
 	if (indexed_commits_nr >= indexed_commits_alloc) {
@@ -1315,13 +1317,20 @@  static void write_pack_file(void)
 static int no_try_delta(const char *path)
 {
 	static struct attr_check *check;
+	int dtype = DT_REG;
 
 	if (!check)
 		check = attr_check_initl("delta", NULL);
 	git_check_attr(the_repository->index, path, check);
 	if (ATTR_FALSE(check->items[0].value))
 		return 1;
-	return 0;
+
+	return exclude_delta_patterns &&
+		path_matches_pattern_list(path,
+					  strlen(path),
+					  path, &dtype,
+					  exclude_delta_patterns,
+					  the_repository->index) == MATCHED;
 }
 
 /*
@@ -4149,6 +4158,19 @@  static int option_parse_cruft_expiration(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_exclude_delta(const struct option *opt,
+					 const char *arg, int unset)
+{
+	BUG_ON_OPT_NEG(unset);
+
+	if (!exclude_delta_patterns)
+		exclude_delta_patterns = xcalloc(1, sizeof(*exclude_delta_patterns));
+
+	if (arg)
+		add_pattern(arg, "", 0, exclude_delta_patterns, 0);
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -4242,6 +4264,9 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
 		  N_("expire cruft objects older than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
+		OPT_CALLBACK_F(0, "exclude-delta", NULL, N_("pattern"),
+		  N_("disable delta compression for files matching pattern"),
+		  PARSE_OPT_NONEG, option_parse_exclude_delta),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4514,6 +4539,7 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 cleanup:
 	strvec_clear(&rp);
+	FREE_AND_NULL(exclude_delta_patterns);
 
 	return 0;
 }