diff mbox series

[v3] packfile: freshen the mtime of packfile by configuration

Message ID pull.1043.v3.git.git.1626724399377.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v3] packfile: freshen the mtime of packfile by configuration | expand

Commit Message

孙超 (Sun Chao) July 19, 2021, 7:53 p.m. UTC
From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects,
2014-10-15) avoids writing existing objects by freshening their
mtime (especially of the packfiles that contain them) in order
to aid correct caching, so that processes like find_lru_pack
can make good decisions. However, this is unfriendly to
incremental backup jobs or to services that rely on a cached
file system when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree'
to create the same commit with the same tree and the same
environment variables such as GIT_COMMITTER_DATE and
GIT_AUTHOR_DATE, we can notice that the '.pack' file's mtime
changed. Git servers that mount the same NFS disk will then
re-sync the '.pack' files to the cached file system, which
slows down git commands.

So add core.freshenPackfiles to indicate whether or not packs
can be freshened; turning off this option can speed up the
execution of some commands on servers which use an NFS disk
instead of a local disk.
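
With the proposed patch applied, a server operator would turn freshening
off with something like the following (the option name comes from this
patch; it is not part of released Git):

```
git config core.freshenPackfiles false
```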

Signed-off-by: Sun Chao <16657101987@163.com>
---
    packfile: freshen the mtime of packfile by configuration
    
    Commit 33d4221c79 (write_sha1_file: freshen existing objects,
    2014-10-15) avoids writing existing objects by freshening their mtime
    (especially of the packfiles that contain them) in order to aid correct
    caching, so that processes like find_lru_pack can make good decisions.
    However, this is unfriendly to incremental backup jobs or to services
    that rely on a cached file system when large '.pack' files exist.
    
    For example, after packing all objects, if we use 'write-tree' to create
    the same commit with the same tree and the same environment variables
    such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can notice that the
    '.pack' file's mtime changed. Git servers that mount the same NFS disk
    will then re-sync the '.pack' files to the cached file system, which
    slows down git commands.
    
    Here is the description of the cached file system for the NFS client,
    from
    https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system:
    
    3. To ensure that the cached directories and files are kept up to date, 
    CacheFS periodically checks the consistency of files stored in the cache.
    It does this by comparing the current modification time to the previous
    modification time.
    
    4. If the modification times are different, all data and attributes
    for the directory or file are purged from the cache, and new data and
    attributes are retrieved from the back file system.
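
The consistency check described above boils down to an mtime comparison.
A minimal shell sketch of that check, using throwaway placeholder files
rather than real git packs:

```shell
# Model of a CacheFS-style staleness check: the client compares the
# current mtime of the backing file against the mtime it cached.
tmp=$(mktemp -d)
pack="$tmp/pack-placeholder.pack"   # stands in for a real packfile
marker="$tmp/cached-mtime"          # stands in for the cached mtime

echo "pack data" >"$pack"
touch -r "$pack" "$marker"          # cache remembers the pack's mtime

sleep 1
touch "$pack"                       # what freshen_file() does via utime()

# "-nt" is true when $pack is strictly newer than $marker, i.e. the
# cached data would be purged and re-fetched from the back file system.
if [ "$pack" -nt "$marker" ]; then
	echo "stale: re-sync the pack"
else
	echo "fresh: keep cached copy"
fi
rm -rf "$tmp"
```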
    
    
    So add core.freshenPackfiles to indicate whether or not packs can be
    freshened; turning off this option can speed up the execution of some
    commands on servers which use an NFS disk instead of a local disk.
    
    Signed-off-by: Sun Chao <16657101987@163.com>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v3
Pull-Request: https://github.com/git/git/pull/1043

Range-diff vs v2:

 1:  943e31e8587 ! 1:  16c68923bea packfile: freshen the mtime of packfile by configuration
     @@ Commit message
          mtime (especially the packfiles contains them) in order to
          aid the correct caching, and some process like find_lru_pack
          can make good decision. However, this is unfriendly to
     -    incremental backup jobs or services rely on file system
     -    cache when there are large '.pack' files exists.
     +    incremental backup jobs or services rely on cached file system
     +    when there are large '.pack' files exists.
      
          For example, after packed all objects, use 'write-tree' to
          create same commit with the same tree and same environments
          such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
     -    notice the '.pack' file's mtime changed, and '.idx' file not.
     +    notice the '.pack' file's mtime changed. Git servers
     +    that mount the same NFS disk will re-sync the '.pack' files
     +    to cached file system which will slow the git commands.
      
     -    If we freshen the mtime of packfile by updating another
     -    file instead of '.pack' file e.g. a empty '.bump' file,
     -    when we need to check the mtime of packfile, get it from
     -    another file instead. Large git repository may contains
     -    large '.pack' files, and we can use smaller files even empty
     -    file to do the mtime get/set operation, this can avoid
     -    file system cache re-sync large '.pack' files again and
     -    then speed up most git commands.
     +    So if add core.freshenPackfiles to indicate whether or not
     +    packs can be freshened, turning off this option on some
     +    servers can speed up the execution of some commands on servers
     +    which use NFS disk instead of local disk.
      
          Signed-off-by: Sun Chao <16657101987@163.com>
      
     @@ Documentation/config/core.txt: the largest projects.  You probably do not need t
       +
       Common unit suffixes of 'k', 'm', or 'g' are supported.
       
     -+core.packMtimeSuffix::
     ++core.freshenPackFiles::
       +	Normally we avoid writing an existing object by freshening the mtime
      +	of the *.pack file which contains it in order to aid some processes
     -+	such like prune. Use different file instead of *.pack file will
     -+	avoid file system cache re-sync the large packfiles, and consequently
     -+	make git commands faster.
      ++	such as prune. Turning off this option on some servers can speed
      ++	up the execution of some commands like 'git-upload-pack' (e.g. some
      ++	servers that mount the same NFS disk will re-sync the *.pack files
      ++	to the cached file system if the mtime changes).
      ++
     -+The default is 'pack' which means the *.pack file will be freshened by
     -+default. You can configure a different suffix to use, the file with the
     -+suffix will be created automatically, it's better not using any known
     -+suffix such like 'idx', 'keep', 'promisor'.
      ++The default is true, which means the *.pack file will be freshened if
      ++we want to write an existing object within it.
      +
       core.deltaBaseCacheLimit::
       	Maximum number of bytes per thread to reserve for caching base objects
       	that may be referenced by multiple deltified objects.  By storing the
      
     - ## builtin/index-pack.c ##
     -@@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
     - 	free(sorted_by_pos);
     - }
     - 
     --static const char *derive_filename(const char *pack_name, const char *strip,
     --				   const char *suffix, struct strbuf *buf)
     --{
     --	size_t len;
     --	if (!strip_suffix(pack_name, strip, &len) || !len ||
     --	    pack_name[len - 1] != '.')
     --		die(_("packfile name '%s' does not end with '.%s'"),
     --		    pack_name, strip);
     --	strbuf_add(buf, pack_name, len);
     --	strbuf_addstr(buf, suffix);
     --	return buf->buf;
     --}
     --
     - static void write_special_file(const char *suffix, const char *msg,
     - 			       const char *pack_name, const unsigned char *hash,
     - 			       const char **report)
     -@@ builtin/index-pack.c: static void write_special_file(const char *suffix, const char *msg,
     - 	int msg_len = strlen(msg);
     - 
     - 	if (pack_name)
     --		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
     -+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
     - 	else
     - 		filename = odb_pack_name(&name_buf, hash, suffix);
     - 
     -@@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char *prefix)
     - 	if (from_stdin && hash_algo)
     - 		die(_("--object-format cannot be used with --stdin"));
     - 	if (!index_name && pack_name)
     --		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
     -+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);
     - 
     - 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
     - 	if (rev_index) {
     - 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
     - 		if (index_name)
     --			rev_index_name = derive_filename(index_name,
     -+			rev_index_name = derive_pack_filename(index_name,
     - 							 "idx", "rev",
     - 							 &rev_index_name_buf);
     - 	}
     -
       ## cache.h ##
      @@ cache.h: extern size_t packed_git_limit;
       extern size_t delta_base_cache_limit;
       extern unsigned long big_file_threshold;
       extern unsigned long pack_size_limit_cfg;
     -+extern const char *pack_mtime_suffix;
     ++extern int core_freshen_packfiles;
       
       /*
        * Accessors for the core.sharedrepository config which lazy-load the value
     @@ config.c: static int git_default_core_config(const char *var, const char *value,
       		return 0;
       	}
       
     -+	if (!strcmp(var, "core.packmtimesuffix")) {
     -+		return git_config_string(&pack_mtime_suffix, var, value);
     ++	if (!strcmp(var, "core.freshenpackfiles")) {
     ++		core_freshen_packfiles = git_config_bool(var, value);
      +	}
      +
       	if (!strcmp(var, "core.deltabasecachelimit")) {
     @@ config.c: static int git_default_core_config(const char *var, const char *value,
       		return 0;
      
       ## environment.c ##
     -@@ environment.c: const char *git_hooks_path;
     - int zlib_compression_level = Z_BEST_SPEED;
     - int core_compression_level;
     - int pack_compression_level = Z_DEFAULT_COMPRESSION;
     -+const char *pack_mtime_suffix = "pack";
     - int fsync_object_files;
     - size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
     - size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
     +@@ environment.c: int core_sparse_checkout_cone;
     + int merge_log_config = -1;
     + int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
     + unsigned long pack_size_limit_cfg;
     ++int core_freshen_packfiles = 1;
     + enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
     + 
     + #ifndef PROTECT_HFS_DEFAULT
      
       ## object-file.c ##
      @@ object-file.c: static int freshen_loose_object(const struct object_id *oid)
       static int freshen_packed_object(const struct object_id *oid)
       {
       	struct pack_entry e;
     -+	struct stat st;
     -+	struct strbuf name_buf = STRBUF_INIT;
     -+	const char *filename;
     ++
     ++	if (!core_freshen_packfiles)
     ++		return 1;
      +
       	if (!find_pack_entry(the_repository, oid, &e))
       		return 0;
       	if (e.p->freshened)
     - 		return 1;
     --	if (!freshen_file(e.p->pack_name))
     --		return 0;
     -+
     -+	filename = e.p->pack_name;
     -+	if (!strcasecmp(pack_mtime_suffix, "pack")) {
     -+		if (!freshen_file(filename))
     -+			return 0;
     -+		e.p->freshened = 1;
     -+		return 1;
     -+	}
     -+
     -+	/* If we want to freshen different file instead of .pack file, we need
     -+	 * to make sure the file exists and create it if needed.
     -+	 */
     -+	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
     -+	if (lstat(filename, &st) < 0) {
     -+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
     -+		if (fd < 0) {
     -+			// here we need to check it again because other git process may created it
     -+			if (lstat(filename, &st) < 0)
     -+				die_errno("unable to create '%s'", filename);
     -+		} else {
     -+			close(fd);
     -+		}
     -+	} else {
     -+		if (!freshen_file(filename))
     -+			return 0;
     -+	}
     -+
     - 	e.p->freshened = 1;
     - 	return 1;
     - }
     -
     - ## packfile.c ##
     -@@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1)
     - 	return odb_pack_name(&buf, sha1, "idx");
     - }
     - 
     -+const char *derive_pack_filename(const char *pack_name, const char *strip,
     -+				const char *suffix, struct strbuf *buf)
     -+{
     -+	size_t len;
     -+	if (!strip_suffix(pack_name, strip, &len) || !len ||
     -+	    pack_name[len - 1] != '.')
     -+		die(_("packfile name '%s' does not end with '.%s'"),
     -+		    pack_name, strip);
     -+	strbuf_add(buf, pack_name, len);
     -+	strbuf_addstr(buf, suffix);
     -+	return buf->buf;
     -+}
     -+
     - static unsigned int pack_used_ctr;
     - static unsigned int pack_mmap_calls;
     - static unsigned int peak_pack_open_windows;
     -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
     - 	 */
     - 	p->pack_size = st.st_size;
     - 	p->pack_local = local;
     -+
     -+	/* If we have different file used to freshen the mtime, we should
     -+	 * use it at a higher priority.
     -+	 */
     -+	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
     -+		struct strbuf name_buf = STRBUF_INIT;
     -+		const char *filename;
     -+
     -+		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
     -+		stat(filename, &st);
     -+	}
     - 	p->mtime = st.st_mtime;
     - 	if (path_len < the_hash_algo->hexsz ||
     - 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
     -
     - ## packfile.h ##
     -@@ packfile.h: char *sha1_pack_name(const unsigned char *sha1);
     -  */
     - char *sha1_pack_index_name(const unsigned char *sha1);
     - 
     -+/*
     -+ * Return the corresponding filename with given suffix from "file_name"
     -+ * which must has "strip" suffix.
     -+ */
     -+const char *derive_pack_filename(const char *file_name, const char *strip,
     -+		const char *suffix, struct strbuf *buf);
     -+
     - /*
     -  * Return the basename of the packfile, omitting any containing directory
     -  * (e.g., "pack-1234abcd[...].pack").
      
       ## t/t7701-repack-unpack-unreachable.sh ##
      @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' '
       	test_must_fail git cat-file -p $obj2
       '
       
     -+test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
     ++test_expect_success 'do not bother loosening old objects without freshen pack time' '
      +	obj1=$(echo three | git hash-object -w --stdin) &&
      +	obj2=$(echo four | git hash-object -w --stdin) &&
     -+	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     -+	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     -+	git -c core.packmtimesuffix=bump prune-packed &&
     ++	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     ++	git -c core.freshenPackFiles=false prune-packed &&
      +	git cat-file -p $obj1 &&
      +	git cat-file -p $obj2 &&
     -+	touch .git/objects/pack/pack-$pack2.bump &&
     -+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
     -+	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
     ++	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
      +	git cat-file -p $obj1 &&
      +	test_must_fail git cat-file -p $obj2
      +'


 Documentation/config/core.txt        | 11 +++++++++++
 cache.h                              |  1 +
 config.c                             |  4 ++++
 environment.c                        |  1 +
 object-file.c                        |  4 ++++
 t/t7701-repack-unpack-unreachable.sh | 14 ++++++++++++++
 6 files changed, 35 insertions(+)


base-commit: 75ae10bc75336db031ee58d13c5037b929235912

Comments

Taylor Blau July 19, 2021, 8:51 p.m. UTC | #1
On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
> From: Sun Chao <16657101987@163.com>
>
> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
> 2014-10-15) avoid writing existing objects by freshen their
> mtime (especially the packfiles contains them) in order to
> aid the correct caching, and some process like find_lru_pack
> can make good decision. However, this is unfriendly to
> incremental backup jobs or services rely on cached file system
> when there are large '.pack' files exists.
>
> For example, after packed all objects, use 'write-tree' to
> create same commit with the same tree and same environments
> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
> notice the '.pack' file's mtime changed. Git servers
> that mount the same NFS disk will re-sync the '.pack' files
> to cached file system which will slow the git commands.
>
> So if add core.freshenPackfiles to indicate whether or not
> packs can be freshened, turning off this option on some
> servers can speed up the execution of some commands on servers
> which use NFS disk instead of local disk.

Hmm. I'm still quite unconvinced that we should be taking this direction
without better motivation. We talked about your assumption that NFS
seems to be invalidating the block cache when updating the inodes that
point at those blocks, but I don't recall seeing further evidence.

Regardless, a couple of idle thoughts:

> +	if (!core_freshen_packfiles)
> +		return 1;

It is important to still freshen the object mtimes even when we cannot
update the pack mtimes. That's why we return 0 when "freshen_file"
returned 0: even if there was an error calling utime, we should still
freshen the object. This is important because it impacts when
unreachable objects are pruned.

So I would have assumed that if a user set "core.freshenPackfiles=false"
that they would still want their object mtimes updated, in which case
the only option we have is to write those objects out loose.

...and that happens by the caller of freshen_packed_object (either
write_object_file() or hash_object_file_literally()) then calling
write_loose_object() if freshen_packed_object() failed. So I would have
expected to see a "return 0" in the case that packfile freshening was
disabled.
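
Taylor's expected semantics could be checked by hand with a sketch along
these lines (hypothetical: it assumes a git build where disabled
freshening returns 0 and falls through to a loose write, which is not
what v3 actually does):

```
# Hypothetical check of the suggested fallback (not v3's behavior):
# with freshening disabled, the object should reappear loose.
git init --quiet demo && cd demo
obj=$(echo payload | git hash-object -w --stdin)   # written loose
git repack -a -d && git prune-packed               # now only packed
echo payload | git -c core.freshenPackFiles=false hash-object -w --stdin
ls .git/objects/$(echo "$obj" | cut -c1-2)/        # loose copy expected back
```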

But that leads us to an interesting problem: how many redundant objects
do we expect to see on the server? It may be a lot, in which case you
may end up having the same IO problems for a different reason. Peff
mentioned to me off-list that he suspected write-tree was overeager in
how many trees it would try to write out. I'm not sure.

> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
> +	obj1=$(echo three | git hash-object -w --stdin) &&
> +	obj2=$(echo four | git hash-object -w --stdin) &&
> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
> +	git -c core.freshenPackFiles=false prune-packed &&
> +	git cat-file -p $obj1 &&
> +	git cat-file -p $obj2 &&
> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
> +	git cat-file -p $obj1 &&
> +	test_must_fail git cat-file -p $obj2
> +'

I had a little bit of a hard time following this test. AFAICT, it
proceeds as follows:

  - Write two packs, each containing a unique unreachable blob.
  - Call 'git prune-packed' with packfile freshening disabled, then
    check that the object survived.
  - Then repack while in a state where one of the pack's contents would
    be pruned.
  - Make sure that one object survives and the other doesn't.

This doesn't really seem to be testing the behavior of disabling
packfile freshening so much as it's testing prune-packed, and repack's
`--unpack-unreachable` option. I would probably have expected to see
something more along the lines of:

  - Write an unreachable object, pack it, and then remove the loose copy
    you wrote in the first place.
  - Then roll the pack's mtime to some fixed value in the past.
  - Try to write the same object again with packfile freshening
    disabled, and verify that:
    - the pack's mtime was unchanged,
    - the object exists loose again
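
In t/ terms, that outline might look something like the sketch below.
This is illustrative only: it assumes the suggested "return 0"
semantics, and 'test-tool chmtime --get' prints a file's mtime:

```
test_expect_success 'freshening disabled: pack mtime untouched, object loosened' '
	obj=$(echo stale | git hash-object -w --stdin) &&
	pack=$(echo $obj | git pack-objects .git/objects/pack/pack) &&
	git prune-packed &&
	test-tool chmtime =-86400 .git/objects/pack/pack-$pack.pack &&
	before=$(test-tool chmtime --get .git/objects/pack/pack-$pack.pack) &&
	echo stale | git -c core.freshenPackFiles=false hash-object -w --stdin &&
	after=$(test-tool chmtime --get .git/objects/pack/pack-$pack.pack) &&
	test "$before" = "$after" &&
	test_path_is_file .git/objects/$(test_oid_to_path $obj)
'
```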

But I'd really like to get some other opinions (especially from Peff,
who brought up the potential concerns with write-tree) as to whether or
not this is a direction worth pursuing.

My opinion is that it is not, and that the bizarre caching behavior you
are seeing is out of Git's control.

Thanks,
Taylor
Junio C Hamano July 20, 2021, 12:07 a.m. UTC | #2
Taylor Blau <me@ttaylorr.com> writes:

> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

Me neither.  Not touching the pack and not updating the "most
recently used" time of individual objects smells like a recipe
for repository corruption.

> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.
Ævar Arnfjörð Bjarmason July 20, 2021, 6:19 a.m. UTC | #3
On Mon, Jul 19 2021, Taylor Blau wrote:

> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>> From: Sun Chao <16657101987@163.com>
>>
>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>> 2014-10-15) avoid writing existing objects by freshen their
>> mtime (especially the packfiles contains them) in order to
>> aid the correct caching, and some process like find_lru_pack
>> can make good decision. However, this is unfriendly to
>> incremental backup jobs or services rely on cached file system
>> when there are large '.pack' files exists.
>>
>> For example, after packed all objects, use 'write-tree' to
>> create same commit with the same tree and same environments
>> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
>> notice the '.pack' file's mtime changed. Git servers
>> that mount the same NFS disk will re-sync the '.pack' files
>> to cached file system which will slow the git commands.
>>
>> So if add core.freshenPackfiles to indicate whether or not
>> packs can be freshened, turning off this option on some
>> servers can speed up the execution of some commands on servers
>> which use NFS disk instead of local disk.
>
> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

I don't know about Sun's setup, but what he's describing is consistent
with how NFS works, or can commonly be made to work.

See e.g. "lookupcache" in nfs(5) on Linux, but also a lot of people use
some random vendor's proprietary NFS implementation, and commonly tweak
various options that make it anywhere between "I guess that's not too
crazy" and "are you kidding me?" levels of non-POSIX compliant.

> Regardless, a couple of idle thoughts:
>
>> +	if (!core_freshen_packfiles)
>> +		return 1;
>
> It is important to still freshen the object mtimes even when we cannot
> update the pack mtimes. That's why we return 0 when "freshen_file"
> returned 0: even if there was an error calling utime, we should still
> freshen the object. This is important because it impacts when
> unreachable objects are pruned.
>
> So I would have assumed that if a user set "core.freshenPackfiles=false"
> that they would still want their object mtimes updated, in which case
> the only option we have is to write those objects out loose.
>
> ...and that happens by the caller of freshen_packed_object (either
> write_object_file() or hash_object_file_literally()) then calling
> write_loose_object() if freshen_packed_object() failed. So I would have
> expected to see a "return 0" in the case that packfile freshening was
> disabled.
>
> But that leads us to an interesting problem: how many redundant objects
> do we expect to see on the server? It may be a lot, in which case you
> may end up having the same IO problems for a different reason. Peff
> mentioned to me off-list that he suspected write-tree was overeager in
> how many trees it would try to write out. I'm not sure.

In my experience with NFS the thing that kills you is anything that
needs to do iterations, i.e. recursive readdir() and the like, or to
read a lot of data; throughput was excellent. It's why I hacked up
that core.checkCollisions patch.

Jeff improved the situation I was mainly trying to fix with the
loose objects cache. I never got around to benchmarking the two in
production, and now that setup is at an ex-job...

>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	git -c core.freshenPackFiles=false prune-packed &&
>> +	git cat-file -p $obj1 &&
>> +	git cat-file -p $obj2 &&
>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>> +	git cat-file -p $obj1 &&
>> +	test_must_fail git cat-file -p $obj2
>> +'
>
> I had a little bit of a hard time following this test. AFAICT, it
> proceeds as follows:
>
>   - Write two packs, each containing a unique unreachable blob.
>   - Call 'git prune-packed' with packfile freshening disabled, then
>     check that the object survived.
>   - Then repack while in a state where one of the pack's contents would
>     be pruned.
>   - Make sure that one object survives and the other doesn't.
>
> This doesn't really seem to be testing the behavior of disabling
> packfile freshening so much as it's testing prune-packed, and repack's
> `--unpack-unreachable` option. I would probably have expected to see
> something more along the lines of:
>
>   - Write an unreachable object, pack it, and then remove the loose copy
>     you wrote in the first place.
>   - Then roll the pack's mtime to some fixed value in the past.
>   - Try to write the same object again with packfile freshening
>     disabled, and verify that:
>     - the pack's mtime was unchanged,
>     - the object exists loose again
>
> But I'd really like to get some other opinions (especially from Peff,
> who brought up the potential concerns with write-tree) as to whether or
> not this is a direction worth pursuing.
>
> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.

Thanks for this, I found the test hard to follow too, but didn't have
time to really think about it; this makes sense.

Back to the topic: I share your sentiment of trying to avoid complexity
in this area.

Sun: Have you considered --keep-unreachable to "git repack"? It's
orthogonal to what you're trying here, but I wonder if being more
aggressive about keeping objects + some improvements to perhaps skip
this "touch" at all if we have that in effect wouldn't be more viable &
something e.g. Taylor would be more comfortable having as part of git.git.
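
For reference, the invocation Ævar is suggesting to experiment with
would be along the lines of:

```
git repack -a -d --keep-unreachable
```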
孙超 (Sun Chao) July 20, 2021, 3 p.m. UTC | #4
> On Jul 20, 2021, at 04:51, Taylor Blau <me@ttaylorr.com> wrote:
> 
> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>> From: Sun Chao <16657101987@163.com>
>> 
>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>> 2014-10-15) avoid writing existing objects by freshen their
>> mtime (especially the packfiles contains them) in order to
>> aid the correct caching, and some process like find_lru_pack
>> can make good decision. However, this is unfriendly to
>> incremental backup jobs or services rely on cached file system
>> when there are large '.pack' files exists.
>> 
>> For example, after packed all objects, use 'write-tree' to
>> create same commit with the same tree and same environments
>> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
>> notice the '.pack' file's mtime changed. Git servers
>> that mount the same NFS disk will re-sync the '.pack' files
>> to cached file system which will slow the git commands.
>> 
>> So if add core.freshenPackfiles to indicate whether or not
>> packs can be freshened, turning off this option on some
>> servers can speed up the execution of some commands on servers
>> which use NFS disk instead of local disk.
> 
> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

Yes, these days I'm trying to ask our SRE for help doing tests in our
production environment (where we can get the real traffic reports of the
NFS server), such as:

- set up a repository with some large packfiles on an NFS disk
- keep running 'git pack-objects --all --stdout --delta-base-offset >/dev/null' in
multiple git servers that mount the same NFS disk above
- 'touch' the packfiles on another server, and check (a) if the IOPS and IO traffic
of the NFS server change and (b) if the IO traffic of the network interfaces from
the git servers to the NFS server changes

I'd like to share the data when I receive the reports.
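
The load-generation step of that plan could be scripted roughly as
follows (the repository path is a placeholder):

```
# run on each git server mounting the NFS disk
for i in $(seq 1 1000); do
	git -C /mnt/nfs/repo.git pack-objects --all --stdout \
		--delta-base-offset >/dev/null
done
```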

Meanwhile, I found the description of the cached file system for the NFS client:
   https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system
It mentions that:

  3. To ensure that the cached directories and files are kept up to date, 
     CacheFS periodically checks the consistency of files stored in the cache.
     It does this by comparing the current modification time to the previous
     modification time.
  4. If the modification times are different, all data and attributes
     for the directory or file are purged from the cache, and new data and
     attributes are retrieved from the back file system.

It looks reasonable, but perhaps I should verify it against my test reports ;)

> 
> Regardless, a couple of idle thoughts:
> 
>> +	if (!core_freshen_packfiles)
>> +		return 1;
> 
> It is important to still freshen the object mtimes even when we cannot
> update the pack mtimes. That's why we return 0 when "freshen_file"
> returned 0: even if there was an error calling utime, we should still
> freshen the object. This is important because it impacts when
> unreachable objects are pruned.
> 
> So I would have assumed that if a user set "core.freshenPackfiles=false"
> that they would still want their object mtimes updated, in which case
> the only option we have is to write those objects out loose.
> 
> ...and that happens by the caller of freshen_packed_object (either
> write_object_file() or hash_object_file_literally()) then calling
> write_loose_object() if freshen_packed_object() failed. So I would have
> expected to see a "return 0" in the case that packfile freshening was
> disabled.
> 
> But that leads us to an interesting problem: how many redundant objects
> do we expect to see on the server? It may be a lot, in which case you
> may end up having the same IO problems for a different reason. Peff
> mentioned to me off-list that he suspected write-tree was overeager in
> how many trees it would try to write out. I'm not sure.

You are right, I hadn't thought it through in detail: if we do not update the
mtime of either the packfiles or the loose objects, 'prune' may delete them by accident.
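A toy sketch of that risk (plain files standing in for objects, and `find -mmin` standing in for the prune cutoff — both my own stand-ins, not git's actual prune code): an object whose mtime is never refreshed falls past the expiry window and gets deleted, while a freshened one survives:

```shell
#!/bin/sh
# Illustration only: why skipping the mtime update can get objects pruned.
workdir=$(mktemp -d)
mkdir "$workdir/objects"
echo one >"$workdir/objects/obj1"
echo two >"$workdir/objects/obj2"

# Both objects were written long before the prune cutoff (here: 1 hour ago).
touch -m -t 202101010000 "$workdir/objects/obj1" "$workdir/objects/obj2"

# Re-writing obj1 "freshens" it instead of storing a duplicate ...
touch "$workdir/objects/obj1"
# ... but with freshening disabled, obj2's mtime stays stale.

# Prune removes everything older than the cutoff, like 'git prune --expire'.
find "$workdir/objects" -type f -mmin +60 -delete

survivors=$(ls "$workdir/objects")   # only obj1 survives
echo "$survivors"
rm -rf "$workdir"
```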

> 
>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	git -c core.freshenPackFiles=false prune-packed &&
>> +	git cat-file -p $obj1 &&
>> +	git cat-file -p $obj2 &&
>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>> +	git cat-file -p $obj1 &&
>> +	test_must_fail git cat-file -p $obj2
>> +'
> 
> I had a little bit of a hard time following this test. AFAICT, it
> proceeds as follows:
> 
>  - Write two packs, each containing a unique unreachable blob.
>  - Call 'git prune-packed' with packfile freshening disabled, then
>    check that the object survived.
>  - Then repack while in a state where one of the pack's contents would
>    be pruned.
>  - Make sure that one object survives and the other doesn't.
> 
> This doesn't really seem to be testing the behavior of disabling
> packfile freshening so much as it's testing prune-packed, and repack's
> `--unpack-unreachable` option. I would probably have expected to see
> something more along the lines of:
> 
>  - Write an unreachable object, pack it, and then remove the loose copy
>    you wrote in the first place.
>  - Then roll the pack's mtime to some fixed value in the past.
>  - Try to write the same object again with packfile freshening
>    disabled, and verify that:
>    - the pack's mtime was unchanged,
>    - the object exists loose again
> 
> But I'd really like to get some other opinions (especially from Peff,
> who brought up the potential concerns with write-tree) as to whether or
> not this is a direction worth pursuing.
> 
> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.
OK, I will try this. I will also try to get the test reports from our SRE and
check how the mtime impacts the caches.

> 
> Thanks,
> Taylor
孙超 July 20, 2021, 3:07 p.m. UTC | #5
> On Jul 20, 2021, at 08:07, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> Hmm. I'm still quite unconvinced that we should be taking this direction
>> without better motivation. We talked about your assumption that NFS
>> seems to be invalidating the block cache when updating the inodes that
>> point at those blocks, but I don't recall seeing further evidence.
> 
> Me neither.  Not touching the pack and not updating the "most
> recently used" time of individual objects smells like a recipe
> for repository corruption.
> 
>> My opinion is that it is not, and that the bizarre caching behavior you
>> are seeing is out of Git's control.
> 
Thanks Junio, I will try to get a more detailed report of the NFS caches and
share it if it is valuable. Not touching the mtime of packfiles really has
potential problems, just as Taylor said.
孙超 July 20, 2021, 3:34 p.m. UTC | #6
> On Jul 20, 2021, at 14:19, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Mon, Jul 19 2021, Taylor Blau wrote:
> 
>> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>>> From: Sun Chao <16657101987@163.com>
>>> 
>>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>>> 2014-10-15) avoid writing existing objects by freshen their
>>> mtime (especially the packfiles contains them) in order to
>>> aid the correct caching, and some process like find_lru_pack
>>> can make good decision. However, this is unfriendly to
>>> incremental backup jobs or services rely on cached file system
>>> when there are large '.pack' files exists.
>>> 
>>> For example, after packed all objects, use 'write-tree' to
>>> create same commit with the same tree and same environments
>>> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
>>> notice the '.pack' file's mtime changed. Git servers
>>> that mount the same NFS disk will re-sync the '.pack' files
>>> to cached file system which will slow the git commands.
>>> 
>>> So if add core.freshenPackfiles to indicate whether or not
>>> packs can be freshened, turning off this option on some
>>> servers can speed up the execution of some commands on servers
>>> which use NFS disk instead of local disk.
>> 
>> Hmm. I'm still quite unconvinced that we should be taking this direction
>> without better motivation. We talked about your assumption that NFS
>> seems to be invalidating the block cache when updating the inodes that
>> point at those blocks, but I don't recall seeing further evidence.
> 
> I don't know about Sun's setup, but what he's describing is consistent
> with how NFS works, or can commonly be made to work.
> 
> See e.g. "lookupcache" in nfs(5) on Linux, but also a lot of people use
> some random vendor's proprietary NFS implementation, and commonly tweak
> various options that make it anywhere between "I guess that's not too
> crazy" and "are you kidding me?" levels of non-POSIX compliant.
> 
>> Regardless, a couple of idle thoughts:
>> 
>>> +	if (!core_freshen_packfiles)
>>> +		return 1;
>> 
>> It is important to still freshen the object mtimes even when we cannot
>> update the pack mtimes. That's why we return 0 when "freshen_file"
>> returned 0: even if there was an error calling utime, we should still
>> freshen the object. This is important because it impacts when
>> unreachable objects are pruned.
>> 
>> So I would have assumed that if a user set "core.freshenPackfiles=false"
>> that they would still want their object mtimes updated, in which case
>> the only option we have is to write those objects out loose.
>> 
>> ...and that happens by the caller of freshen_packed_object (either
>> write_object_file() or hash_object_file_literally()) then calling
>> write_loose_object() if freshen_packed_object() failed. So I would have
>> expected to see a "return 0" in the case that packfile freshening was
>> disabled.
>> 
>> But that leads us to an interesting problem: how many redundant objects
>> do we expect to see on the server? It may be a lot, in which case you
>> may end up having the same IO problems for a different reason. Peff
>> mentioned to me off-list that he suspected write-tree was overeager in
>> how many trees it would try to write out. I'm not sure.
> 
> In my experience with NFS the thing that kills you is anything that
> needs to do iterations, i.e. recursive readdir() and the like, or to
> read a lot of data, throughput was excellent. It's why I hacked up
> that core.checkCollisions patch.
I have read your patch; reducing expensive operations like readdir() looks
like a good idea. In my production environment, the IOPS pressure on the NFS
server bothers me and makes the git commands slow.
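For what it's worth, the client-side caching behaviour Ævar refers to is typically tuned through mount options rather than in git; a hypothetical example based on nfs(5) (the option values and the export path are illustrative, and proprietary NFS implementations may use different knobs entirely):

```
# Illustrative /etc/fstab entry; values are assumptions, see nfs(5).
# actimeo: how long file attributes (including mtime) are cached before
#          the client re-checks them against the server;
# lookupcache: how aggressively directory lookups are cached.
nfsserver:/export/git  /var/git  nfs  rw,actimeo=600,lookupcache=all  0 0
```

If the attribute cache timeout can be raised safely, the mtime bumps from freshening would be noticed far less often, without changing git's behaviour.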

> 
> Jeff improved the situation I was mainly trying to fix with the
> loose objects cache. I never got around to benchmarking the two in
> production, and now that setup is at an ex-job...
> 
>>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>>> +	git -c core.freshenPackFiles=false prune-packed &&
>>> +	git cat-file -p $obj1 &&
>>> +	git cat-file -p $obj2 &&
>>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>>> +	git cat-file -p $obj1 &&
>>> +	test_must_fail git cat-file -p $obj2
>>> +'
>> 
>> I had a little bit of a hard time following this test. AFAICT, it
>> proceeds as follows:
>> 
>>  - Write two packs, each containing a unique unreachable blob.
>>  - Call 'git prune-packed' with packfile freshening disabled, then
>>    check that the object survived.
>>  - Then repack while in a state where one of the pack's contents would
>>    be pruned.
>>  - Make sure that one object survives and the other doesn't.
>> 
>> This doesn't really seem to be testing the behavior of disabling
>> packfile freshening so much as it's testing prune-packed, and repack's
>> `--unpack-unreachable` option. I would probably have expected to see
>> something more along the lines of:
>> 
>>  - Write an unreachable object, pack it, and then remove the loose copy
>>    you wrote in the first place.
>>  - Then roll the pack's mtime to some fixed value in the past.
>>  - Try to write the same object again with packfile freshening
>>    disabled, and verify that:
>>    - the pack's mtime was unchanged,
>>    - the object exists loose again
>> 
>> But I'd really like to get some other opinions (especially from Peff,
>> who brought up the potential concerns with write-tree) as to whether or
>> not this is a direction worth pursuing.
>> 
>> My opinion is that it is not, and that the bizarre caching behavior you
>> are seeing is out of Git's control.
> 
> Thanks for this, I found the test hard to follow too, but didn't have
> time to really think about it, this makes sense.
> 
> Back to the topic: I share your sentiment of trying to avoid complexity
> in this area.
> 
> Sun: Have you considered passing --keep-unreachable to "git repack"? It's
> orthogonal to what you're trying here, but I wonder if being more
> aggressive about keeping objects + some improvements to perhaps skip
> this "touch" at all if we have that in effect wouldn't be more viable &
> something e.g. Taylor would be more comfortable having as part of git.git.

Yes, I will try to create some more useful test cases with both `--keep-unreachable`
and `--unpack-unreachable` if I still believe I need to do something with
the mtime; in any case I need to get my test reports first, thanks ;)
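For the record, a rough sketch of the test Taylor outlined (untested; it relies on git's test harness helpers such as test_expect_success, test-tool chmtime and test_oid_to_path, and on a final core.freshenPackFiles design that may still change):

```
test_expect_success 'disabled freshening leaves pack mtime alone, writes loose' '
	obj=$(echo five | git hash-object -w --stdin) &&
	pack=$(echo $obj | git pack-objects .git/objects/pack/pack) &&
	git prune-packed &&
	test-tool chmtime =-86400 .git/objects/pack/pack-$pack.pack &&
	before=$(test-tool chmtime --get .git/objects/pack/pack-$pack.pack) &&
	echo five | git -c core.freshenPackFiles=false hash-object -w --stdin &&
	after=$(test-tool chmtime --get .git/objects/pack/pack-$pack.pack) &&
	test "$before" = "$after" &&
	test_path_is_file .git/objects/$(test_oid_to_path $obj)
'
```

This checks both of Taylor's conditions directly: the pack's mtime is unchanged, and the object exists loose again.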
Taylor Blau July 20, 2021, 4:53 p.m. UTC | #7
On Tue, Jul 20, 2021 at 11:00:18PM +0800, Sun Chao wrote:
> Meanwhile I find the description of the cached file system for NFS Client:
>    https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system
> It is mentioned that:
>
>   3. To ensure that the cached directories and files are kept up to date,
>      CacheFS periodically checks the consistency of files stored in the cache.
>      It does this by comparing the current modification time to the previous
>      modification time.
>   4. If the modification times are different, all data and attributes
>      for the directory or file are purged from the cache, and new data and
>      attributes are retrieved from the back file system.
>
> It looks reasonable, but perhaps I should verify it against my test reports ;)

That seems reasonable, assuming that you have CacheFS mounted and that's
what you're interacting with (instead of talking to NFS directly).

It seems reasonable that CacheFS would also have some way to tune how
often the "purge cached blocks because of stale mtimes" would kick in,
and what the grace period for determining if an mtime is "stale" is. So
hopefully those values are (a) configurable and (b) you can find values
that result in acceptable performance.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..1e7cf366628 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -398,6 +398,17 @@  the largest projects.  You probably do not need to adjust this value.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.freshenPackFiles::
+	Normally we avoid writing an existing object by freshening the mtime
+	of the *.pack file which contains it, in order to aid processes
+	such as prune. Turning off this option on some servers can speed
+	up the execution of some commands like 'git-upload-pack' (e.g. some
+	servers that mount the same NFS disk will re-sync the *.pack files
+	to the cached file system if the mtime changes).
++
+The default is true, which means the *.pack file will be freshened if we
+want to write an existing object within it.
+
 core.deltaBaseCacheLimit::
 	Maximum number of bytes per thread to reserve for caching base objects
 	that may be referenced by multiple deltified objects.  By storing the
diff --git a/cache.h b/cache.h
index ba04ff8bd36..46126c6977c 100644
--- a/cache.h
+++ b/cache.h
@@ -956,6 +956,7 @@  extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern int core_freshen_packfiles;
 
 /*
  * Accessors for the core.sharedrepository config which lazy-load the value
diff --git a/config.c b/config.c
index f9c400ad306..02dcc8a028e 100644
--- a/config.c
+++ b/config.c
@@ -1431,6 +1431,11 @@  static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.freshenpackfiles")) {
+		core_freshen_packfiles = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.deltabasecachelimit")) {
 		delta_base_cache_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2f27008424a..397525609a8 100644
--- a/environment.c
+++ b/environment.c
@@ -73,6 +73,7 @@  int core_sparse_checkout_cone;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
+int core_freshen_packfiles = 1;
 enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
 
 #ifndef PROTECT_HFS_DEFAULT
diff --git a/object-file.c b/object-file.c
index f233b440b22..884c3e92c38 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1974,6 +1974,10 @@  static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+
+	if (!core_freshen_packfiles)
+		return 1;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index 937f89ee8c8..b6a0b6c9695 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -112,6 +112,20 @@  test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'do not bother loosening old objects without freshen pack time' '
+	obj1=$(echo three | git hash-object -w --stdin) &&
+	obj2=$(echo four | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
+	git -c core.freshenPackFiles=false prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&