[0/1] send-pack: set core.warnAmbiguousRefs=false

Message ID pull.68.git.gitgitgadget@gmail.com (mailing list archive)

Message

Derrick Stolee via GitGitGadget Nov. 6, 2018, 7:13 p.m. UTC
I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

A git push process runs several processes during its run, but one includes 
git send-pack which calls git pack-objects and passes the known have/wants
into stdin using object ids. However, the default setting for 
core.warnAmbiguousRefs requires git pack-objects to check for ref names
matching the ref_rev_parse_rules array in refs.c. This means that every
object is triggering at least six "file exists?" queries.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.
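
To make the "at least six queries per object" figure concrete, here is a minimal sketch in C. It assumes the six lookup patterns listed below, which are modeled on the ref_rev_parse_rules array in refs.c; the helper name count_ref_candidates() is hypothetical and is not part of git. It only expands the patterns for one name and counts them, without touching the filesystem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Patterns modeled on ref_rev_parse_rules in refs.c. Treat the exact
 * list as an assumption made for illustration.
 */
static const char *const ref_rules[] = {
	"%.*s",
	"refs/%.*s",
	"refs/tags/%.*s",
	"refs/heads/%.*s",
	"refs/remotes/%.*s",
	"refs/remotes/%.*s/HEAD",
};
#define NUM_REF_RULES (sizeof(ref_rules) / sizeof(ref_rules[0]))

/*
 * Hypothetical helper: expand every pattern for `name` and return how
 * many candidate ref paths would have to be probed. If `out` is not
 * NULL, each candidate path is printed to it.
 */
static size_t count_ref_candidates(const char *name, FILE *out)
{
	char path[256];
	int len = (int)strlen(name);
	size_t i, probes = 0;

	for (i = 0; i < NUM_REF_RULES; i++) {
		int n = snprintf(path, sizeof(path), ref_rules[i], len, name);

		/* Fail loudly rather than count a truncated path. */
		assert(n >= 0 && (size_t)n < sizeof(path));
		if (out)
			fprintf(out, "would probe: %s\n", path);
		probes++;
	}
	return probes;
}
```

Passing a 40-hex object id and stdout to count_ref_candidates() lists six candidate paths, none of which normally exist; with the check enabled, that work is repeated for every id fed on stdin.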

The fix for this is simple: set core.warnAmbiguousRefs to false for this
specific call of git pack-objects coming from git send-pack. We don't want
to default it to false for all calls to git pack-objects, as it is valid to
pass ref names instead of object ids. This helps regain these seconds during
a push.

In addition to this patch submission, we are looking into merging it into
our fork sooner [1].

[1] https://github.com/Microsoft/git/pull/67

Derrick Stolee (1):
  send-pack: set core.warnAmbiguousRefs=false

 send-pack.c | 2 ++
 1 file changed, 2 insertions(+)


base-commit: cae598d9980661a978e2df4fb338518f7bf09572
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-68/derrickstolee/send-pack-config-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/68

Comments

Jeff King Nov. 6, 2018, 7:44 p.m. UTC | #1
On Tue, Nov 06, 2018 at 11:13:47AM -0800, Derrick Stolee via GitGitGadget wrote:

> I've been looking into the performance of git push for very large repos. Our
> users are reporting that 60-80% of git push time is spent during the
> "Enumerating objects" phase of git pack-objects.
> 
> A git push process runs several processes during its run, but one includes 
> git send-pack which calls git pack-objects and passes the known have/wants
> into stdin using object ids. However, the default setting for 
> core.warnAmbiguousRefs requires git pack-objects to check for ref names
> matching the ref_rev_parse_rules array in refs.c. This means that every
> object is triggering at least six "file exists?" queries.
> 
> When there are a lot of refs, this can add up significantly! My PerfView
> trace for a simple push measured 3 seconds spent checking these paths.

Some of this might be useful in the commit message. :)

> The fix for this is simple: set core.warnAmbiguousRefs to false for this
> specific call of git pack-objects coming from git send-pack. We don't want
> to default it to false for all calls to git pack-objects, as it is valid to
> pass ref names instead of object ids. This helps regain these seconds during
> a push.

I don't think you actually care about the ambiguity check between refs
here; you just care about avoiding the ref check when we've seen (and
are mostly expecting) a 40-hex sha1. We have a more specific flag for
that: warn_on_object_refname_ambiguity.

And I think it would be OK to enable that all the time for pack-objects,
which is plumbing that does typically expect object names. See prior art
in 25fba78d36 (cat-file: disable object/refname ambiguity check for
batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
ambiguity check with --stdin, 2014-03-12).

> Derrick Stolee (1):
>   send-pack: set core.warnAmbiguousRefs=false
> 
>  send-pack.c | 2 ++
>  1 file changed, 2 insertions(+)

Whenever I see a change like this to the pack-objects invocation for
send-pack, it makes me wonder if upload-pack would want the same thing.

It's a moot point if we just set the flag directly inside
pack-objects, though.

-Peff
Jeff King Nov. 6, 2018, 7:51 p.m. UTC | #2
On Tue, Nov 06, 2018 at 02:44:42PM -0500, Jeff King wrote:

> > The fix for this is simple: set core.warnAmbiguousRefs to false for this
> > specific call of git pack-objects coming from git send-pack. We don't want
> > to default it to false for all calls to git pack-objects, as it is valid to
> > pass ref names instead of object ids. This helps regain these seconds during
> > a push.
> 
> I don't think you actually care about the ambiguity check between refs
> here; you just care about avoiding the ref check when we've seen (and
> are mostly expecting) a 40-hex sha1. We have a more specific flag for
> that: warn_on_object_refname_ambiguity.
> 
> And I think it would be OK to enable that all the time for pack-objects,
> which is plumbing that does typically expect object names. See prior art
> in 25fba78d36 (cat-file: disable object/refname ambiguity check for
> batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
> ambiguity check with --stdin, 2014-03-12).

I'd probably do it here:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e50c6cd1ff..d370638a5d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3104,6 +3104,7 @@ static void get_object_list(int ac, const char **av)
 	struct rev_info revs;
 	char line[1000];
 	int flags = 0;
+	int save_warning;
 
 	repo_init_revisions(the_repository, &revs, NULL);
 	save_commit_buffer = 0;
@@ -3112,6 +3113,9 @@ static void get_object_list(int ac, const char **av)
 	/* make sure shallows are read */
 	is_repository_shallow(the_repository);
 
+	save_warning = warn_on_object_refname_ambiguity;
+	warn_on_object_refname_ambiguity = 0;
+
 	while (fgets(line, sizeof(line), stdin) != NULL) {
 		int len = strlen(line);
 		if (len && line[len - 1] == '\n')
@@ -3138,6 +3142,8 @@ static void get_object_list(int ac, const char **av)
 			die(_("bad revision '%s'"), line);
 	}
 
+	warn_on_object_refname_ambiguity = save_warning;
+
 	if (use_bitmap_index && !get_object_list_from_bitmap(&revs))
 		return;
 

But I'll leave it to you to wrap that up in a patch, since you probably
should re-check your timings (which it would be interesting to include
in the commit message, if you have reproducible timings).
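
The save/restore pattern in the diff above can be sketched outside of git as follows. This is an illustration under assumptions, not git's code: the global below only borrows the name of git's flag and is assumed to default to 1, while resolve_name(), read_object_list() and the ref_probes counter are hypothetical stand-ins for the real name resolution and its filesystem cost.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for git's global flag of the same name (assumed default: 1). */
static int warn_on_object_refname_ambiguity = 1;

/* Hypothetical counter for the "file exists?" queries that get issued. */
static size_t ref_probes;

/*
 * Hypothetical stand-in for name resolution: when the flag is set, each
 * name costs six ref lookups (one per lookup pattern).
 */
static void resolve_name(const char *name)
{
	(void)name;
	if (warn_on_object_refname_ambiguity)
		ref_probes += 6;
}

/*
 * Mirrors the shape of the proposed change: save the flag, clear it for
 * the duration of the input loop, then restore it so later callers see
 * the value they expect. Returns the total number of probes so far.
 */
static size_t read_object_list(const char **lines, size_t nr)
{
	int save_warning = warn_on_object_refname_ambiguity;
	size_t i;

	warn_on_object_refname_ambiguity = 0;
	for (i = 0; i < nr; i++)
		resolve_name(lines[i]);
	warn_on_object_refname_ambiguity = save_warning;

	return ref_probes;
}
```

Restoring the saved value, rather than unconditionally setting the flag back to 1, keeps the function correct even if a caller had already disabled the check.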

-Peff
Derrick Stolee Nov. 6, 2018, 8 p.m. UTC | #3
On 11/6/2018 2:44 PM, Jeff King wrote:
> On Tue, Nov 06, 2018 at 11:13:47AM -0800, Derrick Stolee via GitGitGadget wrote:
>
>> I've been looking into the performance of git push for very large repos. Our
>> users are reporting that 60-80% of git push time is spent during the
>> "Enumerating objects" phase of git pack-objects.
>>
>> A git push process runs several processes during its run, but one includes
>> git send-pack which calls git pack-objects and passes the known have/wants
>> into stdin using object ids. However, the default setting for
>> core.warnAmbiguousRefs requires git pack-objects to check for ref names
>> matching the ref_rev_parse_rules array in refs.c. This means that every
>> object is triggering at least six "file exists?" queries.
>>
>> When there are a lot of refs, this can add up significantly! My PerfView
>> trace for a simple push measured 3 seconds spent checking these paths.
> Some of this might be useful in the commit message. :)
>
>> The fix for this is simple: set core.warnAmbiguousRefs to false for this
>> specific call of git pack-objects coming from git send-pack. We don't want
>> to default it to false for all calls to git pack-objects, as it is valid to
>> pass ref names instead of object ids. This helps regain these seconds during
>> a push.
> I don't think you actually care about the ambiguity check between refs
> here; you just care about avoiding the ref check when we've seen (and
> are mostly expecting) a 40-hex sha1. We have a more specific flag for
> that: warn_on_object_refname_ambiguity.
>
> And I think it would be OK to enable that all the time for pack-objects,
> which is plumbing that does typically expect object names. See prior art
> in 25fba78d36 (cat-file: disable object/refname ambiguity check for
> batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
> ambiguity check with --stdin, 2014-03-12).
Thanks for these pointers. Helps to know there is precedent for shutting
down the behavior without relying on "-c" flags.

> Whenever I see a change like this to the pack-objects invocation for
> send-pack, it makes me wonder if upload-pack would want the same thing.
>
> It's a moot point if we just set the flag directly inside
> pack-objects, though.
I'll send a v2 that does just that.

Thanks,
-Stolee
Derrick Stolee Nov. 6, 2018, 8:16 p.m. UTC | #4
On 11/6/2018 2:51 PM, Jeff King wrote:
> On Tue, Nov 06, 2018 at 02:44:42PM -0500, Jeff King wrote:
>
>>> The fix for this is simple: set core.warnAmbiguousRefs to false for this
>>> specific call of git pack-objects coming from git send-pack. We don't want
>>> to default it to false for all calls to git pack-objects, as it is valid to
>>> pass ref names instead of object ids. This helps regain these seconds during
>>> a push.
>> I don't think you actually care about the ambiguity check between refs
>> here; you just care about avoiding the ref check when we've seen (and
>> are mostly expecting) a 40-hex sha1. We have a more specific flag for
>> that: warn_on_object_refname_ambiguity.
>>
>> And I think it would be OK to enable that all the time for pack-objects,
>> which is plumbing that does typically expect object names. See prior art
>> in 25fba78d36 (cat-file: disable object/refname ambiguity check for
>> batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
>> ambiguity check with --stdin, 2014-03-12).
> I'd probably do it here:
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index e50c6cd1ff..d370638a5d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3104,6 +3104,7 @@ static void get_object_list(int ac, const char **av)

Scoping the change into get_object_list does make sense. I was doing it 
a level higher, which is not worth it. I'll reproduce your change here.

>   	struct rev_info revs;
>   	char line[1000];
>   	int flags = 0;
> +	int save_warning;
>   
>   	repo_init_revisions(the_repository, &revs, NULL);
>   	save_commit_buffer = 0;
> @@ -3112,6 +3113,9 @@ static void get_object_list(int ac, const char **av)
>   	/* make sure shallows are read */
>   	is_repository_shallow(the_repository);
>   
> +	save_warning = warn_on_object_refname_ambiguity;
> +	warn_on_object_refname_ambiguity = 0;
> +
>   	while (fgets(line, sizeof(line), stdin) != NULL) {
>   		int len = strlen(line);
>   		if (len && line[len - 1] == '\n')
> @@ -3138,6 +3142,8 @@ static void get_object_list(int ac, const char **av)
>   			die(_("bad revision '%s'"), line);
>   	}
>   
> +	warn_on_object_refname_ambiguity = save_warning;
> +
>   	if (use_bitmap_index && !get_object_list_from_bitmap(&revs))
>   		return;
>   
>
> But I'll leave it to you to wrap that up in a patch, since you probably
> should re-check your timings (which it would be interesting to include
> in the commit message, if you have reproducible timings).

The timings change a lot depending on the disk cache and the remote 
refs, which is unfortunate, but I have measured a three-second improvement.

Thanks,
-Stolee