diff mbox series

[v1] add: speed up cmd_add() by utilizing read_cache_preload()

Message ID 20181102133050.10756-1-peartben@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v1] add: speed up cmd_add() by utilizing read_cache_preload() | expand

Commit Message

Ben Peart Nov. 2, 2018, 1:30 p.m. UTC
From: Ben Peart <benpeart@microsoft.com>

During an "add", a call is made to run_diff_files() which calls
check_remove() for each index-entry.  The preload_index() code distributes
some of the costs across multiple threads.

Because the files checked are restricted to pathspec, adding individual
files makes no measurable impact but on a Windows repo with ~200K files,
'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/fc4830b545
    Checkout: git fetch https://github.com/benpeart/git add-preload-index-v1 && git checkout fc4830b545

 builtin/add.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)


base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2

Comments

Junio C Hamano Nov. 2, 2018, 3:23 p.m. UTC | #1
Ben Peart <peartben@gmail.com> writes:

> From: Ben Peart <benpeart@microsoft.com>
>
> During an "add", a call is made to run_diff_files() which calls
> check_remove() for each index-entry.  The preload_index() code
> distributes some of the costs across multiple threads.

Nice.  I peeked around and noticed that we already do this in
builtin_diff_index() before running run_diff_index() when !cached,
and builtin_diff_files(), of course.

> Because the files checked are restricted to pathspec, adding individual
> files makes no measurable impact but on a Windows repo with ~200K files,
> 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.

;-)

> diff --git a/builtin/add.c b/builtin/add.c
> index ad49806ebf..f65c172299 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		return 0;
>  	}
>  
> -	if (read_cache() < 0)
> -		die(_("index file corrupt"));
> -
> -	die_in_unpopulated_submodule(&the_index, prefix);
> -
>  	/*
>  	 * Check the "pathspec '%s' did not match any files" block
>  	 * below before enabling new magic.

It is not explained why this is not a mere s/read_cache/&_preload/
in the log message.  I can see it is because you wanted to make the
pathspec available to preload to further cut down the preloaded
paths, and I do not think it has any unintended (negatie) side
effect to parse the pathspec before populating the in-core index,
but that would have been a good thing to mention in the proposed log
message.

> @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		       PATHSPEC_SYMLINK_LEADING_PATH,
>  		       prefix, argv);
>  
> +	if (read_cache_preload(&pathspec) < 0)
> +		die(_("index file corrupt"));
> +
> +	die_in_unpopulated_submodule(&the_index, prefix);
>  	die_path_inside_submodule(&the_index, &pathspec);
>  
>  	if (add_new_files) {
>
> base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
Duy Nguyen Nov. 2, 2018, 3:49 p.m. UTC | #2
On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote:
>
> From: Ben Peart <benpeart@microsoft.com>
>
> During an "add", a call is made to run_diff_files() which calls
> check_remove() for each index-entry.  The preload_index() code distributes
> some of the costs across multiple threads.

Instead of doing this site by site. How about we make read_cache()
always do multithread preload?

The only downside I see is preload may actually harm when there are
too few cache entries (but more than 500), but this needs to be
verified. If the penalty is small enough, I think we could live with
it since everything is fast when you have few entries anyway.

But if that's not true, we could add a threshold to activate preload.
Something like "if you have 50k files or more, then activate preload"
would do. I notice THREAD_COST in preload code, but I don't think it's
the same thing.

>
> Because the files checked are restricted to pathspec, adding individual
> files makes no measurable impact but on a Windows repo with ~200K files,
> 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.
>
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> ---
>
> Notes:
>     Base Ref: master
>     Web-Diff: https://github.com/benpeart/git/commit/fc4830b545
>     Checkout: git fetch https://github.com/benpeart/git add-preload-index-v1 && git checkout fc4830b545
>
>  builtin/add.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index ad49806ebf..f65c172299 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>                 return 0;
>         }
>
> -       if (read_cache() < 0)
> -               die(_("index file corrupt"));
> -
> -       die_in_unpopulated_submodule(&the_index, prefix);
> -
>         /*
>          * Check the "pathspec '%s' did not match any files" block
>          * below before enabling new magic.
> @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>                        PATHSPEC_SYMLINK_LEADING_PATH,
>                        prefix, argv);
>
> +       if (read_cache_preload(&pathspec) < 0)
> +               die(_("index file corrupt"));
> +
> +       die_in_unpopulated_submodule(&the_index, prefix);
>         die_path_inside_submodule(&the_index, &pathspec);
>
>         if (add_new_files) {
>
> base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
> --
> 2.18.0.windows.1
>
Ben Peart Nov. 2, 2018, 4:14 p.m. UTC | #3
On 11/2/2018 11:23 AM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> From: Ben Peart <benpeart@microsoft.com>
>>
>> During an "add", a call is made to run_diff_files() which calls
>> check_remove() for each index-entry.  The preload_index() code
>> distributes some of the costs across multiple threads.
> 
> Nice.  I peeked around and noticed that we already do this in
> builtin_diff_index() before running run_diff_index() when !cached,
> and builtin_diff_files(), of course.
> 

Thanks for double checking!

>> Because the files checked are restricted to pathspec, adding individual
>> files makes no measurable impact but on a Windows repo with ~200K files,
>> 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.
> 
> ;-)
> 
>> diff --git a/builtin/add.c b/builtin/add.c
>> index ad49806ebf..f65c172299 100644
>> --- a/builtin/add.c
>> +++ b/builtin/add.c
>> @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>>   		return 0;
>>   	}
>>   
>> -	if (read_cache() < 0)
>> -		die(_("index file corrupt"));
>> -
>> -	die_in_unpopulated_submodule(&the_index, prefix);
>> -
>>   	/*
>>   	 * Check the "pathspec '%s' did not match any files" block
>>   	 * below before enabling new magic.
> 
> It is not explained why this is not a mere s/read_cache/&_preload/
> in the log message.  I can see it is because you wanted to make the
> pathspec available to preload to further cut down the preloaded
> paths, and I do not think it has any unintended (negatie) side
> effect to parse the pathspec before populating the in-core index,
> but that would have been a good thing to mention in the proposed log
> message.
> 

That is correct.  parse_pathspec() was after read_cache() because it 
_used_ to use the index to determine whether a pathspec is in a 
submodule.  That was fixed by Brandon in Aug 2017 when he cleaned up all 
submodule config code so it is safe to move read_cache_preload() after 
the call to parse_pathspec().

How about this for a revised commit message?



During an "add", a call is made to run_diff_files() which calls
check_remove() for each index-entry.  The preload_index() code 
distributes some of the costs across multiple threads.

Limit read_cache_preload() to pathspec, so that it doesn't process more 
entries than are needed and end up slowing things down instead of 
speeding them up.

Because the files checked are restricted to pathspec, adding individual
files makes no measurable impact but on a Windows repo with ~200K files,
'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings.



>> @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>>   		       PATHSPEC_SYMLINK_LEADING_PATH,
>>   		       prefix, argv);
>>   
>> +	if (read_cache_preload(&pathspec) < 0)
>> +		die(_("index file corrupt"));
>> +
>> +	die_in_unpopulated_submodule(&the_index, prefix);
>>   	die_path_inside_submodule(&the_index, &pathspec);
>>   
>>   	if (add_new_files) {
>>
>> base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
Junio C Hamano Nov. 3, 2018, 12:38 a.m. UTC | #4
Duy Nguyen <pclouds@gmail.com> writes:

> On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote:
>>
>> From: Ben Peart <benpeart@microsoft.com>
>>
>> During an "add", a call is made to run_diff_files() which calls
>> check_remove() for each index-entry.  The preload_index() code distributes
>> some of the costs across multiple threads.
>
> Instead of doing this site by site. How about we make read_cache()
> always do multithread preload?

I suspect that it would be a huge performance killer. 

Many codepaths do not even want to know if the working tree files
have been modified, even though they need to know what's in the
index.  Think "git commit-tree", "git diff --cached", etc.
Duy Nguyen Nov. 3, 2018, 4:47 a.m. UTC | #5
On Sat, Nov 3, 2018 at 1:38 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Duy Nguyen <pclouds@gmail.com> writes:
>
> > On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote:
> >>
> >> From: Ben Peart <benpeart@microsoft.com>
> >>
> >> During an "add", a call is made to run_diff_files() which calls
> >> check_remove() for each index-entry.  The preload_index() code distributes
> >> some of the costs across multiple threads.
> >
> > Instead of doing this site by site. How about we make read_cache()
> > always do multithread preload?
>
> I suspect that it would be a huge performance killer.
>
> Many codepaths do not even want to know if the working tree files
> have been modified, even though they need to know what's in the
> index.  Think "git commit-tree", "git diff --cached", etc.

Ah. I keep forgetting read_cache_preload is loading the index _and_
refreshing. I thought the two had some different semantics but failed
to see it last time.
diff mbox series

Patch

diff --git a/builtin/add.c b/builtin/add.c
index ad49806ebf..f65c172299 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -445,11 +445,6 @@  int cmd_add(int argc, const char **argv, const char *prefix)
 		return 0;
 	}
 
-	if (read_cache() < 0)
-		die(_("index file corrupt"));
-
-	die_in_unpopulated_submodule(&the_index, prefix);
-
 	/*
 	 * Check the "pathspec '%s' did not match any files" block
 	 * below before enabling new magic.
@@ -459,6 +454,10 @@  int cmd_add(int argc, const char **argv, const char *prefix)
 		       PATHSPEC_SYMLINK_LEADING_PATH,
 		       prefix, argv);
 
+	if (read_cache_preload(&pathspec) < 0)
+		die(_("index file corrupt"));
+
+	die_in_unpopulated_submodule(&the_index, prefix);
 	die_path_inside_submodule(&the_index, &pathspec);
 
 	if (add_new_files) {