diff mbox series

[v1,1/2] reset: don't compute unstaged changes after reset when --quiet

Message ID 20181017164021.15204-2-peartben@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v1,1/2] reset: don't compute unstaged changes after reset when --quiet | expand

Commit Message

Ben Peart Oct. 17, 2018, 4:40 p.m. UTC
From: Ben Peart <benpeart@microsoft.com>

When git reset is run with the --quiet flag, don't bother finding any
additional unstaged changes as they won't be output anyway.  This speeds up
the git reset command by avoiding having to lstat() every file looking for
changes that aren't going to be reported anyway.

The savings can be significant.  In a repo with 200K files "git reset"
drops from 7.16 seconds to 0.32 seconds for a savings of 96%.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---
 Documentation/git-reset.txt | 4 +++-
 builtin/reset.c             | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

Eric Sunshine Oct. 17, 2018, 6:14 p.m. UTC | #1
On Wed, Oct 17, 2018 at 12:40 PM Ben Peart <peartben@gmail.com> wrote:
> When git reset is run with the --quiet flag, don't bother finding any
> additional unstaged changes as they won't be output anyway.  This speeds up
> the git reset command by avoiding having to lstat() every file looking for
> changes that aren't going to be reported anyway.
>
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> ---
> diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt
> @@ -95,7 +95,9 @@ OPTIONS
>  --quiet::
> -       Be quiet, only report errors.
> +       Be quiet, only report errors.  Can optimize the performance of reset
> +       by avoiding scaning all files in the repo looking for additional
> +       unstaged changes.

s/scaning/scanning/

However, I'm not convinced that this should be documented here or at
least in this fashion. When I read this new documentation before
reading the commit message, I was baffled by what it was trying to say
since --quiet'ness is a superficial quality, not an optimizer. My
knee-jerk reaction is that it doesn't belong in end-user documentation
at all since it's an implementation detail, however, I can see that
such knowledge could be handy for people in situations which would be
helped by this. That said, if you do document it, this doesn't feel
like the correct place to do so; it should be in a "Discussion"
section or something. (Who would expect to find --quiet documentation
talking about optimizations? Likely, nobody.)
Jeff King Oct. 17, 2018, 6:22 p.m. UTC | #2
On Wed, Oct 17, 2018 at 02:14:32PM -0400, Eric Sunshine wrote:

> > diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt
> > @@ -95,7 +95,9 @@ OPTIONS
> >  --quiet::
> > -       Be quiet, only report errors.
> > +       Be quiet, only report errors.  Can optimize the performance of reset
> > +       by avoiding scaning all files in the repo looking for additional
> > +       unstaged changes.
> 
> s/scaning/scanning/
> 
> However, I'm not convinced that this should be documented here or at
> least in this fashion. When I read this new documentation before
> reading the commit message, I was baffled by what it was trying to say
> since --quiet'ness is a superficial quality, not an optimizer. My
> knee-jerk reaction is that it doesn't belong in end-user documentation
> at all since it's an implementation detail, however, I can see that
> such knowledge could be handy for people in situations which would be
> helped by this. That said, if you do document it, this doesn't feel
> like the correct place to do so; it should be in a "Discussion"
> section or something. (Who would expect to find --quiet documentation
> talking about optimizations? Likely, nobody.)

Yeah, I had the same thought. You'd probably choose --quiet because you
want it, you know, quiet.

Whereas for the new config variable, you'd probably set it not because
you want it quiet all the time, but because you want to get some time
savings. So there it does make sense to me to explain.

Other than that, this seems like an obvious and easy win. It does feel a
little hacky (you're really losing something in the output, and ideally
we'd just be able to give that answer quickly), but this may be OK as a
hack in the interim.

The sad thing is just that it's user-facing, so we have to respect it
forever. I almost wonder if there should be a global core.optimizeMessages
or something that tries to tradeoff less information for speed in all
commands, but makes no promises about which. Then a user with a big repo
who sets it once will get the benefit as more areas are identified (I
think "status" already has a similar case with ahead/behind)? And vice
versa, as some messages get faster to produce, they can be dropped from
that option.

I dunno. Maybe that is a stupid idea, and people really do want to
control it on a per-message basis.

-Peff
Junio C Hamano Oct. 18, 2018, 3:40 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

> Whereas for the new config variable, you'd probably set it not because
> you want it quiet all the time, but because you want to get some time
> savings. So there it does make sense to me to explain.
>
> Other than that, this seems like an obvious and easy win. It does feel a
> little hacky (you're really losing something in the output, and ideally
> we'd just be able to give that answer quickly), but this may be OK as a
> hack in the interim.

After "git reset --quiet -- this/area/" with this change, any
operation you'd do next that needs to learn if working tree files
are different from what is recorded in the index outside that area
will have to spend more cycles, because the refresh done by "reset"
is now limited to the area.  So if your final goal is "make 'reset'
as fast as possible", this is an obvious and easy win.  For other
goals, i.e. "make the overall experience of using Git feel faster",
it is not so obvious to me, though.

If we somehow know that it is much less important in your setup that
the cached stat bits in the index is kept up to date (e.g. perhaps
you are more heavily relying on fsmonitor and are happy with it),
then I suspect that we could even skip the refreshing altogether and
gain more performance, without sacrificing the "overall experience
of using Git" at all, which would be even better.

> The sad thing is just that it's user-facing, so we have to respect it
> forever. I almost wonder if there should be a global core.optimizeMessages
> or something that tries to tradeoff less information for speed in all
> commands, but makes no promises about which. Then a user with a big repo
> who sets it once will get the benefit as more areas are identified (I
> think "status" already has a similar case with ahead/behind)? And vice
> versa, as some messages get faster to produce, they can be dropped from
> that option.
>
> I dunno. Maybe that is a stupid idea, and people really do want to
> control it on a per-message basis.
>
> -Peff
Jeff King Oct. 18, 2018, 6:36 a.m. UTC | #4
On Thu, Oct 18, 2018 at 12:40:48PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Whereas for the new config variable, you'd probably set it not because
> > you want it quiet all the time, but because you want to get some time
> > savings. So there it does make sense to me to explain.
> >
> > Other than that, this seems like an obvious and easy win. It does feel a
> > little hacky (you're really losing something in the output, and ideally
> > we'd just be able to give that answer quickly), but this may be OK as a
> > hack in the interim.
> 
> After "git reset --quiet -- this/area/" with this change, any
> operation you'd do next that needs to learn if working tree files
> are different from what is recorded in the index outside that area
> will have to spend more cycles, because the refresh done by "reset"
> is now limited to the area.  So if your final goal is "make 'reset'
> as fast as possible", this is an obvious and easy win.  For other
> goals, i.e. "make the overall experience of using Git feel faster",
> it is not so obvious to me, though.
> 
> If we somehow know that it is much less important in your setup that
> the cached stat bits in the index is kept up to date (e.g. perhaps
> you are more heavily relying on fsmonitor and are happy with it),
> then I suspect that we could even skip the refreshing altogether and
> gain more performance, without sacrificing the "overall experience
> of using Git" at all, which would be even better.

Yeah, I assumed that Ben was using fsmonitor. I agree if we can just use
that to make this output faster, that would be the ideal. This is the
"later the message would get faster to produce" I hinted at in my
earlier message.

So I think we are in agreement. It just isn't clear to me how much work
it would take to get to the "ideal". If it's long enough, then this kind
of hackery may be useful in the meantime.

-Peff
Ben Peart Oct. 18, 2018, 6:15 p.m. UTC | #5
On 10/18/2018 2:36 AM, Jeff King wrote:
> On Thu, Oct 18, 2018 at 12:40:48PM +0900, Junio C Hamano wrote:
> 
>> Jeff King <peff@peff.net> writes:
>>
>>> Whereas for the new config variable, you'd probably set it not because
>>> you want it quiet all the time, but because you want to get some time
>>> savings. So there it does make sense to me to explain.
>>>
>>> Other than that, this seems like an obvious and easy win. It does feel a
>>> little hacky (you're really losing something in the output, and ideally
>>> we'd just be able to give that answer quickly), but this may be OK as a
>>> hack in the interim.
>>
>> After "git reset --quiet -- this/area/" with this change, any
>> operation you'd do next that needs to learn if working tree files
>> are different from what is recorded in the index outside that area
>> will have to spend more cycles, because the refresh done by "reset"
>> is now limited to the area.  So if your final goal is "make 'reset'
>> as fast as possible", this is an obvious and easy win.  For other
>> goals, i.e. "make the overall experience of using Git feel faster",
>> it is not so obvious to me, though.

The final goal is to make git faster (especially on larger repos) and 
this proposal accomplishes that.  Let's look at why that is.

By scoping down (or eliminating) what refresh_index() has to lstat() at 
the end of the reset command, clearly the reset command is faster.  Yes, 
the index isn't as "fresh" because not everything was updated but that 
doesn't typically impact the performance of subsequent commands.

On the next command, git still has to lstat() every file because it 
isn't sure what changes could have happened in the file system.  As a 
result, the overall impact is that we have had to lstat() every file one 
fewer times between the two commands.  A net win overall.

In addition, the preload_index() code that does the lstat() command is 
highly optimized across multiple threads (and on Windows takes advantage 
of the fscache).  This means that it can lstat() every file _much_ 
faster than the single threaded loop in refresh_index().  This also 
makes the overall performance of the pair of git commands faster as we 
got rid of the slow lstat() loop and kept the fast one.

Here are some numbers to demonstrate that.  These are hot cache numbers 
as they are easier to generate.  Cold cache numbers make the net perf 
win significantly better as the cost for the reset jumps from 2.43 
seconds to 7.16 seconds.

0.32 git add asdf
0.31 git -c reset.quiet=true reset asdf
1.34 git status
1.97 Total


0.32 git add asdf
2.43 git -c reset.quiet=false reset asdf
1.32 git status
4.07 Total

Note the status command after the reset doesn't really change as it 
still must lstat() every file (the 0.02 difference is well within the 
variability of run to run differences).

FWIW, none of these numbers are using fsmonitor.



There was additional discussion about whether this should be tied to the 
"--quiet" option and how it should be documented.

One option would be to change the default behavior of reset so that it 
doesn't do the refresh_index() call at all.  This speeds up reset by 
default so there are no user discoverability issues but changes the 
default behavior which is an issue.

Another option that was suggested was to add a separate flag that could 
be passed to reset so that the "quiet" and "fast" options don't get 
conflated.  I don't care for that option because the two options (at 
this point and for the foreseeable future) would be identical in 
behavior from the end users perspective.

It was also suggested that there be a single "fast and quiet" option for 
all of git instead of separate options for each command.  I worry about 
that because now we're forcing users to choose between the "fast and 
quiet" version of git and the "slow and noisy" version.  How do we help 
them decide which they want?  That seems difficult to explain so that 
they can make a rational choice and also hard to discover.  I also have 
to wonder who would say "give me the slow and noisy version please." :)

I'd prefer we systematically move towards a model where the default 
values that are chosen for various settings throughout the code are all 
configurable via settings.  All defaults by necessity make certain 
assumptions about user preference, data shape, machine performance, etc 
and if those assumptions don't match the user's environment then the 
hard coded defaults aren't appropriate.  We do our best but its going to 
be hit or miss.

A consistent way to be able to change those defaults would be very 
useful in those circumstances.  To be clear, I'm not proposing we do a 
wholesale update of our preferences model at this point in time - that 
seems like a significant undertaking and I don't want to tie this 
specific optimization to a potential change in how default settings work.


To move this forward, here is what I propose:

1) If the '--quiet' flag is passed, we silently take advantage of the 
fact we can avoid having to do an "extra" lstat() of every file and 
scope the refresh_index() call to those paths that we know have changed.

2) I can remove the note in the documentation of --quiet which I only 
added to facilitate discoverability.

3) I can also edit the documentation for reset.quietDefault (maybe I 
should rename that to "reset.quiet"?) so that it does not discuss the 
potential performance impact.

4) To improve the discoverability of the enhanced performance, I could 
add logic similar to what exists for "status --uno" and if 
refresh_index() takes > x seconds, prompt the user with something like:

"It took %.2f seconds to enumerate unstaged changes after reset. 'reset 
--quiet' may speed it up. Set the config setting reset.quiet to true to 
make this the default."


>>
>> If we somehow know that it is much less important in your setup that
>> the cached stat bits in the index is kept up to date (e.g. perhaps
>> you are more heavily relying on fsmonitor and are happy with it),
>> then I suspect that we could even skip the refreshing altogether and
>> gain more performance, without sacrificing the "overall experience
>> of using Git" at all, which would be even better.
> 
> Yeah, I assumed that Ben was using fsmonitor. I agree if we can just use
> that to make this output faster, that would be the ideal. This is the
> "later the message would get faster to produce" I hinted at in my
> earlier message.
> 
> So I think we are in agreement. It just isn't clear to me how much work
> it would take to get to the "ideal". If it's long enough, then this kind
> of hackery may be useful in the meantime.
> 

I actually started my effort to speed up reset by attempting to 
multi-thread refresh_index().  You can see a work in progress at:

https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs

The patch doesn't always work as it is still not thread safe.  When it 
works, it's great but I ran into to many difficulties trying to debug 
the remaining threading issues (even adding print statements would 
change the timing and the repro would disappear).  It will take a lot of 
code review to discover and fix the remaining non-thread safe code paths.

In addition, the optimized code path that takes advantage of fsmonitor, 
uses multiple threads, fscache, etc _already exists_ in preload_index(). 
  Trying to recreate all those optimizations in refresh_index() is (as I 
discovered) a daunting task.

This patch was tiny/trivial in comparison and provided all the 
performance benefits so seems like a much better option at this point in 
time.  For now, I suggest we just use that existing path as it provides 
the benefits without the significant additional work and complexity.

> -Peff
>
Duy Nguyen Oct. 18, 2018, 6:26 p.m. UTC | #6
On Thu, Oct 18, 2018 at 8:18 PM Ben Peart <peartben@gmail.com> wrote:
> I actually started my effort to speed up reset by attempting to
> multi-thread refresh_index().  You can see a work in progress at:
>
> https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs
>
> The patch doesn't always work as it is still not thread safe.  When it
> works, it's great but I ran into to many difficulties trying to debug
> the remaining threading issues (even adding print statements would
> change the timing and the repro would disappear).  It will take a lot of
> code review to discover and fix the remaining non-thread safe code paths.
>
> In addition, the optimized code path that takes advantage of fsmonitor,
> uses multiple threads, fscache, etc _already exists_ in preload_index().
>   Trying to recreate all those optimizations in refresh_index() is (as I
> discovered) a daunting task.

Why not make refresh_index() run preload_index() first (or the
parallel lstat part to be precise), and only do the heavy
content-based refresh in single thread mode?
Ben Peart Oct. 18, 2018, 7:03 p.m. UTC | #7
On 10/18/2018 2:26 PM, Duy Nguyen wrote:
> On Thu, Oct 18, 2018 at 8:18 PM Ben Peart <peartben@gmail.com> wrote:
>> I actually started my effort to speed up reset by attempting to
>> multi-thread refresh_index().  You can see a work in progress at:
>>
>> https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs
>>
>> The patch doesn't always work as it is still not thread safe.  When it
>> works, it's great but I ran into to many difficulties trying to debug
>> the remaining threading issues (even adding print statements would
>> change the timing and the repro would disappear).  It will take a lot of
>> code review to discover and fix the remaining non-thread safe code paths.
>>
>> In addition, the optimized code path that takes advantage of fsmonitor,
>> uses multiple threads, fscache, etc _already exists_ in preload_index().
>>    Trying to recreate all those optimizations in refresh_index() is (as I
>> discovered) a daunting task.
> 
> Why not make refresh_index() run preload_index() first (or the
> parallel lstat part to be precise), and only do the heavy
> content-based refresh in single thread mode?
> 

Head smack! Why didn't I think of that?

That is a terrific suggestion.  Calling preload_index() right before the 
big for loop in refresh_index() is a trivial and effective way to do the 
bulk of the updating with the optimized code.  After doing that, most of 
the cache entries can bail out quickly down in refresh_cache_ent() when 
it tests ce_uptodate(ce).

Here are the numbers using that optimization (hot cache, averaged across 
3 runs):

0.32 git add asdf
1.67 git reset asdf
1.68 git status
3.67 Total

vs without it:

0.32 git add asdf
2.48 git reset asdf
1.50 git status
4.30 Total

For a savings in the reset command of 32% and 15% overall.

Clearly doing the refresh_index() faster is not as much savings as not 
doing it at all.  Given how simple this patch is, I think it makes sense 
to do both so that we have optimized each path to is fullest.
Junio C Hamano Oct. 19, 2018, 12:34 a.m. UTC | #8
Ben Peart <peartben@gmail.com> writes:

> Note the status command after the reset doesn't really change as it
> still must lstat() every file (the 0.02 difference is well within the
> variability of run to run differences).

Of course, it would not make an iota of difference, whether reset
refreshes the cached stat index fully, to the cost of later lstat().
What the refreshing saves is having to scan the contents to find that
the file is unchanged at runtime.

If your lstat() is not significantly faster than opening and
scanning the file, the optimization based on the cached-stat
information becomes moot.  In a working tree full of unmodified
files, stale cached-stat info in the index will cause us to compare
the contents and waste a lot of time, and that is what refreshing
avoids.  If the "status" in your test sequence do not have to do
that (e.g. the cached-stat information is already up-to-date and
there is no point running refresh in reset), then I'd expect no
difference between these two tests.

> To move this forward, here is what I propose:
>
> 1) If the '--quiet' flag is passed, we silently take advantage of the
> fact we can avoid having to do an "extra" lstat() of every file and
> scope the refresh_index() call to those paths that we know have
> changed.

That's pretty much what the patch under discussion does.

> 2) I can remove the note in the documentation of --quiet which I only
> added to facilitate discoverability.

Quite honestly, I am not sure if this (meaning #1 above) alone need
to be even discoverable.  Those who want --quiet output would use
it, those who want to be told which paths are modified would not,
and those who want to quickly be told which paths are modified would
not be helped by the limited refresh anyway, so "with --quiet you
can make it go faster" would not help anybody.

> 3) I can also edit the documentation for reset.quietDefault (maybe I
> should rename that to "reset.quiet"?) so that it does not discuss the
> potential performance impact.

I think reset.quiet (or reset.verbosity) is a good thing to have
regardless.
diff mbox series

Patch

diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt
index 1d697d9962..8610309b55 100644
--- a/Documentation/git-reset.txt
+++ b/Documentation/git-reset.txt
@@ -95,7 +95,9 @@  OPTIONS
 
 -q::
 --quiet::
-	Be quiet, only report errors.
+	Be quiet, only report errors.  Can optimize the performance of reset
+	by avoiding scaning all files in the repo looking for additional
+	unstaged changes.
 
 
 EXAMPLES
diff --git a/builtin/reset.c b/builtin/reset.c
index 11cd0dcb8c..0822798616 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -376,7 +376,7 @@  int cmd_reset(int argc, const char **argv, const char *prefix)
 			if (read_from_tree(&pathspec, &oid, intent_to_add))
 				return 1;
 			if (get_git_work_tree())
-				refresh_index(&the_index, flags, NULL, NULL,
+				refresh_index(&the_index, flags, quiet ? &pathspec : NULL, NULL,
 					      _("Unstaged changes after reset:"));
 		} else {
 			int err = reset_index(&oid, reset_type, quiet);