[1/4] fsmonitor: use fsmonitor data in `git diff`

Message-Id: <13fd992a375e30e8c7b0953a128e149951dee0ea.1602968677.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.756.git.1602968677.gitgitgadget@gmail.com>
Date: Sat, 17 Oct 2020 21:04:33 +0000
Subject: [PATCH 1/4] fsmonitor: use fsmonitor data in `git diff`
From: Alex Vandiver <alexmv@dropbox.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <stolee@gmail.com>, Utsav Shah <utsav@dropbox.com>, Nipunn Koorapati <nipunn1313@gmail.com>, Alex Vandiver <alexmv@dropbox.com>
State: New, archived
Series: use fsmonitor data in git diff eliminating O(num_files) calls to lstat
  [0/4] use fsmonitor data in git diff eliminating O(num_files) calls to lstat
  [1/4] fsmonitor: use fsmonitor data in `git diff`
  [2/4] t/perf/README: elaborate on output format
  [3/4] t/perf/p7519-fsmonitor.sh: warm cache on first git status
  [4/4] t/perf: add fsmonitor perf test for git diff

Alex Vandiver Oct. 17, 2020, 9:04 p.m. UTC

From: Alex Vandiver <alexmv@dropbox.com>

With fsmonitor enabled, the first call to match_stat_with_submodule
calls refresh_fsmonitor, incurring the overhead of reading the list of
updated files -- but run_diff_files does not respect the
CE_FSMONITOR_VALID flag.

Make use of the fsmonitor extension to skip lstat() calls on files
that fsmonitor judged as unmodified.

Notably, this change improves performance of the git shell prompt when
GIT_PS1_SHOWDIRTYSTATE is set.

Signed-off-by: Alex Vandiver <alexmv@dropbox.com>
Signed-off-by: Nipunn Koorapati <nipunn@dropbox.com>
---
 diff-lib.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Junio C Hamano Oct. 17, 2020, 10:25 p.m. UTC | #1

"Alex Vandiver via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Alex Vandiver <alexmv@dropbox.com>
>
> With fsmonitor enabled, the first call to match_stat_with_submodule
> calls refresh_fsmonitor, incurring the overhead of reading the list of
> updated files -- but run_diff_files does not respect the
> CE_FSMONITOR_VALID flag.

run_diff_files() is used not just by "git diff" but by other things
like "git add", so if we get an overall speed-up without having to
pay undue cost, that would be very good news.

> diff --git a/diff-lib.c b/diff-lib.c
> index f95c6de75f..b7ee1b89ef 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -97,6 +97,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  
>  	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
>  
> +	refresh_fsmonitor(istate);
> +

"git diff" and friends are often run with pathspec, but the API into
the fsmonitor, refresh_fsmonitor() call, has no way to say "I only
am interested in the status of this directory and everything else
does not matter".  How expensive would this call to accept fsmonitor
data for the entire tree be, and would there eventually be a point
where the number of paths we are interested in checking (i.e. the
paths that would match the pathspec) is so small that we would be
better off not making this call?  E.g. if we are checking more than
20% of the working tree, running refresh_fsmonitor() for the entire
working tree is still a win, but if we are only checking less than
that, we are better off without fsmonitor, or does a tradeoff like
that exist?

> @@ -197,8 +199,19 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  		if (ce_uptodate(ce) || ce_skip_worktree(ce))
>  			continue;
>  
> -		/* If CE_VALID is set, don't look at workdir for file removal */
> -		if (ce->ce_flags & CE_VALID) {
> +		/*
> +		 * If CE_VALID is set, the user has promised us that the workdir
> +		 * hasn't changed compared to index, so don't stat workdir
> +		 * for file removal

The above seems to be an attempt to elaborate on the existing
comment, but ...

> +		 *  eg - via git udpate-index --assume-unchanged
> +		 *  eg - via core.ignorestat=true

... what are these two lines doing here?  It makes no sense to say
"Don't stat workdir for file removal by doing 'git update-index' or
by setting core.ignorestat", but the placement of these two lines
makes it look as if that is what you are saying.  Perhaps

	When CE_VALID is set (via "update-index --assume-unchanged"
	or via adding paths while core.ignorestat is set to true),
	the user has promised ..., so don't stat workdir for removed
	files.

would probably be what you meant to say.

> +		 * When using FSMONITOR:
> +		 * If CE_FSMONITOR_VALID is set, then we know the metadata on disk
> +		 * has not changed since the last refresh, and we can skip the
> +		 * file-removal checks without doing the stat in check_removed.

An iffy description.  You skip all the file-removal checks by not
calling check_removed() at all.

This is not the fault of this patch, but in any case, the
description places too much stress on "removal" when in reality,
removal is not all that special in this codepath.  The check_removed
call also contributes to noticing modified (not removed) files.  If
we are updating the comment here, we should correct that too,
perhaps

	When CE_VALID is set (via "update-index --assume-unchanged"
	or via adding paths while core.ignorestat is set to true),
	the user has promised that the working tree file for that
	path will not be modified.  When CE_FSMONITOR_VALID is true,
	the fsmonitor knows that the path hasn't been modified since
	we refreshed the cached stat information.  In either case,
	we do not have to stat to see if the path has been removed
	or modified.

or something like that, perhaps.

> +		 */
> +		if (ce->ce_flags & CE_VALID || ce->ce_flags & CE_FSMONITOR_VALID) {

Would it become easier to read, if written like this instead?

		if (ce->ce_flags & (CE_VALID | CE_FSMONITOR_VALID)) {

That reflects what the suggested comment says better.

>  			changed = 0;
>  			newmode = ce->ce_mode;
>  		} else {

Thanks.

Nipunn Koorapati Oct. 18, 2020, 12:54 a.m. UTC | #2

> run_diff_files() is used not just by "git diff" but by other things
> like "git add", so if we get an overall speed-up without having to
> pay undue cost, that would be very good news.

Agreed! I may be able to write perf benchmark tests to highlight
benefits to git add as well.

> 20% of the working tree, running refresh_fsmonitor() for the entire
> working tree is still a win, but if we are only checking less than
> that, we are better off without fsmonitor, or does a tradeoff like
> that exist?

My understanding is that refresh_fsmonitor is O(delta_since_last_refresh),
so for developers with large repositories this cost will amortize out
over subsequent commands, and I don't think it's worth investigating
this tradeoff here.

As a user of large repositories, I expect my major source of fsmonitor
activity to be user intent (e.g. git pull, or intentionally copying or
editing a large number of files). After such a command, I expect my next
git command to be slower; that would be unsurprising.

I think the tradeoff could be made for small diff requests, but I don't
think it's worth adding complexity here, as that user will just have to
pay the cost on their next git command.

> > +              *  eg - via git udpate-index --assume-unchanged
> > +              *  eg - via core.ignorestat=true
>
> ... what are these two lines doing here?

Intended to indicate potential ways that CE_VALID might be set. When I
was reading the source here, it was pretty difficult to determine how
this would be set. Agree that I picked unfortunate wording.
Thanks for the suggestions. Will update in the next iteration.

>
> would probably be what you meant to say.
>
>         When CE_VALID is set (via "update-index --assume-unchanged"
>         or via adding paths while core.ignorestat is set to true),
>         the user has promised that the working tree file for that
>         path will not be modified.  When CE_FSMONITOR_VALID is true,
>         the fsmonitor knows that the path hasn't been modified since
>         we refreshed the cached stat information.  In either case,
>         we do not have to stat to see if the path has been removed
>         or modified.
>
> or something like that, perhaps.

Sounds good. Will clarify. I like your comment better as well.

>
> > +              */
> > +             if (ce->ce_flags & CE_VALID || ce->ce_flags & CE_FSMONITOR_VALID) {
>
> Would it become easier to read, if written like this instead?
>
>                 if (ce->ce_flags & (CE_VALID | CE_FSMONITOR_VALID)) {

I personally find this more confusing because it involves multiple
bitwise ops, but this is potentially due to me having more mental
practice thinking about boolean operators vs bitwise operators.
I'm more than happy to align with the common pattern of the repo. I'll
change this.

>
> Thanks.

Thank you for the thorough review!

Taylor Blau Oct. 18, 2020, 4:17 a.m. UTC | #3

On Sun, Oct 18, 2020 at 01:54:44AM +0100, Nipunn Koorapati wrote:
> > 20% of the working tree, running refresh_fsmonitor() for the entire
> > working tree is still a win, but if we are only checking less than
> > that, we are better off without fsmonitor, or does a tradeoff like
> > that exist?
>
> My understanding is that refresh_fsmonitor is
> O(delta_since_last_refresh) - so for developers
> with large repositories - this cost will amortize out over subsequent
> commands, so I don't
> think it's worth investigating this tradeoff here.
> As a user of large repositories, I expect that my major source of
> fsmonitor activity to be user
> intent (eg git pull, or intentionally copying/editing a large number
> of files). After such a command,
> I expect my next git command to be slower - that would be unsurprising.
>
> I think the tradeoff could be made for small diff requests, but I
> don't think it's worth adding complexity here -
> as that user will just have to pay the cost on their next git command.

Hmm. I do agree that I'd like to stay out of the business of trying to
figure out exactly what that trade-off is (although I'm sure that it
exists), only because it seems likely to vary to a large extent from
repository to repository. (That is, 20% may be a good number for some
repository, but a terrible choice for another).

But, I think that we can invoke watchman better here; the
fsmonitor-watchman hook has no notion of a "pathspec", so every query
just asks for everything that isn't in '$GIT_DIR'. Is there anything
preventing us from taking an optional pathspec and building up a more
targeted query?

There is some overhead to invoke the hook and talk to watchman, but
I'd expect that to be dwarfed by not having to issue O(# files)
syscalls.

> >
> > > +              */
> > > +             if (ce->ce_flags & CE_VALID || ce->ce_flags & CE_FSMONITOR_VALID) {
> >
> > Would it become easier to read, if written like this instead?
> >
> >                 if (ce->ce_flags & (CE_VALID | CE_FSMONITOR_VALID)) {
>
> I personally find this more confusing because it involves multiple
> bitwise ops, but this
> is potentially due to me having more mental practice thinking about
> boolean operators vs bitwise operators.
> I'm more than happy to align with the common pattern of the repo. I'll
> change this.

I don't have an opinion, nor do I think that git.git has an established
practice of doing one over the other. For what it's worth, my two cents
is that Junio's suggestion is easier to read.

Thanks,
Taylor

Junio C Hamano Oct. 18, 2020, 5:02 a.m. UTC | #4

Taylor Blau <me@ttaylorr.com> writes:

> Hmm. I do agree that I'd like to stay out of the business of trying to
> figure out exactly what that trade-off is (although I'm sure that it
> exists), only because it seems likely to vary to a large extent from
> repository to repository. (That is, 20% may be a good number for some
> repository, but a terrible choice for another).

I think both of you misunderstood me.

My question was a simple yes/no "does a trade-off exist?"
question, and the sentences with 20% in them were merely an example of
a possible trade-off I had in mind that _could_ exist.  I wasn't even
suggesting to figure out what the optimum cut-off heuristics would
be (e.g. solving "when more than N% of paths are subject to diff,
fsmonitor is faster" for N).

I was hoping that we could show that even having to lstat just a
single path is expensive enough---IOW, "there is no trade-off worth
worrying about, because talking to fsmonitor is so cheap compared to
the cost of even a single lstat" would have been a valid and happy
answer.  With such a number, there is no risk of introducing an
unwarranted performance regression to use cases that we did not
anticipate by adding an unconditional call to refresh_fsmonitor().

But without any rationale, the performance implication of adding an
unconditional call to refresh_fsmonitor() would become much muddier.

> But, I think that we can invoke watchman better here; the
> fsmonitor-watchman hook has no notion of a "pathspec", so every query
> just asks for everything that isn't in '$GIT_DIR'. Is there anything
> preventing us from taking an optional pathspec and building up a more
> targeted query?

Yup, it is what I had in mind when I brought up the pathspec.  It
may be something worth pursuing longer term, but not within the
scope of this patch.

> There is some overhead to invoke the hook and talk to watchman, but
> I'd expect that to be dwarfed by not having to issue O(# files)
> syscalls.

"invoke the hook"---is that a pipe+fork+exec, or something else that
is far lighter-weight?

Taylor Blau Oct. 18, 2020, 11:43 p.m. UTC | #5

On Sat, Oct 17, 2020 at 10:02:04PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Hmm. I do agree that I'd like to stay out of the business of trying to
> > figure out exactly what that trade-off is (although I'm sure that it
> > exists), only because it seems likely to vary to a large extent from
> > repository to repository. (That is, 20% may be a good number for some
> > repository, but a terrible choice for another).
>
> I think both of you misunderstood me.
>
> My question was a simple yes/no "does a trade-off exist?"
> question, and the sentences with 20% in them were merely an example of
> a possible trade-off I had in mind that _could_ exist.  I wasn't even
> suggesting to figure out what the optimum cut-off heuristics would
> be (e.g. solving "when more than N% of paths are subject to diff,
> fsmonitor is faster" for N).
>
> I was hoping that we could show that even having to lstat just a
> single path is expensive enough---IOW, "there is no trade-off worth
> worrying about, because talking to fsmonitor is so cheap compared to
> the cost of even a single lstat" would have been a valid and happy
> answer.  With such a number, there is no risk of introducing an
> unwarranted performance regression to use cases that we did not
> anticipate by adding an unconditional call to refresh_fsmonitor().
>
> But without any rationale, the performance implication of adding an
> unconditional call to refresh_fsmonitor() would become much muddier.

Aha; thanks for clarifying. I'm glad we agree that finding 'N' would not
be worth it, or at least that showing that talking to fsmonitor is
cheaper than a single lstat would be more worthwhile.

Nipunn - I don't have fsmonitor/watchman set up on my workstation, but if
you do, some numbers (or an interpretation of the numbers you already
provided) on this would be really useful. If you don't have it set up,
or don't have time to measure it, let me know, and I'd be happy to take
a look.

> > But, I think that we can invoke watchman better here; the
> > fsmonitor-watchman hook has no notion of a "pathspec", so every query
> > just asks for everything that isn't in '$GIT_DIR'. Is there anything
> > preventing us from taking an optional pathspec and building up a more
> > targeted query?
>
> Yup, it is what I had in mind when I brought up the pathspec.  It
> may be something worth pursuing longer term, but not within the
> scope of this patch.
>
> > There is some overhead to invoke the hook and talk to watchman, but
> > I'd expect that to be dwarfed by not having to issue O(# files)
> > syscalls.
>
> "invoke the hook"---is that a pipe+fork+exec, or something else that
> is far lighter-weight?

The former; see 'fsmonitor.c:query_fsmonitor()'.

Thanks,
Taylor

Junio C Hamano Oct. 19, 2020, 5:23 p.m. UTC | #6

Taylor Blau <me@ttaylorr.com> writes:

>> > There is some overhead to invoke the hook and talk to watchman, but
>> > I'd expect that to be dwarfed by not having to issue O(# files)
>> > syscalls.
>>
>> "invoke the hook"---is that a pipe+fork+exec, or something else that
>> is far lighter-weight?
>
> The former; see 'fsmonitor.c:query_fsmonitor()'.

It brings us back to the question of "how many lstat(2) calls does it
take to match the overhead of a single pipe+fork+exec plus reading
from the pipe", doesn't it?

Taylor Blau Oct. 19, 2020, 5:37 p.m. UTC | #7

On Mon, Oct 19, 2020 at 10:23:26AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >> > There is some overhead to invoke the hook and talk to watchman, but
> >> > I'd expect that to be dwarfed by not having to issue O(# files)
> >> > syscalls.
> >>
> >> "invoke the hook"---is that a pipe+fork+exec, or something else that
> >> is far lighter-weight?
> >
> > The former; see 'fsmonitor.c:query_fsmonitor()'.
>
> It brings us back to the "overhead of how many lstat(2) takes us
> closer to the overhead of a single pipe+fork+exec plus reading from
> the pipe", doesn't it?

Somewhat unfortunately, yes. Hopefully any user that cares to use
fsmonitor has enough files in their repository that a pipe+fork+exec is
still faster than however many lstats they would have needed otherwise.

Of course, finding out what that number is is still interesting...

Thanks,
Taylor

Nipunn Koorapati Oct. 19, 2020, 6:07 p.m. UTC | #8

> It brings us back to the "overhead of how many lstat(2) takes us
> closer to the overhead of a single pipe+fork+exec plus reading from
> the pipe", doesn't it?
>

I will add a benchmark for `git diff -- <pathspec>`.

> Somewhat unfortunately, yes. Hopefully any user that cares to use
> fsmonitor has enough files in their repository that a pipe+fork+exec is
> still faster than however many lstats they would have needed otherwise.
>
> Of course, finding out what that number is is still interesting...

I can try to do some manual testing to figure this out. It doesn't seem like
the type of thing we'd want to add to the benchmark, as it would involve
running git diff on a variety of pathspec workloads.

--Nipunn
