index: add trace2 region for clear skip worktree

Message ID: pull.1368.git.git.1666742722502.gitgitgadget@gmail.com
State: Superseded
Series: index: add trace2 region for clear skip worktree

Commit Message

Anh Le Oct. 26, 2022, 12:05 a.m. UTC
From: Anh Le <anh@canva.com>

In a large repository using sparse checkout, checking
whether a file marked with skip worktree is present
on disk and its skip worktree bit should be cleared
can take a considerable amount of time. Add a trace2
region to surface this information.

Signed-off-by: Anh Le <anh@canva.com>
---
    index: add trace2 region for clear skip worktree
    
    In a large repository using sparse checkout, checking whether a file
    marked with skip worktree is present on disk and its skip worktree bit
    should be cleared can take a considerable amount of time. Add a trace2
    region to surface this information.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1368%2FHaizzz%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1368/Haizzz/master-v1
Pull-Request: https://github.com/git/git/pull/1368

 sparse-index.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)


base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886

Comments

Junio C Hamano Oct. 26, 2022, 3:16 a.m. UTC | #1
"Anh Le via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Anh Le <anh@canva.com>
>
> In a large repository using sparse checkout, checking
> whether a file marked with skip worktree is present
> on disk and its skip worktree bit should be cleared
> can take a considerable amount of time. Add a trace2
> region to surface this information.
>
> Signed-off-by: Anh Le <anh@canva.com>
> ---
>     index: add trace2 region for clear skip worktree
>     
>     In a large repository using sparse checkout, checking whether a file
>     marked with skip worktree is present on disk and its skip worktree bit
>     should be cleared can take a considerable amount of time. Add a trace2
>     region to surface this information.

It is easy to see that the change is a no-op from a functionality
standpoint.  The condition under which ce->ce_flags loses the
CE_SKIP_WORKTREE bit is the same as before, and the only change is
that in the iteration a couple of variables are incremented, which
may (or may not) have a performance impact, but shouldn't break
correctness.

I am not sure about the value of these counters, honestly.

If we hit a sparse dir just once, we fully flatten the index and go
back to the restart state, and it would be a bug if we still see a
sparsified directory in the in-core index after that point, so
restart_count would be at most one, wouldn't it?  Why do we even
need to count with intmax_t?

Similarly, path_count is bounded by istate->cache_nr.  As you know
the approximate size of your project, I am not sure what extra
information you want to get by counting the paths with the skip bit
set before the first sparsified directory in the in-core index
twice, and the other paths with the skip bit set just once, and
adding these numbers together.  Also, again, I am not quite sure what
the point is of counting the paths in intmax_t.

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1368%2FHaizzz%2Fmaster-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1368/Haizzz/master-v1
> Pull-Request: https://github.com/git/git/pull/1368
>
>  sparse-index.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index e4a54ce1943..d11049c8aeb 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -493,24 +493,33 @@ void clear_skip_worktree_from_present_files(struct index_state *istate)
>  	int dir_found = 1;
>  
>  	int i;
> +	intmax_t path_count = 0;
> +	intmax_t restart_count = 0;
>  
>  	if (!core_apply_sparse_checkout ||
>  	    sparse_expect_files_outside_of_patterns)
>  		return;
>  
> +	trace2_region_enter("index", "clear_skip_worktree_from_present_files", istate->repo);
>  restart:
>  	for (i = 0; i < istate->cache_nr; i++) {
>  		struct cache_entry *ce = istate->cache[i];
>  
> -		if (ce_skip_worktree(ce) &&
> -		    path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
> -			if (S_ISSPARSEDIR(ce->ce_mode)) {
> -				ensure_full_index(istate);
> -				goto restart;
> +		if (ce_skip_worktree(ce)) {
> +			path_count++;
> +			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
> +				if (S_ISSPARSEDIR(ce->ce_mode)) {
> +					ensure_full_index(istate);
> +					restart_count++;
> +					goto restart;
> +				}
> +				ce->ce_flags &= ~CE_SKIP_WORKTREE;
>  			}
> -			ce->ce_flags &= ~CE_SKIP_WORKTREE;
>  		}
>  	}
> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count);
> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count);
> +	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);
>  }
>  
>  /*
>
> base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886
Jeff Hostetler Oct. 26, 2022, 2:13 p.m. UTC | #2
On 10/25/22 11:16 PM, Junio C Hamano wrote:
> "Anh Le via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Anh Le <anh@canva.com>
>>
>> In a large repository using sparse checkout, checking
>> whether a file marked with skip worktree is present
>> on disk and its skip worktree bit should be cleared
>> can take a considerable amount of time. Add a trace2
>> region to surface this information.
>>
>> Signed-off-by: Anh Le <anh@canva.com>
>> ---
>>      index: add trace2 region for clear skip worktree
>>      
>>      In a large repository using sparse checkout, checking whether a file
>>      marked with skip worktree is present on disk and its skip worktree bit
>>      should be cleared can take a considerable amount of time. Add a trace2
>>      region to surface this information.
> 
> It is easy to see that the change is a no-op from a functionality
> standpoint.  The condition under which ce->ce_flags loses the
> CE_SKIP_WORKTREE bit is the same as before, and the only change is
> that in the iteration a couple of variables are incremented, which
> may (or may not) have a performance impact, but shouldn't break
> correctness.
> 
> I am not sure about the value of these counters, honestly.

I suggested adding the counters while the series was still over
on GitGitGadget.

The original goal was to measure the time spent in that loop
with the region_enter/region_leave events.  They will tell you
whether the loop was slow or not, and you can compare that with
other (peer or containing) regions to see if it is significant.

However, that doesn't tell us anything about WHY it was slow or
why it was slower than another instance.  The path_count gives us
a simple estimate of the number of lstat() calls -- which is
one of the frequent sources of slowness.  Yes, it is bounded by
`istate->cache_nr`, but if I have a repo with 1M cache-entries
and 95% are sparse, this loop is going to be much slower than
if only 5% are sparse.

The restart_count (which we only expect to be 0 or 1) will
tell us whether or not at some point mid-loop we had to fully
inflate the index -- which is another source of slowness.
Granted, `ensure_full_index()` probably contains a region of
its own, so the restart_count here may not be absolutely necessary,
but in the context of this loop this counter may point to the
other smoking gun.

In the worst case, we walk the entire index and lstat() for a
significant number of skipped-and-not-present files, then near
the end of the loop, we find a skipped-but-present directory
and have to restart the loop.  The second pass will still run
the full loop again.  Will the second pass actually see any
skipped cache-entries?  Will it re-lstat() them?  Could the
`goto restart` just be a `break` or `return`?

I haven't had time to look under the hood here, but I was
hoping that these two counters would help the series author
collect telemetry over many runs and gain more insight into
the perf problem.

Continuing the example from above, if we've already paid the
costs to lstat() the 95% sparse files AND THEN near the bottom
of the loop we have to do a restart, then we should expect
this loop to be doubly slow.  It was my hope that this combination
of counters would help us understand the variations in perf.

WRT the `intmax_t` vs just `int`: either is fine.  I suggested
the former because that is what `trace2_data_intmax()` takes.
I really don't expect any usage to overflow a regular 32-bit int.

[...]
>> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count);
>> +	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count);

One thing I forgot to mention in my GGG suggestion was to only
emit these data events if the counter is > 0.  We've been trying
to avoid logging zeroes.
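
Something like this at the bottom of the function, reusing the key
names and calls from your patch (just an untested sketch):

	/* avoid logging zeroes */
	if (path_count)
		trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count);
	if (restart_count)
		trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count);
	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);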

>> +	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);
[...]

Jeff
Junio C Hamano Oct. 26, 2022, 4:01 p.m. UTC | #3
Jeff Hostetler <git@jeffhostetler.com> writes:

> In the worst case, we walk the entire index and lstat() for a
> significant number of skipped-and-not-present files, then near
> the end of the loop, we find a skipped-but-present directory
> and have to restart the loop.  The second pass will still run
> the full loop again.  Will the second pass actually see any
> skipped cache-entries?  Will it re-lstat() them?  Could the
> `goto restart` just be a `break` or `return`?
>
> I haven't had time to look under the hood here, but I was
> hoping that these two counters would help the series author
> collect telemetry over many runs and gain more insight into
> the perf problem.

Without being able to answer these questions, would we be able to
interpret the numbers reported from these counters?

> Continuing the example from above, if we've already paid the
> costs to lstat() the 95% sparse files AND THEN near the bottom
> of the loop we have to do a restart, then we should expect
> this loop to be doubly slow.  It was my hope that this combination
> of counters would help us understand the variations in perf.

Yes, I understand that double-counting may give some clue to detect
that, but it just looked like a roundabout way to do that.  Perhaps
counting the paths inspected during the first iteration and the
paths inspected during the second iteration separately, without the
"how many times did we repeat?", would give you a better picture?
"After inspecting 2400 paths, we needed to go back and then ended up
scanning 3000 paths in the flattened index in the second round"
would be easier to interpret than "We needed flattening, and scanned
5400 paths in total in the two iterations".

> WRT the `intmax_t` vs just `int`: either is fine.

I thought "int" was supposed to be natural machine word, while
incrementing "intmax_t" is allowed to be much slower than "int".
Do we expect more than 2 billion paths?
Jeff Hostetler Oct. 26, 2022, 6:29 p.m. UTC | #4
On 10/26/22 12:01 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> In the worst case, we walk the entire index and lstat() for a
>> significant number of skipped-and-not-present files, then near
>> the end of the loop, we find a skipped-but-present directory
>> and have to restart the loop.  The second pass will still run
>> the full loop again.  Will the second pass actually see any
>> skipped cache-entries?  Will it re-lstat() them?  Could the
>> `goto restart` just be a `break` or `return`?
>>
>> I haven't had time to look under the hood here, but I was
>> hoping that these two counters would help the series author
>> collect telemetry over many runs and gain more insight into
>> the perf problem.
> 
> Without being able to answer these questions, would we be able to
> interpret the numbers reported from these counters?
> 
>> Continuing the example from above, if we've already paid the
>> costs to lstat() the 95% sparse files AND THEN near the bottom
>> of the loop we have to do a restart, then we should expect
>> this loop to be doubly slow.  It was my hope that this combination
>> of counters would help us understand the variations in perf.
> 
> Yes, I understand that double-counting may give some clue to detect
> that, but it just looked like a roundabout way to do that.  Perhaps
> counting the paths inspected during the first iteration and the
> paths inspected during the second iteration separately, without the
> "how many times did we repeat?", would give you a better picture?
> "After inspecting 2400 paths, we needed to go back and then ended up
> scanning 3000 paths in the flattened index in the second round"
> would be easier to interpret than "We needed flattening, and scanned
> 5400 paths in total in the two iterations".

Good point, and I was wondering whether we might see "2 x 95%"
values for path_count in really slow cases.  And yes, it would be
better to have `int path_count[2]` and tally each loop pass
independently.
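
Something along these lines, maybe (untested sketch on top of the
posted patch; `pass` is a name I just made up):

	int path_count[2] = { 0, 0 };
	int pass = 0;	/* flips to 1 when we restart after ensure_full_index() */

restart:
	for (i = 0; i < istate->cache_nr; i++) {
		struct cache_entry *ce = istate->cache[i];

		if (ce_skip_worktree(ce)) {
			path_count[pass]++;
			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
				if (S_ISSPARSEDIR(ce->ce_mode)) {
					ensure_full_index(istate);
					pass = 1;	/* assign, not increment, so an unexpected second restart cannot overflow */
					goto restart;
				}
				ce->ce_flags &= ~CE_SKIP_WORKTREE;
			}
		}
	}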

I guess I was looking for a first-order "where is the pain?", knowing
that we could always dig deeper later.  That is, is it the lstat()s,
or is it the ensure-full and index rewrite?  Or both?

We have other places that do lstat() over the cache-entries and have
added code to spread the loop across n threads.  I doubt that is needed
here and I didn't want to lead with it.


> 
>> WRT the `intmax_t` vs just `int`: either is fine.
> 
> I thought "int" was supposed to be the natural machine word, while
> incrementing "intmax_t" is allowed to be much slower than "int".
> Do we expect more than 2 billion paths?
> 

You're right.  An `int` would be fine here.

Thanks,
Jeff
Anh Le Oct. 27, 2022, 12:04 a.m. UTC | #5
Thank you Junio and Jeff for the feedback! I hadn't considered that
`ensure_full_index()` would mean the loop can restart at most once.
I'll act on the feedback:
- count both loop passes separately with int instead of intmax_t
- remove the restart counter
- don't log a metric if it is 0
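
Roughly what I have in mind for the reporting at the end of the
function (sketch only; the key names here are placeholders and may
still change in v2):

	if (path_count[0])
		trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/first_pass_count", path_count[0]);
	if (path_count[1])
		trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/second_pass_count", path_count[1]);
	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);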

Regards,
Anh


On Thu, Oct 27, 2022 at 5:29 AM Jeff Hostetler <git@jeffhostetler.com> wrote:
> [...]

Patch

diff --git a/sparse-index.c b/sparse-index.c
index e4a54ce1943..d11049c8aeb 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -493,24 +493,33 @@  void clear_skip_worktree_from_present_files(struct index_state *istate)
 	int dir_found = 1;
 
 	int i;
+	intmax_t path_count = 0;
+	intmax_t restart_count = 0;
 
 	if (!core_apply_sparse_checkout ||
 	    sparse_expect_files_outside_of_patterns)
 		return;
 
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files", istate->repo);
 restart:
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
 
-		if (ce_skip_worktree(ce) &&
-		    path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
-			if (S_ISSPARSEDIR(ce->ce_mode)) {
-				ensure_full_index(istate);
-				goto restart;
+		if (ce_skip_worktree(ce)) {
+			path_count++;
+			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
+				if (S_ISSPARSEDIR(ce->ce_mode)) {
+					ensure_full_index(istate);
+					restart_count++;
+					goto restart;
+				}
+				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 			}
-			ce->ce_flags &= ~CE_SKIP_WORKTREE;
 		}
 	}
+	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count);
+	trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo);
 }
 
 /*