Message ID | pull.1368.git.git.1666742722502.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | index: add trace2 region for clear skip worktree | expand |
"Anh Le via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Anh Le <anh@canva.com> > > In a large repository using sparse checkout, checking > whether a file marked with skip worktree is present > on disk and its skip worktree bit should be cleared > can take a considerable amount of time. Add a trace2 > region to surface this information. > > Signed-off-by: Anh Le <anh@canva.com> > --- > index: add trace2 region for clear skip worktree > > In large repository using sparse checkout, checking whether a file > marked with skip worktree is present on disk and its skip worktree bit > should be cleared can take a considerable amount of time. Add a trace2 > region to surface this information. It is easy to see that the change is no-op from functionality's standpoint. The condition under which ce->ce_flags loses the CE_SKIP_WORKTREE bit is the same as before, and the only change is that in the iteration a couple of variables are incremented, which may (or may not) have performance impact, but shouldn't break correctness. I am not sure about the value of these counters, honestly. If we hit a sparse dir just once, we fully flatten the index and go back to the restart state, and it would be a bug if we still see a sparsified directory in the in-core index after that point, so restart_count would be at most one, wouldn't it? Why do we even need to count with intmax_t? Similarly, path_count is bounded by istate->cache_nr. As you know the approximate size of your project, I am not sure what extra information you want to get by counting the paths with the skip bit set before the first sparsified directory in the in-core index twice, and the other paths with the skip bit set just once, and adding these numbers together. Also again, I am not quite sure what the point is to count the paths in intmax_t. > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1368%2FHaizzz%2Fmaster-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1368/Haizzz/master-v1 > Pull-Request: https://github.com/git/git/pull/1368 > > sparse-index.c | 21 +++++++++++++++------ > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/sparse-index.c b/sparse-index.c > index e4a54ce1943..d11049c8aeb 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -493,24 +493,33 @@ void clear_skip_worktree_from_present_files(struct index_state *istate) > int dir_found = 1; > > int i; > + intmax_t path_count = 0; > + intmax_t restart_count = 0; > > if (!core_apply_sparse_checkout || > sparse_expect_files_outside_of_patterns) > return; > > + trace2_region_enter("index", "clear_skip_worktree_from_present_files", istate->repo); > restart: > for (i = 0; i < istate->cache_nr; i++) { > struct cache_entry *ce = istate->cache[i]; > > - if (ce_skip_worktree(ce) && > - path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { > - if (S_ISSPARSEDIR(ce->ce_mode)) { > - ensure_full_index(istate); > - goto restart; > + if (ce_skip_worktree(ce)) { > + path_count++; > + if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { > + if (S_ISSPARSEDIR(ce->ce_mode)) { > + ensure_full_index(istate); > + restart_count++; > + goto restart; > + } > + ce->ce_flags &= ~CE_SKIP_WORKTREE; > } > - ce->ce_flags &= ~CE_SKIP_WORKTREE; > } > } > + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count); > + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count); > + trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo); > } > > /* > > base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886
On 10/25/22 11:16 PM, Junio C Hamano wrote: > "Anh Le via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Anh Le <anh@canva.com> >> >> In a large repository using sparse checkout, checking >> whether a file marked with skip worktree is present >> on disk and its skip worktree bit should be cleared >> can take a considerable amount of time. Add a trace2 >> region to surface this information. >> >> Signed-off-by: Anh Le <anh@canva.com> >> --- >> index: add trace2 region for clear skip worktree >> >> In large repository using sparse checkout, checking whether a file >> marked with skip worktree is present on disk and its skip worktree bit >> should be cleared can take a considerable amount of time. Add a trace2 >> region to surface this information. > > It is easy to see that the change is no-op from functionality's > standpoint. The condition under which ce->ce_flags loses the > CE_SKIP_WORKTREE bit is the same as before, and the only change is > that in the iteration a couple of variables are incremented, which > may (or may not) have performance impact, but shouldn't break > correctness. > > I am not sure about the value of these counters, honestly. I suggested adding the counters while the series was still over on GitGitGadget. The original goal was to measure the time spent in that loop with the region_enter/region_leave events. They will tell you whether the loop was slow or not and you can compare that with other (peer or containing) regions to see if it significant. However, it doesn't tell us anything about WHY it was slow or was slower than another instance. The path_count gives us a simple estimate on the number of lstat() calls -- which is one of frequent sources of slowness. Yes, it is bounded by `istate->cache_nr`, but if I have a repo with 1M cache-entries and 95% are sparse, this loop is going to be much slower than if only 5% are sparse. The restart_count (which we only expect to be 0 or 1), will tell us whether or not at some point mid-loop we had to fully inflate the index -- which is another source of slowness. Granted, `ensure_full_index()` probably contains a region of its own, so the restart_count here may not absolutely necessary, but in the context of this loop -- this counter may point to the other smoking gun. In the worst case, we walk the entire index and lstat() for a significant number of skipped-and-not-present files, then near the end of the loop, we find a skipped-but-present directory and have to restart the loop. The second pass will still run the full loop again. Will the second pass actually see any skipped cache-entries? Will it re-lstat() them? Could the `goto restart` just be a `break` or `return`? I haven't had time to look under the hood here, but I was hoping that these two counters would help the series author collect telemetry over many runs and gain more insight into the perf problem. Continuing the example from above, if we've already paid the costs to lstat() the 95% sparse files AND THEN near the bottom of the loop we have to do a restart, then we should expect this loop to be doubly slow. It was my hope that this combination of counters would help us understand the variations in perf. WRT the `intmax_t` vs just `int`: either is fine. I suggested the former because that is what `trace2_data_intmax()` takes. I really don't expect any usage to overflow a regular 32bit int. [...] >> + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count); >> + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count); One thing I forgot to mention in my GGG suggestion was to only emit these data events if the counter is > 0. We've been trying to avoid logging zeroes. >> + trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo); [...] Jeff
Jeff Hostetler <git@jeffhostetler.com> writes: > In the worst case, we walk the entire index and lstat() for a > significant number of skipped-and-not-present files, then near > the end of the loop, we find a skipped-but-present directory > and have to restart the loop. The second pass will still run > the full loop again. Will the second pass actually see any > skipped cache-entries? Will it re-lstat() them? Could the > `goto restart` just be a `break` or `return`? > > I haven't had time to look under the hood here, but I was > hoping that these two counters would help the series author > collect telemetry over many runs and gain more insight into > the perf problem. Without being able to answer these questions, would we be able to interpret the numbers reported from these counters? > Continuing the example from above, if we've already paid the > costs to lstat() the 95% sparse files AND THEN near the bottom > of the loop we have to do a restart, then we should expect > this loop to be doubly slow. It was my hope that this combination > of counters would help us understand the variations in perf. Yes, I understand that double-counting may give some clue to detect that, but it just looked roundabout way to do that. Perhaps counting the path inspected during the first iteration and the path inspected during the second iteration, separately, without the "how many times did we repeat?", would give you a better picture? "After inspecting 2400 paths, we need to go back and then ended up scanning 3000 paths in the flattened index in the second round" would be easier to interpret than "We needed flattening, and scanned 5400 paths in total in the two iterations". > WRT the `intmax_t` vs just `int`: either is fine. I thought "int" was supposed to be natural machine word, while incrementing "intmax_t" is allowed to be much slower than "int". Do we expect more than 2 billion paths?
On 10/26/22 12:01 PM, Junio C Hamano wrote: > Jeff Hostetler <git@jeffhostetler.com> writes: > >> In the worst case, we walk the entire index and lstat() for a >> significant number of skipped-and-not-present files, then near >> the end of the loop, we find a skipped-but-present directory >> and have to restart the loop. The second pass will still run >> the full loop again. Will the second pass actually see any >> skipped cache-entries? Will it re-lstat() them? Could the >> `goto restart` just be a `break` or `return`? >> >> I haven't had time to look under the hood here, but I was >> hoping that these two counters would help the series author >> collect telemetry over many runs and gain more insight into >> the perf problem. > > Without being able to answer these questions, would we be able to > interpret the numbers reported from these counters? > >> Continuing the example from above, if we've already paid the >> costs to lstat() the 95% sparse files AND THEN near the bottom >> of the loop we have to do a restart, then we should expect >> this loop to be doubly slow. It was my hope that this combination >> of counters would help us understand the variations in perf. > > Yes, I understand that double-counting may give some clue to detect > that, but it just looked roundabout way to do that. Perhaps > counting the path inspected during the first iteration and the path > inspected during the second iteration, separately, without the "how > many times did we repeat?", would give you a better picture? "After > inspecting 2400 paths, we need to go back and then ended up scanning > 3000 paths in the flattened index in the second round" would be > easier to interpret than "We needed flattening, and scanned 5400 > paths in total in the two iterations". Good point and I was wondering about whether we might see "2 x 95%" values for path_count in really slow cases. And yes, it would be better to have `int path_count[2]` and tally each loop pass independently. I guess I was looking for a first-order "where is the pain?" knowing that we could always dig deeper later. That is, is the lstat's or is it the ensure-full and index rewrite? Or both? We have other places that do lstat() over the cache-entries and have added code to spread the loop across n threads. I doubt that is needed here and I didn't want to lead with it. > >> WRT the `intmax_t` vs just `int`: either is fine. > > I thought "int" was supposed to be natural machine word, while > incrementing "intmax_t" is allowed to be much slower than "int". > Do we expect more than 2 billion paths? > You're right. An `int` would be fine here. Thanks, Jeff
Thank you Junio and Jeff for the feedback! I hadn't considered that `ensure_full_index()` would mean the loop can only at most restart once. I'll action the feedbacks: - count both loop iterations separately with int instead of intmax_t - remove the restart counter - don't log metrics if it's 0 Regards, Anh Regards, Anh On Thu, Oct 27, 2022 at 5:29 AM Jeff Hostetler <git@jeffhostetler.com> wrote: > > > > On 10/26/22 12:01 PM, Junio C Hamano wrote: > > Jeff Hostetler <git@jeffhostetler.com> writes: > > > >> In the worst case, we walk the entire index and lstat() for a > >> significant number of skipped-and-not-present files, then near > >> the end of the loop, we find a skipped-but-present directory > >> and have to restart the loop. The second pass will still run > >> the full loop again. Will the second pass actually see any > >> skipped cache-entries? Will it re-lstat() them? Could the > >> `goto restart` just be a `break` or `return`? > >> > >> I haven't had time to look under the hood here, but I was > >> hoping that these two counters would help the series author > >> collect telemetry over many runs and gain more insight into > >> the perf problem. > > > > Without being able to answer these questions, would we be able to > > interpret the numbers reported from these counters? > > > >> Continuing the example from above, if we've already paid the > >> costs to lstat() the 95% sparse files AND THEN near the bottom > >> of the loop we have to do a restart, then we should expect > >> this loop to be doubly slow. It was my hope that this combination > >> of counters would help us understand the variations in perf. > > > > Yes, I understand that double-counting may give some clue to detect > > that, but it just looked roundabout way to do that. Perhaps > > counting the path inspected during the first iteration and the path > > inspected during the second iteration, separately, without the "how > > many times did we repeat?", would give you a better picture? "After > > inspecting 2400 paths, we need to go back and then ended up scanning > > 3000 paths in the flattened index in the second round" would be > > easier to interpret than "We needed flattening, and scanned 5400 > > paths in total in the two iterations". > > Good point and I was wondering about whether we might see "2 x 95%" > values for path_count in really slow cases. And yes, it would be > better to have `int path_count[2]` and tally each loop pass > independently. > > I guess I was looking for a first-order "where is the pain?" knowing > that we could always dig deeper later. That is, is the lstat's or > is it the ensure-full and index rewrite? Or both? > > We have other places that do lstat() over the cache-entries and have > added code to spread the loop across n threads. I doubt that is needed > here and I didn't want to lead with it. > > > > > >> WRT the `intmax_t` vs just `int`: either is fine. > > > > I thought "int" was supposed to be natural machine word, while > > incrementing "intmax_t" is allowed to be much slower than "int". > > Do we expect more than 2 billion paths? > > > > You're right. An `int` would be fine here. > > Thanks, > Jeff
diff --git a/sparse-index.c b/sparse-index.c index e4a54ce1943..d11049c8aeb 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -493,24 +493,33 @@ void clear_skip_worktree_from_present_files(struct index_state *istate) int dir_found = 1; int i; + intmax_t path_count = 0; + intmax_t restart_count = 0; if (!core_apply_sparse_checkout || sparse_expect_files_outside_of_patterns) return; + trace2_region_enter("index", "clear_skip_worktree_from_present_files", istate->repo); restart: for (i = 0; i < istate->cache_nr; i++) { struct cache_entry *ce = istate->cache[i]; - if (ce_skip_worktree(ce) && - path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { - if (S_ISSPARSEDIR(ce->ce_mode)) { - ensure_full_index(istate); - goto restart; + if (ce_skip_worktree(ce)) { + path_count++; + if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) { + if (S_ISSPARSEDIR(ce->ce_mode)) { + ensure_full_index(istate); + restart_count++; + goto restart; + } + ce->ce_flags &= ~CE_SKIP_WORKTREE; } - ce->ce_flags &= ~CE_SKIP_WORKTREE; } } + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/path_count", path_count); + trace2_data_intmax("index", istate->repo, "clear_skip_worktree_from_present_files/restart_count", restart_count); + trace2_region_leave("index", "clear_skip_worktree_from_present_files", istate->repo); } /*