Message ID | pull.1754.git.1718899877.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Series | sparse-index: improve clear_skip_worktree_from_present_files() |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
> disk cache. Thanks, Stolee

That's impressive but I cannot offhand tell how big 170k (or 6.5k
for that matter) is relative to the size of the tree. How many
paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
-l") vs the number of the in-cone paths in the working tree?

If 6.5k is in the same ballpark as the latter, it would be really
good.

Thanks.
On 6/20/24 3:16 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> While doing some investigation in a private monorepo with sparse-checkout
>> and a sparse index, I accidentally left a modified file outside of my
>> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
>> reran with GIT_TRACE2_PERF=1.
>>
>> While I was able to identify clear_skip_worktree_from_present_files() as the
>> culprit, it took longer than desired to figure out what was going on. This
>> series intends to both fix the performance issue (as much as possible) and
>> do some refactoring to make it easier to understand what is happening.
>>
>> In the end, I was able to reduce the number of lstat() calls in my case from
>> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
>> disk cache. Thanks, Stolee
>
> That's impressive but I cannot offhand tell how big 170k (or 6.5k
> for that matter) is relative to the size of the tree. How many
> paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
> -l") vs the number of the in-cone paths in the working tree?
>
> If 6.5k is in the same ballpark as the latter, it would be really
> good.

You're right, I didn't include the full context here.

The repo has about 2.1 million paths at HEAD, but most of them are
sparse. In Patch 5, I detail that there are 1,841,997 total sparse
files in the expanded index. Thus, the previous caching algorithm
was already doing decent work and calling lstat() 11x fewer times
than the naive implementation. The new caching algorithm improves
this to 6,521, which is a 282x improvement over naive and a 26x
improvement over the previous caching algorithm.

But what you are really asking is how close this is to the optimal.
I didn't include that in Patch 5 details, but I was able to look at
my notes and see that the sparse_path_count data point was 1,962,
meaning there are that many sparse trees in the sparse index before
expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
absolute minimum required.

Does that help answer the questions you had? I'm happy to provide
more information.

Thanks,
-Stolee
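For readers who want to check the ratios quoted above, the following is a minimal Python sketch of the arithmetic. The four input numbers come from the thread. The variable names are illustrative, and the "previous" count is only given as "over 170,000", so the ratios that depend on it are approximate.

```python
# Figures quoted in the thread. Variable names are illustrative only.
total_sparse_files = 1_841_997  # sparse files in the expanded index (naive: one lstat() each)
old_lstat_calls = 170_000       # previous caching algorithm ("over 170,000", so approximate)
new_lstat_calls = 6_521         # new caching algorithm
sparse_tree_count = 1_962       # sparse_path_count: sparse trees before expanding


def ratio(larger: int, smaller: int) -> float:
    """Return how many times larger the first count is than the second."""
    return larger / smaller


ratios = {
    "naive_vs_old": ratio(total_sparse_files, old_lstat_calls),
    "naive_vs_new": ratio(total_sparse_files, new_lstat_calls),
    "old_vs_new": ratio(old_lstat_calls, new_lstat_calls),
    "new_vs_minimum": ratio(new_lstat_calls, sparse_tree_count),
}

for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
```

Rounded, these match the 11x, 282x, 26x, and 3.3x figures given in the message.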
Derrick Stolee <stolee@gmail.com> writes:

> But what you are really asking is how close this is to the optimal.
> I didn't include that in Patch 5 details, but I was able to look at
> my notes and see that the sparse_path_count data point was 1,962,
> meaning there are that many sparse trees in the sparse index before
> expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
> absolute minimum required.

Nice, and still impressive.

Thanks.