[0/5] sparse-index: improve clear_skip_worktree_from_present_files()


Message ID

pull.1754.git.1718899877.gitgitgadget@gmail.com (mailing list archive)

Headers

Message-Id: <pull.1754.git.1718899877.gitgitgadget@gmail.com>
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Date: Thu, 20 Jun 2024 16:11:12 +0000
Subject: [PATCH 0/5] sparse-index: improve
 clear_skip_worktree_from_present_files()
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: gitster@pobox.com,
    newren@gmail.com,
    anh@canva.com,
    Derrick Stolee <stolee@gmail.com>

Series

sparse-index: improve clear_skip_worktree_from_present_files()

Message

Derrick Stolee via GitGitGadget June 20, 2024, 4:11 p.m. UTC

While doing some investigation in a private monorepo with sparse-checkout
and a sparse index, I accidentally left a modified file outside of my
sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
reran with GIT_TRACE2_PERF=1.

While I was able to identify clear_skip_worktree_from_present_files() as the
culprit, it took longer than desired to figure out what was going on. This
series intends to both fix the performance issue (as much as possible) and
do some refactoring to make it easier to understand what is happening.

In the end, I was able to reduce the number of lstat() calls in my case from
over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
disk cache.

Thanks, Stolee

Derrick Stolee (5):
  sparse-index: refactor skip worktree retry logic
  sparse-index: refactor path_found()
  sparse-index: use strbuf in path_found()
  sparse-index: count lstat() calls
  sparse-index: improve lstat caching of sparse paths

 sparse-index.c | 220 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 167 insertions(+), 53 deletions(-)


base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1754

Comments

Junio C Hamano June 20, 2024, 7:16 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
> disk cache.
>
> Thanks, Stolee

That's impressive but I cannot offhand tell how big 170k (or 6.5k
for that matter) is relative to the size of the tree.  How many
paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
-l") vs the number of the in-cone paths in the working tree?  

If 6.5k is in the same ballpark as the latter, it would be really
good.

Thanks.

Derrick Stolee June 20, 2024, 8:21 p.m. UTC | #2

On 6/20/24 3:16 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> While doing some investigation in a private monorepo with sparse-checkout
>> and a sparse index, I accidentally left a modified file outside of my
>> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
>> reran with GIT_TRACE2_PERF=1.
>>
>> While I was able to identify clear_skip_worktree_from_present_files() as the
>> culprit, it took longer than desired to figure out what was going on. This
>> series intends to both fix the performance issue (as much as possible) and
>> do some refactoring to make it easier to understand what is happening.
>>
>> In the end, I was able to reduce the number of lstat() calls in my case from
>> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
>> disk cache.
>>
>> Thanks, Stolee
> 
> That's impressive but I cannot offhand tell how big 170k (or 6.5k
> for that matter) is relative to the size of the tree.  How many
> paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
> -l") vs the number of the in-cone paths in the working tree?
> 
> If 6.5k is in the same ballpark as the latter, it would be really
> good.

You're right, I didn't include the full context here. The repo has
about 2.1 million paths at HEAD, but most of them are sparse.

In Patch 5, I detail that there are 1,841,997 total sparse files in
the expanded index. Thus, the previous caching algorithm was already
doing decent work and calling lstat() 11x fewer times than the naive
implementation.

The new caching algorithm improves this to 6,521, which is a 282x
improvement over naive and a 26x improvement over the previous
caching algorithm.

But what you are really asking is how close this is to the optimal.
I didn't include that in Patch 5 details, but I was able to look at
my notes and see that the sparse_path_count data point was 1,962,
meaning there are that many sparse trees in the sparse index before
expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
absolute minimum required.
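
For anyone who wants to recheck the ratios above, here is a minimal C
sketch that recomputes them from the numbers quoted in this thread. The
macro and function names are made up for illustration and are not part
of the series. The 170,000 figure is only a lower bound ("over
170,000"), so the two ratios that use it are approximate.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted in this thread; the macro names are illustrative only. */
#define TOTAL_SPARSE_FILES 1841997L /* sparse files in the expanded index */
#define LSTAT_OLD_APPROX 170000L    /* "over 170,000", so a lower bound */
#define LSTAT_NEW 6521L             /* lstat() calls with the new caching */
#define SPARSE_TREES 1962L          /* sparse_path_count before expanding */

/* Return how many times larger `larger` is than `smaller`. */
double ratio(long larger, long smaller)
{
	assert(smaller > 0);
	return (double)larger / (double)smaller;
}

/* Print the four ratios discussed above, one per line. */
void print_ratios(void)
{
	printf("previous cache vs naive: %.1fx\n",
	       ratio(TOTAL_SPARSE_FILES, LSTAT_OLD_APPROX));
	printf("new cache vs naive: %.1fx\n",
	       ratio(TOTAL_SPARSE_FILES, LSTAT_NEW));
	printf("new cache vs previous cache: %.1fx\n",
	       ratio(LSTAT_OLD_APPROX, LSTAT_NEW));
	printf("new cache vs minimum: %.1fx\n",
	       ratio(LSTAT_NEW, SPARSE_TREES));
}
```

Calling print_ratios() from a main() should give values that round to
the 11x, 282x, 26x and 3.3x figures mentioned above.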

Does that help answer the questions you had? I'm happy to provide
more information.

Thanks,
-Stolee

Junio C Hamano June 20, 2024, 9:02 p.m. UTC | #3

Derrick Stolee <stolee@gmail.com> writes:

> But what you are really asking is how close this is to the optimal.
> I didn't include that in Patch 5 details, but I was able to look at
> my notes and see that the sparse_path_count data point was 1,962,
> meaning there are that many sparse trees in the sparse index before
> expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
> absolute minimum required.

Nice, and still impressive.  Thanks.