[5/5] sparse-index: improve lstat caching of sparse paths

Message ID	2654fcb7142a606c5684c762ed28bb5e8d9b4712.1718899877.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Received: from mail-wm1-f42.google.com (mail-wm1-f42.google.com [209.85.128.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8C65B1B29B9 for <git@vger.kernel.org>; Thu, 20 Jun 2024 16:11:25 +0000 (UTC) Message-Id: <2654fcb7142a606c5684c762ed28bb5e8d9b4712.1718899877.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1754.git.1718899877.gitgitgadget@gmail.com> References: <pull.1754.git.1718899877.gitgitgadget@gmail.com> Date: Thu, 20 Jun 2024 16:11:17 +0000 Subject: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, anh@canva.com, Derrick Stolee <stolee@gmail.com>, Derrick Stolee <stolee@gmail.com> From: Derrick Stolee <stolee@gmail.com>
Series	sparse-index: improve clear_skip_worktree_from_present_files() \| expand [0/5] sparse-index: improve clear_skip_worktree_from_present_files() [1/5] sparse-index: refactor skip worktree retry logic [2/5] sparse-index: refactor path_found() [3/5] sparse-index: use strbuf in path_found() [4/5] sparse-index: count lstat() calls [5/5] sparse-index: improve lstat caching of sparse paths

Message ID

2654fcb7142a606c5684c762ed28bb5e8d9b4712.1718899877.git.gitgitgadget@gmail.com (mailing list archive)

State

Superseded

Headers

Message-Id: 
 <2654fcb7142a606c5684c762ed28bb5e8d9b4712.1718899877.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1754.git.1718899877.gitgitgadget@gmail.com>
References: <pull.1754.git.1718899877.gitgitgadget@gmail.com>
Date: Thu, 20 Jun 2024 16:11:17 +0000
Subject: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: gitster@pobox.com,
    newren@gmail.com,
    anh@canva.com,
    Derrick Stolee <stolee@gmail.com>,
    Derrick Stolee <stolee@gmail.com>
From: Derrick Stolee <stolee@gmail.com>

Series

sparse-index: improve clear_skip_worktree_from_present_files() | expand

Commit Message

Derrick Stolee June 20, 2024, 4:11 p.m. UTC

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout cone. The initial implementation would lstat() every
single sparse tree to see if it existed, and if one did, then the sparse
index would expand and every sparse file would be checked.

Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist. However,
there are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing,
even if a higher parent directory also does not exist. This means that
wasted lstat() calls would occur when the sparse files change immediate
parent directories but within the same root directory that does not
exist.

To set up a scenario that triggers this code in an interesting way, we
need a sparse-checkout in cone mode and a sparse index. To trigger the
full index expansion and a call to the
clear_skip_worktree_from_present_files_full() method, we need one of the
sparse trees to actually exist on disk. The performance test script
p2000-sparse-operations.sh takes the sample repository and copies its
HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
and also lead to some interesting cases for the caching algorithm since
"f1/f1/" exists but "f1/f2/" and "f3/" do not.

This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.

This change reorganizes the caching algorithm to focus on storing both
the deepest _existing_ directory and the next-level non-existing
directory. By doing a little extra work on the first sparse file, we can
short-circuit all of the sparse files that exist in that non-existing
directory. When in a repository where the first sparse file is likely to
have a much deeper path than the first non-existing directory, this can
realize significant gains.

The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.

It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.

Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the index contains ~190,000
entries (including over two thousand sparse trees), I was able to
measure these lstat counts when one sparse directory actually exists on
disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before):   173,259
       full_lstat_count  (after):     6,521

This resulted in this absolute time change, on a warm disk:

      Time in full loop (before): 2.527 s
      Time in full loop  (after): 0.071 s

(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 118 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 93 insertions(+), 25 deletions(-)

Comments

Elijah Newren June 24, 2024, 10:14 p.m. UTC | #1

On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The clear_skip_worktree_from_present_files() method was first introduced
> in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
> present in worktree, 2022-01-14) to allow better interaction with the
> working directory in the presence of paths outside of the
> sparse-checkout cone.

s/cone//

> The initial implementation would lstat() every
> single sparse tree to see if it existed, and if one did, then the sparse
> index would expand and every sparse file would be checked.

This sounds like the algorithm only lstat()ed the sparse directories,
which feels misleading.  Perhaps

The initial implementation would lstat() every SKIP_WORKTREE path to
see if it existed; if it ran across a sparse directory that existed
(when a sparse-index was in use), then it would expand the index and
then check every SKIP_WORKTREE path.

> Since these lstat() calls were very expensive, this was improved in
> d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
> caching, 2022-01-14) by caching directories that do not exist. However,
> there are some inefficiencies in that caching mechanism.

Maybe this is obvious, but I thought a few extra words to end the
second-to-last sentence would be helpful, such as "so it could avoid
lstat()ing any files under such directories".

> The caching mechanism stored only the parent directory as not existing,
> even if a higher parent directory also does not exist. This means that
> wasted lstat() calls would occur when the sparse files change immediate
> parent directories but within the same root directory that does not
> exist.

This is the crucial insight that makes this patch improve things so much.

> To set up a scenario that triggers this code in an interesting way, we
> need a sparse-checkout in cone mode and a sparse index. To trigger the

I think you state this too strongly.  While trying to duplicate, I
first went with a cone mode & sparse index at first, but out of
curiosity tried it without either of these modes set and still saw
dramatic improvement from your patch.  What is needed is that the
sparsity is such that entire directories are missing, and not just one
level above the files of interest.  That is more likely to occur when
cone mode and perhaps sparse index are in use, but perhaps consider
changing "we need" to "it is easiest to consider"

> full index expansion and a call to the
> clear_skip_worktree_from_present_files_full() method, we need one of the
> sparse trees to actually exist on disk. The performance test script
> p2000-sparse-operations.sh takes the sample repository and copies its
> HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
> then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
> and also lead to some interesting cases for the caching algorithm since
> "f1/f1/" exists but "f1/f2/" and "f3/" do not.

For some reason I had difficulty triggering a case using this guide.
I might have made an error, but I decided I wanted a deeper directory
tree to test with anyway.  After some playing around to come up with
an interesting testcase, I eventually came up with the following steps
to reproduce in case anyone else wants to try:

    git clone https://github.com/newren/gvfs-like-git-bomb
    cd gvfs-like-git-bomb
    ./runme.sh
    git sparse-checkout set bomb/b/c            # incidentally cone mode
    mkdir -p bomb/d/e/f/a/a
    git ls-files -t | colrm 2 | sort | uniq -c  # optional, but interesting
    GIT_TRACE2_PERF=$(pwd)/trace git ls-files -t >/dev/null
    grep lstat_count trace

Further, you can recompile the git version in use in another window,
then come back to this one and run 'rm trace' followed by the last two
commands to retest.

The commands above create a 'gvfs-like-git-bomb' git directory that
has 1,000,001 files in HEAD.

With this test directory, before applying this patch, I see:
    ..sparse_lstat_count:722011
After applying this patch I see
    ..sparse_lstat_count:135
and with a slight tweak to your patch I see
    ..sparse_lstat_count:125
I'll comment on the slight tweak at the end of the patch.

> This is difficult to notice when running performance tests using the Git
> repository (or a blow-up of the Git repository, as in
> p2000-sparse-operations.sh) because Git has a very shallow directory
> structure.
>
> This change reorganizes the caching algorithm to focus on storing both
> the deepest _existing_ directory and the next-level non-existing
> directory.

This was slightly hard to parse for me, and misled me into thinking
you were tracking two directories.  Maybe:

This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist (i.e. we are
restricting to the leading directory whose parent directory does
exist).

> By doing a little extra work on the first sparse file, we can
> short-circuit all of the sparse files that exist in that non-existing
> directory.

Here you use "exist" as "tracked by git" in one case, and "appears in
the working tree" in another.  That's a problem, because the files in
question are tracked by git but do not appear in the working tree,
making it impossible for people to understand unless they guess the
correct definition for each use.  I think we want "exist" to just mean
"appears in the working tree" here, so we'd need to s/sparse files
that exist in/sparse files underneath/ (or something similar) to fix
this sentence.

Also, you've used the phrase "sparse file(s)" a number of times in
this commit message; I think I know what you mean, but it is not
defined in the vocabulary section of
Documentation/technical/sparse-checkout.txt.  Together with the above
problem, it made me question what was meant, re-read all the
definitions, etc.  Perhaps "sparse file(s)" should be added to that
vocabulary section, though...especially if we are going to use it and
since we never fixed "sparse directory" despite mentioning that we
wanted to?

> When in a repository where the first sparse file is likely to
> have a much deeper path than the first non-existing directory, this can
> realize significant gains.
>
> The details of this algorithm require careful attention, so the new
> implementation of path_found() has detailed comments, including the use
> of a new max_common_dir_prefix() method that may be of independent
> interest.

These comments are well written and very helpful; thanks for including them.

> It's worth noting that this is not universally positive, since we are
> doing extra lstat() calls to establish the exact path to cache. In the
> blow-up of the Git repository, we can see that the lstat count
> _increases_ from 28 to 31. However, these numbers were already
> artificially low.

Yeah, I spent a while thinking about whether there were funny cases
where your algorithm might significantly increase the number of
lstat() calls for non-sparse-index, non-cone mode cases and came up
kind of empty handed.  It seems you only ever add a modest number of
lstat() calls, and in common scenarios (entire non-leaf directories
missing) you remove a dramatic number of lstat() calls.

> Using an internal monorepo with over two million paths at HEAD and a
> typical sparse-checkout cone such that the index contains ~190,000
> entries (including over two thousand sparse trees), I was able to
> measure these lstat counts when one sparse directory actually exists on
> disk:
>
>   Sparse files in expanded index: 1,841,997
>        full_lstat_count (before):   173,259
>        full_lstat_count  (after):     6,521
>
> This resulted in this absolute time change, on a warm disk:
>
>       Time in full loop (before): 2.527 s
>       Time in full loop  (after): 0.071 s

Very nice.  :-)

> (These times were calculated on a Windows machine, where lstat() is
> slower than a similar Linux machine.)
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 118 ++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 93 insertions(+), 25 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index 8577fa726b8..cccd8550dfe 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate)
>  }
>
>  struct path_found_data {
> +       /**
> +        * The path stored in 'dir', if non-empty, corresponds to the most-
> +        * recent path that we checked where:
> +        *
> +        *   1. The path should be a directory, according to the index.
> +        *   2. The path does not exist.
> +        *   3. The parent path _does_ exist. (This may be the root of the
> +        *      working directory.)
> +        */
>         struct strbuf dir;
> -       int dir_found;
>         size_t lstat_count;
>  };
>
>  #define PATH_FOUND_DATA_INIT { \
> -       .dir = STRBUF_INIT, \
> -       .dir_found = 1 \
> +       .dir = STRBUF_INIT \
>  }
>
>  static void clear_path_found_data(struct path_found_data *data)
> @@ -455,50 +462,111 @@ static void clear_path_found_data(struct path_found_data *data)
>         strbuf_release(&data->dir);
>  }
>
> +/**
> + * Return the length of the largest common substring that ends in a
> + * slash ('/') to indicate the largest common parent directory. Returns
> + * zero if no common directory exists.
> + */
> +static size_t max_common_dir_prefix(const char *path1, const char *path2)
> +{
> +       size_t common_prefix = 0;
> +       for (size_t i = 0; path1[i] && path2[i]; i++) {
> +               if (path1[i] != path2[i])
> +                       break;
> +
> +               /*
> +                * If they agree at a directory separator, then add one
> +                * to make sure it is included in the common prefix string.
> +                */
> +               if (path1[i] == '/')
> +                       common_prefix = i + 1;
> +       }
> +
> +       return common_prefix;
> +}
> +
>  static int path_found(const char *path, struct path_found_data *data)
>  {
>         struct stat st;
> -       char *newdir;
> +       size_t common_prefix;
>
>         /*
> -        * If dirname corresponds to a directory that doesn't exist, and this
> -        * path starts with dirname, then path can't exist.
> +        * If data->dir is non-empty, then it contains a path that doesn't
> +        * exist, including an ending slash ('/'). If it is a prefix of 'path',
> +        * then we can return 0.
>          */
> -       if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
> +       if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
>                 return 0;
>
>         /*
> -        * If path itself exists, return 1.
> +        * Otherwise, we must check if the current path exists. If it does, then
> +        * return 1. The cached directory will be skipped until we come across
> +        * a missing path again.
>          */
>         data->lstat_count++;
>         if (!lstat(path, &st))
>                 return 1;
>
>         /*
> -        * Otherwise, path does not exist so we'll return 0...but we'll first
> -        * determine some info about its parent directory so we can avoid
> -        * lstat calls for future cache entries.
> +        * At this point, we know that 'path' doesn't exist, and we know that
> +        * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
> +        * to be the top-most non-existing directory of 'path'. If the first
> +        * parent of 'path' exists, then we will act ast though 'path'

s/ast/as/

> +        * corresponds to a directory (by adding a slash).
>          */
> -       newdir = strrchr(path, '/');
> -       if (!newdir)
> -               return 0; /* Didn't find a parent dir; just return 0 now. */
> +       common_prefix = max_common_dir_prefix(path, data->dir.buf);
>
>         /*
> -        * If path starts with directory (which we already lstat'ed and found),
> -        * then no need to lstat parent directory again.
> +        * At this point, 'path' and 'data->dir' have a common existing parent
> +        * directory given by path[0..common_prefix] (which could have length 0).
> +        * We "grow" the data->dir buffer by checking for existing directories
> +        * along 'path'.
>          */
> -       if (data->dir_found && data->dir.buf &&
> -           memcmp(path, data->dir.buf, data->dir.len))
> -               return 0;
>
> -       /* Free previous dirname, and cache path's dirname */
> -       strbuf_reset(&data->dir);
> -       strbuf_add(&data->dir, path, newdir - path + 1);
> +       strbuf_setlen(&data->dir, common_prefix);
> +       while (1) {
> +               /* Find the next directory in 'path'. */
> +               const char *next_slash = strchr(path + data->dir.len, '/');
>
> -       data->lstat_count++;
> -       data->dir_found = !lstat(data->dir.buf, &st);
> +               /*
> +                * If there are no more slashes, then 'path' doesn't contain a
> +                * non-existent _parent_ directory. Set 'data->dir' to be equal
> +                * to 'path' plus an additional slash, so it can be used for
> +                * caching in the future. The filename of 'path' is considered
> +                * a non-existent directory.
> +                *
> +                * Note: if "{path}/" exists as a directory, then it will never
> +                * appear as a prefix of other callers to this method, assuming
> +                * the context from the clear_skip_worktree... methods. If this
> +                * method is reused, then this must be reconsidered.
> +                */
> +               if (!next_slash) {
> +                       strbuf_addstr(&data->dir, path + data->dir.len);
> +                       strbuf_addch(&data->dir, '/');
> +                       break;
> +               }
>
> -       return 0;
> +               /*
> +                * Now that we have a slash, let's grow 'data->dir' to include
> +                * this slash, then test if we should stop.
> +                */
> +               strbuf_add(&data->dir, path + data->dir.len,
> +                          (next_slash - path) - data->dir.len + 1);

I had to re-read this multiple times and was confused by it.  I
eventually realized it was simple -- you use "path + data->dir.len"
3-4 times in this loop.  Could we reduce the cognitive overhead by
setting some variable to this value at the beginning within the loop
and then just use it?  It'd simplify this particular call to

    strbuf_add(&data->dir, rest, next_slash - rest + 1);

or substitute any other variable name for "rest" there.  Maybe it
shouldn't be a big deal, but the rest of the method was complex enough
that I just blew my local stack space at this point.  I think this
simple substitution would have made it easier for me.

> +
> +               /* If the path doesn't exist, then stop here. */
> +               data->lstat_count++;
> +               if (lstat(data->dir.buf, &st))
> +                       return 0;
> +       }
> +
> +       /*
> +        * At this point, 'data->dir' is equal to 'path' plus a slash character,
> +        * and the parent directory of 'path' definitely exists. Let's return
> +        * the case of whether 'path' exists.
> +        */

Can I suggest adding the following to this comment?
  " path does not exist at this point or we would have already
returned 1 above when we lstat()ed it before the above loop. "

> +
> +       data->lstat_count++;
> +       return !lstat(path, &st);

...and, as long as I didn't missing something with my above comment
suggestion, these two lines can be replaced by

    return 0;

Or did I miss something here?


Anyway, despite the many small comments I made, well done!  I think
the method is not only much more performant, but more readable than my
original.  With a few small tweaks, it should be good to merge.

Junio C Hamano June 25, 2024, 12:08 a.m. UTC | #2

Elijah Newren <newren@gmail.com> writes:

> ...  It'd simplify this particular call to
>
>     strbuf_add(&data->dir, rest, next_slash - rest + 1);
>
> or substitute any other variable name for "rest" there.  Maybe it
> shouldn't be a big deal, but ...

Good observation.  Naming is hard, but it often happens that a
temporary variable that is given an appropriate name improves the
readability of the code a lot.

> Anyway, despite the many small comments I made, well done!  I think
> the method is not only much more performant, but more readable than my
> original.  With a few small tweaks, it should be good to merge.

Yup, and your review was a delight to read.

Thanks, both.

Derrick Stolee June 26, 2024, 1:06 p.m. UTC | #3

On 6/24/24 6:14 PM, Elijah Newren wrote:
> On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The clear_skip_worktree_from_present_files() method was first introduced
>> in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
>> present in worktree, 2022-01-14) to allow better interaction with the
>> working directory in the presence of paths outside of the
>> sparse-checkout cone.
> 
> s/cone//
> 
>> The initial implementation would lstat() every
>> single sparse tree to see if it existed, and if one did, then the sparse
>> index would expand and every sparse file would be checked.
> 
> This sounds like the algorithm only lstat()ed the sparse directories,
> which feels misleading.  Perhaps
> 
> The initial implementation would lstat() every SKIP_WORKTREE path to
> see if it existed; if it ran across a sparse directory that existed
> (when a sparse-index was in use), then it would expand the index and
> then check every SKIP_WORKTREE path.
> 
>> Since these lstat() calls were very expensive, this was improved in
>> d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
>> caching, 2022-01-14) by caching directories that do not exist. However,
>> there are some inefficiencies in that caching mechanism.
> 
> Maybe this is obvious, but I thought a few extra words to end the
> second-to-last sentence would be helpful, such as "so it could avoid
> lstat()ing any files under such directories".

Thanks. I will use both of these suggestions verbatim.

>> To set up a scenario that triggers this code in an interesting way, we
>> need a sparse-checkout in cone mode and a sparse index. To trigger the
> 
> I think you state this too strongly.  While trying to duplicate, I
> first went with a cone mode & sparse index at first, but out of
> curiosity tried it without either of these modes set and still saw
> dramatic improvement from your patch.  What is needed is that the
> sparsity is such that entire directories are missing, and not just one
> level above the files of interest.  That is more likely to occur when
> cone mode and perhaps sparse index are in use, but perhaps consider
> changing "we need" to "it is easiest to consider"

Absolutely. This stems from my earlier assumption that the sparse index
was required, but it is not. I have reworded locally.

>> full index expansion and a call to the
>> clear_skip_worktree_from_present_files_full() method, we need one of the
>> sparse trees to actually exist on disk. The performance test script
>> p2000-sparse-operations.sh takes the sample repository and copies its
>> HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
>> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
>> then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
>> and also lead to some interesting cases for the caching algorithm since
>> "f1/f1/" exists but "f1/f2/" and "f3/" do not.
> 
> For some reason I had difficulty triggering a case using this guide.
> I might have made an error, but I decided I wanted a deeper directory
> tree to test with anyway.  After some playing around to come up with
> an interesting testcase, I eventually came up with the following steps
> to reproduce in case anyone else wants to try:
> 
>      git clone https://github.com/newren/gvfs-like-git-bomb
>      cd gvfs-like-git-bomb
>      ./runme.sh
>      git sparse-checkout set bomb/b/c            # incidentally cone mode
>      mkdir -p bomb/d/e/f/a/a
>      git ls-files -t | colrm 2 | sort | uniq -c  # optional, but interesting
>      GIT_TRACE2_PERF=$(pwd)/trace git ls-files -t >/dev/null
>      grep lstat_count trace
> 
> Further, you can recompile the git version in use in another window,
> then come back to this one and run 'rm trace' followed by the last two
> commands to retest.
> 
> The commands above create a 'gvfs-like-git-bomb' git directory that
> has 1,000,001 files in HEAD.
> 
> With this test directory, before applying this patch, I see:
>      ..sparse_lstat_count:722011
> After applying this patch I see
>      ..sparse_lstat_count:135
> and with a slight tweak to your patch I see
>      ..sparse_lstat_count:125
> I'll comment on the slight tweak at the end of the patch.

Thanks for these numbers! Are you willing to keep that example repo
on GitHub so I can refer to it in the message?

>> This is difficult to notice when running performance tests using the Git
>> repository (or a blow-up of the Git repository, as in
>> p2000-sparse-operations.sh) because Git has a very shallow directory
>> structure.
>>
>> This change reorganizes the caching algorithm to focus on storing both
>> the deepest _existing_ directory and the next-level non-existing
>> directory.
> 
> This was slightly hard to parse for me, and misled me into thinking
> you were tracking two directories.  Maybe:
> 
> This change reorganizes the caching algorithm to focus on storing the
> highest level leading directory that does not exist (i.e. we are
> restricting to the leading directory whose parent directory does
> exist).

I have reworded in a slightly different way, but with the same meaning
as you provide here.

>> By doing a little extra work on the first sparse file, we can
>> short-circuit all of the sparse files that exist in that non-existing
>> directory.
> 
> Here you use "exist" as "tracked by git" in one case, and "appears in
> the working tree" in another.  That's a problem, because the files in
> question are tracked by git but do not appear in the working tree,
> making it impossible for people to understand unless they guess the
> correct definition for each use.  I think we want "exist" to just mean
> "appears in the working tree" here, so we'd need to s/sparse files
> that exist in/sparse files underneath/ (or something similar) to fix
> this sentence.
> 
> Also, you've used the phrase "sparse file(s)" a number of times in
> this commit message; I think I know what you mean, but it is not
> defined in the vocabulary section of
> Documentation/technical/sparse-checkout.txt.  Together with the above
> problem, it made me question what was meant, re-read all the
> definitions, etc.  Perhaps "sparse file(s)" should be added to that
> vocabulary section, though...especially if we are going to use it and
> since we never fixed "sparse directory" despite mentioning that we
> wanted to?

You're right. My use of "sparse file" or "sparse directory" was intended
to mean "a path from a cache entry with SKIP_WORKTREE that is either a
blob or tree" but that isn't necessary here.

Instead, I'll focus my attention on paths being passed to path_found(),
which removes the context of an index. The danger there is that the
caching assumes that we are moving through the paths in an order such
as in the index, but that's a natural type of ordering for paths.

>>          /*
>> -        * Otherwise, path does not exist so we'll return 0...but we'll first
>> -        * determine some info about its parent directory so we can avoid
>> -        * lstat calls for future cache entries.
>> +        * At this point, we know that 'path' doesn't exist, and we know that
>> +        * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
>> +        * to be the top-most non-existing directory of 'path'. If the first
>> +        * parent of 'path' exists, then we will act ast though 'path'
> 
> s/ast/as/

Oops! Thanks.

>> +               /*
>> +                * Now that we have a slash, let's grow 'data->dir' to include
>> +                * this slash, then test if we should stop.
>> +                */
>> +               strbuf_add(&data->dir, path + data->dir.len,
>> +                          (next_slash - path) - data->dir.len + 1);
> 
> I had to re-read this multiple times and was confused by it.  I
> eventually realized it was simple -- you use "path + data->dir.len"
> 3-4 times in this loop.  Could we reduce the cognitive overhead by
> setting some variable to this value at the beginning within the loop
> and then just use it?  It'd simplify this particular call to
> 
>      strbuf_add(&data->dir, rest, next_slash - rest + 1);
> 
> or substitute any other variable name for "rest" there.  Maybe it
> shouldn't be a big deal, but the rest of the method was complex enough
> that I just blew my local stack space at this point.  I think this
> simple substitution would have made it easier for me.

Excellent. Thanks for recognizing this pattern and recommending a
simplification.

>> +       /*
>> +        * At this point, 'data->dir' is equal to 'path' plus a slash character,
>> +        * and the parent directory of 'path' definitely exists. Let's return
>> +        * the case of whether 'path' exists.
>> +        */
> 
> Can I suggest adding the following to this comment?
>    " path does not exist at this point or we would have already
> returned 1 above when we lstat()ed it before the above loop. "
> 
>> +
>> +       data->lstat_count++;
>> +       return !lstat(path, &st);
> 
> ...and, as long as I didn't missing something with my above comment
> suggestion, these two lines can be replaced by
> 
>      return 0;
> 
> Or did I miss something here?

No, you understood correctly. Thanks for the further simplification and
speed improvement.

> Anyway, despite the many small comments I made, well done!  I think
> the method is not only much more performant, but more readable than my
> original.  With a few small tweaks, it should be good to merge.

Thanks for your detailed review. I'll put these updates into a v2 and
send it later today. I need to retest in my monorepo example and add
your publicly-available example before v2 will be ready.

Thanks,
-Stolee

Elijah Newren June 28, 2024, 12:10 a.m. UTC | #4

On Wed, Jun 26, 2024 at 6:06 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/24/24 6:14 PM, Elijah Newren wrote:
[...]
> > Further, you can recompile the git version in use in another window,
> > then come back to this one and run 'rm trace' followed by the last two
> > commands to retest.
> >
> > The commands above create a 'gvfs-like-git-bomb' git directory that
> > has 1,000,001 files in HEAD.
> >
> > With this test directory, before applying this patch, I see:
> >      ..sparse_lstat_count:722011
> > After applying this patch I see
> >      ..sparse_lstat_count:135
> > and with a slight tweak to your patch I see
> >      ..sparse_lstat_count:125
> > I'll comment on the slight tweak at the end of the patch.
>
> Thanks for these numbers! Are you willing to keep that example repo
> on GitHub so I can refer to it in the message?

Sure, I've had it up for several years already without change (though
the anachronistic 'gvfs' in the repository name feels like it should
be updated...)

diff --git a/sparse-index.c b/sparse-index.c
index 8577fa726b8..cccd8550dfe 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,14 +440,21 @@  void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
+	/**
+	 * The path stored in 'dir', if non-empty, corresponds to the most-
+	 * recent path that we checked where:
+	 *
+	 *   1. The path should be a directory, according to the index.
+	 *   2. The path does not exist.
+	 *   3. The parent path _does_ exist. (This may be the root of the
+	 *      working directory.)
+	 */
 	struct strbuf dir;
-	int dir_found;
 	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
-	.dir = STRBUF_INIT, \
-	.dir_found = 1 \
+	.dir = STRBUF_INIT \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
@@ -455,50 +462,111 @@  static void clear_path_found_data(struct path_found_data *data)
 	strbuf_release(&data->dir);
 }
 
+/**
+ * Return the length of the largest common substring that ends in a
+ * slash ('/') to indicate the largest common parent directory. Returns
+ * zero if no common directory exists.
+ */
+static size_t max_common_dir_prefix(const char *path1, const char *path2)
+{
+	size_t common_prefix = 0;
+	for (size_t i = 0; path1[i] && path2[i]; i++) {
+		if (path1[i] != path2[i])
+			break;
+
+		/*
+		 * If they agree at a directory separator, then add one
+		 * to make sure it is included in the common prefix string.
+		 */
+		if (path1[i] == '/')
+			common_prefix = i + 1;
+	}
+
+	return common_prefix;
+}
+
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
-	char *newdir;
+	size_t common_prefix;
 
 	/*
-	 * If dirname corresponds to a directory that doesn't exist, and this
-	 * path starts with dirname, then path can't exist.
+	 * If data->dir is non-empty, then it contains a path that doesn't
+	 * exist, including an ending slash ('/'). If it is a prefix of 'path',
+	 * then we can return 0.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
+	if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
-	 * If path itself exists, return 1.
+	 * Otherwise, we must check if the current path exists. If it does, then
+	 * return 1. The cached directory will be skipped until we come across
+	 * a missing path again.
 	 */
 	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
 	/*
-	 * Otherwise, path does not exist so we'll return 0...but we'll first
-	 * determine some info about its parent directory so we can avoid
-	 * lstat calls for future cache entries.
+	 * At this point, we know that 'path' doesn't exist, and we know that
+	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
+	 * to be the top-most non-existing directory of 'path'. If the first
+	 * parent of 'path' exists, then we will act ast though 'path'
+	 * corresponds to a directory (by adding a slash).
 	 */
-	newdir = strrchr(path, '/');
-	if (!newdir)
-		return 0; /* Didn't find a parent dir; just return 0 now. */
+	common_prefix = max_common_dir_prefix(path, data->dir.buf);
 
 	/*
-	 * If path starts with directory (which we already lstat'ed and found),
-	 * then no need to lstat parent directory again.
+	 * At this point, 'path' and 'data->dir' have a common existing parent
+	 * directory given by path[0..common_prefix] (which could have length 0).
+	 * We "grow" the data->dir buffer by checking for existing directories
+	 * along 'path'.
 	 */
-	if (data->dir_found && data->dir.buf &&
-	    memcmp(path, data->dir.buf, data->dir.len))
-		return 0;
 
-	/* Free previous dirname, and cache path's dirname */
-	strbuf_reset(&data->dir);
-	strbuf_add(&data->dir, path, newdir - path + 1);
+	strbuf_setlen(&data->dir, common_prefix);
+	while (1) {
+		/* Find the next directory in 'path'. */
+		const char *next_slash = strchr(path + data->dir.len, '/');
 
-	data->lstat_count++;
-	data->dir_found = !lstat(data->dir.buf, &st);
+		/*
+		 * If there are no more slashes, then 'path' doesn't contain a
+		 * non-existent _parent_ directory. Set 'data->dir' to be equal
+		 * to 'path' plus an additional slash, so it can be used for
+		 * caching in the future. The filename of 'path' is considered
+		 * a non-existent directory.
+		 *
+		 * Note: if "{path}/" exists as a directory, then it will never
+		 * appear as a prefix of other callers to this method, assuming
+		 * the context from the clear_skip_worktree... methods. If this
+		 * method is reused, then this must be reconsidered.
+		 */
+		if (!next_slash) {
+			strbuf_addstr(&data->dir, path + data->dir.len);
+			strbuf_addch(&data->dir, '/');
+			break;
+		}
 
-	return 0;
+		/*
+		 * Now that we have a slash, let's grow 'data->dir' to include
+		 * this slash, then test if we should stop.
+		 */
+		strbuf_add(&data->dir, path + data->dir.len,
+			   (next_slash - path) - data->dir.len + 1);
+
+		/* If the path doesn't exist, then stop here. */
+		data->lstat_count++;
+		if (lstat(data->dir.buf, &st))
+			return 0;
+	}
+
+	/*
+	 * At this point, 'data->dir' is equal to 'path' plus a slash character,
+	 * and the parent directory of 'path' definitely exists. Let's return
+	 * the case of whether 'path' exists.
+	 */
+
+	data->lstat_count++;
+	return !lstat(path, &st);
 }
 
 static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)

[5/5] sparse-index: improve lstat caching of sparse paths

Commit Message

Comments

Patch