mbox series

[00/27,RFC] Sparse Index

Message ID pull.847.git.1611596533.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse Index | expand

Message

ZheNing Hu via GitGitGadget Jan. 25, 2021, 5:41 p.m. UTC
This is based on ds/more-index-cleanups also available as GitGitGadget PR
#839.

The sparse checkout feature allows users to specify a "populated set" that
is smaller than the full list of files at HEAD. Files outside the sparse
checkout definition are not present in the working directory, but are still
present in the index (and marked with the CE_SKIP_WORKTREE bit).

This means that the working directory has size O(populated), and commands
like "git status" or "git checkout" operate using an O(populated) number of
filesystem operations. However, parsing the index still operates on the
scale of O(HEAD).

This can be particularly jarring if you are merging a small repository with
a large monorepo for the purpose of simplifying dependency management. Even
if users have nothing more in their working directory than they had before,
they suddenly see a significant increase in their "git status" or "git add"
times. In these cases, simply parsing the index can be a huge portion of the
command time.

This RFC proposes an update to the index formats to allow "sparse directory
entries". These entries correspond to directories that are completely
excluded from the sparse checkout definition. We can detect that a directory
is excluded when using "cone mode" patterns.

Since having directory entries is a radical departure from the existing
index format, a new extension "extensions.sparseIndex" is added. Using a
sparse index should cause incompatible tools to fail because they do not
understand this extension.

The index is a critical data structure, so making such a drastic change must
be handled carefully. This RFC does only enough adjustments to demonstrate
performance improvements for "git status" and "git add." Other commands
should operate identically to before, since the other commands will expand a
sparse index into a full index by parsing trees.

WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
tests. I'm not sure what is causing these failures, but I will explore while
we discuss the possibility of the feature as a whole.

Here is an outline for this RFC:

 * Patches 1-14: create and test the sparse index format. This is just
   enough to start writing the format, but all Git commands become slower
   for using it. This is because everything is guarded to expand to a full
   index before actually operating on the cache entries.

 * Patch 15: This massive patch is actually a bunch of patches squashed
   together. I have a branch that adds "ensure_full_index()" guards in each
   necessary file along with some commentary about how the index is being
   used. This patch is presented here as one big dump because that
   commentary isn't particularly interesting if the RFC leads to a very
   different approach.

 * Patches 16-27: These changes make enough code "sparse aware" such that
   "git status" and "git add" start operating in time O(populated) instead
   of O(HEAD).

Performance numbers are given in patch 27, but repeated somewhat here. The
test environment I use has ~2.1 million paths at HEAD, but only 68,000
populated paths given the sparse-checkout I'm using. The sparse index has
about 2,000 sparse directory entries.

 1. Use the full index. The index size is ~186 MB.
 2. Use the sparse index. The index size is ~5.5 MB.
 3. Use a commit where HEAD matches the populated set. The full index size
    is ~5.3MB.

The third benchmark is included as a theoretical optimum for a repository of
the same object database.

First, a clean 'git status' improves from 3.1s to 240ms.

Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s
[User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10
runs

Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms
[User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13
runs

Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms
[User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15
runs

The performance numbers for 'git add .' are much closer to optimal:

Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s
[User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10
runs

Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms
[User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13
runs

Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms
[User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14
runs

I expect that making a sparse index work optimally through the most common
Git commands will take a year of effort. During this process, I expect to
add a lot of testing infrastructure around the sparse-checkout feature,
especially in corner cases. (This RFC focuses on the happy paths of
operating only within the sparse cone, but that will change in the future.)

If this general approach is acceptable, then I would follow it with a
sequence of patch submissions that follow this approach:

 1. Basics of the format. (Patches 1-14)
 2. Add additional guards around index interactions (Patch 15, but split
    appropriately.)
 3. Speed up "git status" and "git add" (Patches 16-27)

After those three items that are represented in this RFC, the work starts to
parallelize a bit. My basic ideas for moving forward from this point are to
do these basic steps:

 * Add new index API abstractions where appropriate, make them sparse-aware.
 * Add new tests around sparse-checkout corner cases. Ensure the sparse
   index works correctly.
 * For a given builtin, add extra testing for sparse-checkouts then it them
   sparse-aware.

Here are some specific questions I'm hoping to answer in this RFC period:

 1. Are these sparse directory entries an appropriate way to extend the
    index format?
 2. Is extensions.sparseIndex a good way to signal that these entries can
    exist?
 3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
    toggle the format?
 4. Are there specific areas that I should target to harden the index API
    before I submit this work?
 5. Does anyone have a good idea how to test a large portion of the test
    suite with sparse-index enabled? The problem I see is that most tests
    don't use sparse-checkout, so the sparse index is identical to the full
    index. Would it be interesting to enable the test setup to add such
    "extra" directories during the test setup?

Thanks, -Stolee

Derrick Stolee (27):
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: read-cache --table --no-stat
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  [RFC-VERSION] *: ensure full index
  unpack-trees: make sparse aware
  dir.c: accept a directory as part of cone-mode patterns
  status: use sparse-index throughout
  status: skip sparse-checkout percentage with sparse-index
  sparse-index: expand_to_path() trivial implementation
  sparse-index: expand_to_path no-op if path exists
  add: allow operating on a sparse-only index
  submodule: die_path_inside_submodule is sparse aware
  dir: use expand_to_path in add_patterns()
  fsmonitor: disable if index is sparse
  pathspec: stop calling ensure_full_index
  cache-tree: integrate with sparse directory entries

 Documentation/config/extensions.txt      |   7 +
 Documentation/git-sparse-checkout.txt    |  14 +
 Makefile                                 |   1 +
 apply.c                                  |  10 +-
 blame.c                                  |   7 +-
 builtin/add.c                            |   3 +
 builtin/checkout-index.c                 |   5 +-
 builtin/commit.c                         |   3 +
 builtin/grep.c                           |   2 +
 builtin/ls-files.c                       |   9 +-
 builtin/merge-index.c                    |   2 +
 builtin/mv.c                             |   2 +
 builtin/rm.c                             |   2 +
 builtin/sparse-checkout.c                |  35 ++-
 builtin/update-index.c                   |   2 +
 cache-tree.c                             |  21 ++
 cache.h                                  |  15 +-
 diff.c                                   |   2 +
 dir.c                                    |  19 +-
 dir.h                                    |   2 +-
 entry.c                                  |   2 +
 fsmonitor.c                              |  18 +-
 merge-recursive.c                        |  22 +-
 name-hash.c                              |  10 +
 pathspec.c                               |   4 +-
 pathspec.h                               |   4 +-
 preload-index.c                          |   2 +
 read-cache.c                             |  51 +++-
 repo-settings.c                          |  15 +
 repository.c                             |  11 +-
 repository.h                             |   3 +
 rerere.c                                 |   2 +
 resolve-undo.c                           |   6 +
 setup.c                                  |   3 +
 sha1-name.c                              |   3 +
 sparse-index.c                           | 360 +++++++++++++++++++++++
 sparse-index.h                           |  23 ++
 split-index.c                            |   2 +
 submodule.c                              |  22 +-
 submodule.h                              |   6 +-
 t/helper/test-read-cache.c               |  77 ++++-
 t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
 tree.c                                   |   2 +
 unpack-trees.c                           |  40 ++-
 wt-status.c                              |  21 +-
 wt-status.h                              |   1 +
 46 files changed, 942 insertions(+), 61 deletions(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h


base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/847

Comments

Junio C Hamano Jan. 25, 2021, 8:10 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.

Yay.

> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.

Safety is good, but because the index is purely a local matter, we
do not have to be so careful as updating the network protocols or
pack/object formats.

I think the use of "extensions.*" mechanism to render the repository
that uses the new feature unusable by older Git is safe enough, but
it may be too draconian.  For example, when things go wrong, don't
you want to "fetch"/"clone" from it into another repository to first
save the objects and refs?  You do not need a version of the index
file you understand in order to do that.

The index format has a mechanism to make older versions of Git bail
when it encounters a file that uses newer feature that they do not
understand.  Perhaps using it is sufficient instead?
Derrick Stolee Jan. 25, 2021, 9:18 p.m. UTC | #2
On 1/25/21 3:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This RFC proposes an update to the index formats to allow "sparse directory
>> entries". These entries correspond to directories that are completely
>> excluded from the sparse checkout definition. We can detect that a directory
>> is excluded when using "cone mode" patterns.
> 
> Yay.
> 
>> Since having directory entries is a radical departure from the existing
>> index format, a new extension "extensions.sparseIndex" is added. Using a
>> sparse index should cause incompatible tools to fail because they do not
>> understand this extension.
> 
> Safety is good, but because the index is purely a local matter, we
> do not have to be so careful as updating the network protocols or
> pack/object formats.
> 
> I think the use of "extensions.*" mechanism to render the repository
> that uses the new feature unusable by older Git is safe enough, but
> it may be too draconian.  For example, when things go wrong, don't
> you want to "fetch"/"clone" from it into another repository to first
> save the objects and refs?  You do not need a version of the index
> file you understand in order to do that.
> 
> The index format has a mechanism to make older versions of Git bail
> when it encounters a file that uses newer feature that they do not
> understand.  Perhaps using it is sufficient instead?

There are interesting subtleties with the differences between index
formats 2, 3, and 4 that are worth keeping around. Perhaps the
extension could be a mechanism for allowing sparse directories in
those versions, but then a future "index version 5" includes sparse
directories without the extension.

I could spend time working on such an index v5 in parallel with
the updates to Git commands to make them sparse aware. The logic
from patches 1-14 in this series will be required before that could
begin.

Thanks,
-Stolee
Elijah Newren Feb. 2, 2021, 3:11 a.m. UTC | #3
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/more-index-cleanups also available as GitGitGadget PR
> #839.
>
> The sparse checkout feature allows users to specify a "populated set" that
> is smaller than the full list of files at HEAD. Files outside the sparse
> checkout definition are not present in the working directory, but are still
> present in the index (and marked with the CE_SKIP_WORKTREE bit).
>
> This means that the working directory has size O(populated), and commands
> like "git status" or "git checkout" operate using an O(populated) number of
> filesystem operations. However, parsing the index still operates on the
> scale of O(HEAD).
>
> This can be particularly jarring if you are merging a small repository with
> a large monorepo for the purpose of simplifying dependency management. Even
> if users have nothing more in their working directory than they had before,
> they suddenly see a significant increase in their "git status" or "git add"
> times. In these cases, simply parsing the index can be a huge portion of the
> command time.
>
> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.
>
> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.
>
> The index is a critical data structure, so making such a drastic change must
> be handled carefully. This RFC does only enough adjustments to demonstrate
> performance improvements for "git status" and "git add." Other commands
> should operate identically to before, since the other commands will expand a
> sparse index into a full index by parsing trees.
>
> WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
> tests. I'm not sure what is causing these failures, but I will explore while
> we discuss the possibility of the feature as a whole.
>
> Here is an outline for this RFC:
>
>  * Patches 1-14: create and test the sparse index format. This is just
>    enough to start writing the format, but all Git commands become slower
>    for using it. This is because everything is guarded to expand to a full
>    index before actually operating on the cache entries.
>
>  * Patch 15: This massive patch is actually a bunch of patches squashed
>    together. I have a branch that adds "ensure_full_index()" guards in each
>    necessary file along with some commentary about how the index is being
>    used. This patch is presented here as one big dump because that
>    commentary isn't particularly interesting if the RFC leads to a very
>    different approach.
>
>  * Patches 16-27: These changes make enough code "sparse aware" such that
>    "git status" and "git add" start operating in time O(populated) instead
>    of O(HEAD).
>
> Performance numbers are given in patch 27, but repeated somewhat here. The
> test environment I use has ~2.1 million paths at HEAD, but only 68,000
> populated paths given the sparse-checkout I'm using. The sparse index has
> about 2,000 sparse directory entries.
>
>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index size
>     is ~5.3MB.
>
> The third benchmark is included as a theoretical optimum for a repository of
> the same object database.
>
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s
> [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10
> runs
>
> Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms
> [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13
> runs
>
> Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms
> [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15
> runs
>
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s
> [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10
> runs
>
> Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms
> [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13
> runs
>
> Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms
> [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14
> runs
>
> I expect that making a sparse index work optimally through the most common
> Git commands will take a year of effort. During this process, I expect to
> add a lot of testing infrastructure around the sparse-checkout feature,
> especially in corner cases. (This RFC focuses on the happy paths of
> operating only within the sparse cone, but that will change in the future.)
>
> If this general approach is acceptable, then I would follow it with a
> sequence of patch submissions that follow this approach:
>
>  1. Basics of the format. (Patches 1-14)
>  2. Add additional guards around index interactions (Patch 15, but split
>     appropriately.)
>  3. Speed up "git status" and "git add" (Patches 16-27)
>
> After those three items that are represented in this RFC, the work starts to
> parallelize a bit. My basic ideas for moving forward from this point are to
> do these basic steps:
>
>  * Add new index API abstractions where appropriate, make them sparse-aware.
>  * Add new tests around sparse-checkout corner cases. Ensure the sparse
>    index works correctly.
>  * For a given builtin, add extra testing for sparse-checkouts then it them
>    sparse-aware.
>
> Here are some specific questions I'm hoping to answer in this RFC period:
>
>  1. Are these sparse directory entries an appropriate way to extend the
>     index format?
>  2. Is extensions.sparseIndex a good way to signal that these entries can
>     exist?
>  3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
>     toggle the format?
>  4. Are there specific areas that I should target to harden the index API
>     before I submit this work?
>  5. Does anyone have a good idea how to test a large portion of the test
>     suite with sparse-index enabled? The problem I see is that most tests
>     don't use sparse-checkout, so the sparse index is identical to the full
>     index. Would it be interesting to enable the test setup to add such
>     "extra" directories during the test setup?
>
> Thanks, -Stolee

Thanks for working on this.  It's very exciting seeing this idea come
alive.  I had lots of little nitpicks and questions and whatnot here
and there, but that almost feels like a diversion.  Overall, I think
you divided up the series in a very logical and easy to follow
fashion, and actually achieved quite a bit already.

I suspect I have partially answered some of your questions above among
all my comments, and left others unanswered or worse, just re-asked
the same question(s) myself.  Feel free to ping again with the next
round and I'll see if I can dodge your questions again...er, I mean,
try to think of something helpful to say.  :-)

>
> Derrick Stolee (27):
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: read-cache --table --no-stat
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   [RFC-VERSION] *: ensure full index
>   unpack-trees: make sparse aware
>   dir.c: accept a directory as part of cone-mode patterns
>   status: use sparse-index throughout
>   status: skip sparse-checkout percentage with sparse-index
>   sparse-index: expand_to_path() trivial implementation
>   sparse-index: expand_to_path no-op if path exists
>   add: allow operating on a sparse-only index
>   submodule: die_path_inside_submodule is sparse aware
>   dir: use expand_to_path in add_patterns()
>   fsmonitor: disable if index is sparse
>   pathspec: stop calling ensure_full_index
>   cache-tree: integrate with sparse directory entries
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 +
>  Makefile                                 |   1 +
>  apply.c                                  |  10 +-
>  blame.c                                  |   7 +-
>  builtin/add.c                            |   3 +
>  builtin/checkout-index.c                 |   5 +-
>  builtin/commit.c                         |   3 +
>  builtin/grep.c                           |   2 +
>  builtin/ls-files.c                       |   9 +-
>  builtin/merge-index.c                    |   2 +
>  builtin/mv.c                             |   2 +
>  builtin/rm.c                             |   2 +
>  builtin/sparse-checkout.c                |  35 ++-
>  builtin/update-index.c                   |   2 +
>  cache-tree.c                             |  21 ++
>  cache.h                                  |  15 +-
>  diff.c                                   |   2 +
>  dir.c                                    |  19 +-
>  dir.h                                    |   2 +-
>  entry.c                                  |   2 +
>  fsmonitor.c                              |  18 +-
>  merge-recursive.c                        |  22 +-
>  name-hash.c                              |  10 +
>  pathspec.c                               |   4 +-
>  pathspec.h                               |   4 +-
>  preload-index.c                          |   2 +
>  read-cache.c                             |  51 +++-
>  repo-settings.c                          |  15 +
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  rerere.c                                 |   2 +
>  resolve-undo.c                           |   6 +
>  setup.c                                  |   3 +
>  sha1-name.c                              |   3 +
>  sparse-index.c                           | 360 +++++++++++++++++++++++
>  sparse-index.h                           |  23 ++
>  split-index.c                            |   2 +
>  submodule.c                              |  22 +-
>  submodule.h                              |   6 +-
>  t/helper/test-read-cache.c               |  77 ++++-
>  t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
>  tree.c                                   |   2 +
>  unpack-trees.c                           |  40 ++-
>  wt-status.c                              |  21 +-
>  wt-status.h                              |   1 +
>  46 files changed, 942 insertions(+), 61 deletions(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
>
> base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/847
> --
> gitgitgadget