[v3,01/20] sparse-index: design doc and format update

Message ID 62ac13945bec13270e0898126756c3f947ae264b.1615912983.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Sparse Index: Design, Format, Tests

Commit Message

Derrick Stolee March 16, 2021, 4:42 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

Comments

Junio C Hamano March 19, 2021, 11:43 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.

This compromise makes sense.

In the past, we often dreamed of recording trees in the index
(instead of using a bolted on extension like cache-tree, treating
trees as first-class citizens) and lazily expanding it only when the
user starts modifying the paths within the subdirectory.

But such an optimization never materialized, as the dual and
conflicting nature of the index to keep track of the contents for
the "next" commit (for which it is sufficient to just record trees
for parts that have not been modified) and to cache stat information
to detect which working tree paths may possibly have modifications
(for which, we used the one-entry-per-path nature of the cache
entries so far) was never resolved.

But if we limit the use of trees-in-index for sparse/cone checkout
case, we do not even have to worry about having to cache the stat
information for those paths that we are not going to populate in the
working tree at all.  It is a great simplification of the problem.

> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +

Why leading two 0's?  At the tree object level, we do not 0-pad blob
mode word, and if you are writing for C programmers, you need only
one '0' prefix to signal that it is in octal (in the on-disk index
file, the blob mode word is stored in a be16 word).

> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..aa116406a016
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,173 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:

s/worktree/working tree/; The former is the thing the "git worktree"
command deals with.  The latter is relevant even when "git worktree"
is not used (the traditional "git clone and you get a working tree
to work in").

> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone?
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.

OK.

> +These dimensions are also ordered by how expensive they are per item: it
> +is more expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.

This is a bit too dense to grok.  Among Populated, there are some
Modified but it takes lstat(2) per path or fsmonitor listening to
inotify to know which ones are in the Modified set.  Is that the
"expensive" you are referring to here?  I am not sure how you
compared the cost to know if a path is modified or merely populated
with the cost of "write one that we know must be populated" (which I
take as "given a populated file, make modification to it").  Also it
is unclear what you mean by "changing HEAD only require updating the
index".  Certainly when "git switch" flips HEAD from one commit to
another, you'd update the index and update the files in the working
tree (in the Populated part that is in the sparse-checkout cone) to
match, no?

> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.

OK.

> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.

;-)

> +At time of writing, sparse-directory entries violate expectations about the
> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files.

True.

> In addition, they expect to see all files at `HEAD`.

It is not clear to me what this means.  After "git add", "git
ls-files" would expect to see a file that may not even be in HEAD.
After "git rm", it would expect to see some file missing from the
set of paths in HEAD.  While I do not think that is what you meant
here, it is hard to guess what you wanted to say.

> One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.

Besides, that would leave in-core index fully populated, so I would
suspect that you'd lose a lot of benefit that comes from having to
keep much fewer entries in the in-core index than what is in HEAD.
It would be nice for "git diff-index --cached" (which is part of
"git status") to be able to skip a single "tree" entry in the sparse
index as "known to be untouched", than skipping thousands of paths
in that single subdirectory (in a mega monorepo project) as "these
are marked with SKIP_WORKTREE so ignore what is in the working tree".

> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.

OK.

> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.

IOW, before they iterate over the in-core index, tree entries are expanded
into a bunch of individual entries with SKIP_WORKTREE bit?  Makes sense.

> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticeable change in behavior will be that the serialized index file
> +contains sparse-directory entries.

Hmph, do you mean that the expansion is done by not replacing each
"tree" entry with blob entries for the contents of the directory,
but the original "tree" entry is still left in the in-core index?
It is not immediately clear what we are trying to gain by leaving it
in, but let's read on.  Perhaps we can get rid of cache-tree
extension and replace its use with these "tree" entries whose
content paths are populated in the index?

> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.

OK.

> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.

It is unclear why "we can begin leaving"; an iterator that only
expects to see blobs would need to be updated to skip them, too, no?
They would probably be already skipping blob entries that are marked
with the SKIP_WORKTREE bit, so it may be just a matter of skipping
more things than the current code.

Or did I misread the design presented earlier, and when a directory
that is outside the cone is expanded into the paths of blobs in the
directory, the "tree" entry is removed from the in-core index?

> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.

OK.

> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.

;-)

Derrick Stolee March 23, 2021, 11:16 a.m. UTC | #2
On 3/19/2021 7:43 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This begins a long effort to update the index format to allow sparse
>> directory entries. This should result in a significant improvement to
>> Git commands when HEAD contains millions of files, but the user has
>> selected many fewer files to keep in their sparse-checkout definition.
> 
> This compromise makes sense.
> 
> In the past, we often dreamed of recording trees in the index
> (instead of using a bolted on extension like cache-tree, treating
> trees as first-class citizens) and lazily expanding it only when the
> user starts modifying the paths within the subdirectory.
> 
> But such an optimization never materialized, as the dual and
> conflicting nature of the index to keep track of the contents for
> the "next" commit (for which it is sufficient to just record trees
> for parts that have not been modified) and to cache stat information
> to detect which working tree paths may possibly have modifications
> (for which, we used the one-entry-per-path nature of the cache
> entries so far) was never resolved.
> 
> But if we limit the use of trees-in-index for sparse/cone checkout
> case, we do not even have to worry about having to cache the stat
> information for those paths that we are not going to populate in the
> working tree at all.  It is a great simplification of the problem.

Thanks. I appreciate your input here.
 
>> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
>> +  the path ends in a directory separator.
>> +
> 
> Why leading two 0's?  At the tree object level, we do not 0-pad blob
> mode word, and if you are writing for C programmers, you need only
> one '0' prefix to signal that it is in octal (in the on-disk index
> file, the blob mode word is stored in a be16 word).

Fixed.

>> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
>> new file mode 100644
>> index 000000000000..aa116406a016
>> --- /dev/null
>> +++ b/Documentation/technical/sparse-index.txt
>> @@ -0,0 +1,173 @@
>> +Git Sparse-Index Design Document
>> +================================
>> +
>> +The sparse-checkout feature allows users to focus a working directory on
>> +a subset of the files at HEAD. The cone mode patterns, enabled by
>> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
>> +discover which files at HEAD belong in the sparse-checkout cone.
>> +
>> +Three important scale dimensions for a Git worktree are:
> 
> s/worktree/working tree/; The former is the thing the "git worktree"
> command deals with.  The latter is relevant even when "git worktree"
> is not used (the traditional "git clone and you get a working tree
> to work in").

I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
directory" is more specific and hence better.

>> +These dimensions are also ordered by how expensive they are per item: it
>> +is more expensive to detect a modified file than it is to write one that we
>> +know must be populated; changing `HEAD` only really requires updating the
>> +index.
> 
> This is a bit too dense to grok.  Among Populated, there are some
> Modified but it takes lstat(2) per path or fsmonitor listening to
> inotify to know which ones are in the Modified set.  Is that the
> "expensive" you are referring to here?  I am not sure how you
> compared the cost to know if a path is modified or merely populated
> with the cost of "write one that we know must be populated" (which I
> take as "given a populated file, make modification to it"). 

I could rearrange things here. The important things to note are:

1. Updating index entries is very fast, but adds up at large scale.

2. It is faster to write a file to disk from Git's object database
   than it is to compare a file on disk to the copy in the database,
   which is frequently necessary when the mtime on disk doesn't match
   the mtime in the index.

> Also it
> is unclear what you mean by "changing HEAD only require updating the
> index".  Certainly when "git switch" flips HEAD from one commit to
> another, you'd update the index and update the files in the working
> tree (in the Populated part that is in the sparse-checkout cone) to
> match, no?

This was unclear of me. I was thinking more along the lines of "git reset"
(soft mode) which updates HEAD without changing the files on disk.

After all of this postulating, I think that the offending sentences
are better off deleted. They don't add clarity over what can be
inferred by an interested reader.

>> In addition, they expect to see all files at `HEAD`.
> 
> It is not clear to me what this means.  After "git add", "git
> ls-files" would expect to see a file that may not even be in HEAD.
> After "git rm", it would expect to see some file missing from the
> set of paths in HEAD.  While I do not think that is what you meant
> here, it is hard to guess what you wanted to say.

I'm mixing terms incorrectly. I think what I really mean is

  In fact, these loops expect to see a reference to every
  staged file.

>> One
>> +way to handle this is to parse trees to replace a sparse-directory entry
>> +with all of the files within that tree as the index is loaded. However,
>> +parsing trees is slower than parsing the index format, so that is a slower
>> +operation than if we left the index alone.
> 
> Besides, that would leave in-core index fully populated, so I would
> suspect that you'd lose a lot of benefit that comes from having to
> keep much fewer entries in the in-core index than what is in HEAD.
> It would be nice for "git diff-index --cached" (which is part of
> "git status") to be able to skip a single "tree" entry in the sparse
> index as "known to be untouched", than skipping thousands of paths
> in that single subdirectory (in a mega monorepo project) as "these
> are marked with SKIP_WORKTREE so ignore what is in the working tree".

Absolutely! I'm burying the lead here, so I should get to the real
point by adding this to the end:

 The plan is to make all of these integrations "sparse aware" so
 this expansion through tree parsing is unnecessary and they use
 fewer resources than when using a full index.

>> +Phase I: Format and initial speedups
>> +------------------------------------
>> +
>> +During this phase, Git learns to enable the sparse-index and safely parse
>> +one. Protections are put in place so that every consumer of the in-memory
>> +data structure can operate with its current assumption of every file at
>> +`HEAD`.
> 
> IOW, before they iterate over the in-core index, tree entries are expanded
> into a bunch of individual entries with SKIP_WORKTREE bit?  Makes sense.
> 
>> +At first, every index parse will expand the sparse-directory entries into
>> +the full list of paths at `HEAD`. This will be slower in all cases. The
>> +only noticeable change in behavior will be that the serialized index file
>> +contains sparse-directory entries.
> 
> Hmph, do you mean that the expansion is done by not replacing each
> "tree" entry with blob entries for the contents of the directory,
> but the original "tree" entry is still left in the in-core index?

What I meant by "serialized index file" is that the file written to disk has
the sparse directory entries, but the in-core copy will not (except for
a very brief moment in time, during do_read_index()).

The intention at this point in time is that all code behaves identically
to the full index case, except that the index file itself is smaller due
to these sparse directory entries.

> It is not immediately clear what we are trying to gain by leaving it
> in, but let's read on.  Perhaps we can get rid of cache-tree
> extension and replace its use with these "tree" entries whose
> content paths are populated in the index?

This is an interesting idea, but not one I plan to pursue with this work.

>> +Next, consumers of the index will be guarded against operating on a
>> +sparse-index by inserting calls to `ensure_full_index()` or
>> +`expand_index_to_path()`. After these guards are in place, we can begin
>> +leaving sparse-directory entries in the in-memory index structure.
> 
> It is unclear why "we can begin leaving"; an iterator that only
> expects to see blobs would need to be updated to skip them, too, no?
> They would probably be already skipping blob entries that are marked
> with the SKIP_WORKTREE bit, so it may be just a matter of skipping
> more things than the current code.
> 
> Or did I misread the design presented earlier, and when a directory
> that is outside the cone is expanded into the paths of blobs in the
> directory, the "tree" entry is removed from the in-core index?

I will make this more explicit.
 
Thanks for your help improving this doc! Hopefully the plan is a
little more clear, now.

-Stolee

Junio C Hamano March 23, 2021, 8:10 p.m. UTC | #3
Derrick Stolee <stolee@gmail.com> writes:

>>> +Three important scale dimensions for a Git worktree are:
>> 
>> s/worktree/working tree/; The former is the thing the "git worktree"
>> command deals with.  The latter is relevant even when "git worktree"
>> is not used (the traditional "git clone and you get a working tree
>> to work in").
>
> I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
> directory" is more specific and hence better.

Since the user's current working directory can be outside any
working tree that is governed by any git repository, "working
directory" is a term I try to avoid when describing the directory
where a checkout of a revision lives.

Documentation/glossary-content.txt is where the suggestion for
"working tree" comes from.

> I could rearrange things here. The important things to note are:
>
> 1. Updating index entries is very fast, but adds up at large scale.

This is the "checkout to match the index to the tree of HEAD" part,
ignoring the cost of writing working tree files out?

> 2. It is faster to write a file to disk from Git's object database
>    than it is to compare a file on disk to the copy in the database,
>    which is frequently necessary when the mtime on disk doesn't match
>    the mtime in the index.

True.  But of course, not having to do either (i.e. having a fresh
cached stat info) would be even faster ;-).

>> Also it
>> is unclear what you mean by "changing HEAD only require updating the
>> index".  Certainly when "git switch" flips HEAD from one commit to
>> another, you'd update the index and update the files in the working
>> tree (in the Populated part that is in the sparse-checkout cone) to
>> match, no?
>
> This was unclear of me. I was thinking more along the lines of "git reset"
> (soft mode) which updates HEAD without changing the files on disk.

OK, and that is in line with your "updating index entries is very
fast (but adds up)".

> After all of this postulating, I think that the offending sentences
> are better off deleted. They don't add clarity over what can be
> inferred by an interested reader.

OK.

> I'm mixing terms incorrectly. I think what I really mean is
>
>   In fact, these loops expect to see a reference to every
>   staged file.

OK.

>  The plan is to make all of these integrations "sparse aware" so
>  this expansion through tree parsing is unnecessary and they use
>  fewer resources than when using a full index.

;-)

> What I meant by "serialized index file" is that the file written to disk has
> the sparse directory entries, but the in-core copy will not (except for
> a very brief moment in time, during do_read_index()).

Nice.  That would probably mean cache-tree extension on-disk can go
away, because we can populate in-core cache-tree from these entries.
I've always hated the on-disk encoding of that extension.

Or we are not doing this "extra tree" everywhere (i.e. limited only
to the parts that are marked for "sparse checkout")?

Thanks.

Derrick Stolee March 23, 2021, 8:42 p.m. UTC | #4
On 3/23/2021 4:10 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>>> +Three important scale dimensions for a Git worktree are:
>>>
>>> s/worktree/working tree/; The former is the thing the "git worktree"
>>> command deals with.  The latter is relevant even when "git worktree"
>>> is not used (the traditional "git clone and you get a working tree
>>> to work in").
>>
>> I guess I'm distracted by using SKIP_WORKTREE a lot, but "working
>> directory" is more specific and hence better.
> 
> Since the user's current working directory can be outside any
> working tree that is governed by any git repository, "working
> directory" is a term I try to avoid when describing the directory
> where a checkout of a revision lives.
> 
> Documentation/glossary-content.txt is where the suggestion for
> "working tree" comes from.

Whoops. Somehow I read that wrong. Thanks for pointing out my error.

>> What I meant by "serialized index file" is that the file written to disk has
>> the sparse directory entries, but the in-core copy will not (except for
>> a very brief moment in time, during do_read_index()).
> 
> Nice.  That would probably mean cache-tree extension on-disk can go
> away, because we can populate in-core cache-tree from these entries.
> I've always hated the on-disk encoding of that extension.
> 
> Or we are not doing this "extra tree" everywhere (i.e. limited only
> to the parts that are marked for "sparse checkout")?

The current design is to only have these entries when all paths
within the directory are marked with SKIP_WORKTREE. This pairs
with the cache-tree extension, which has these directories as
nodes, but only consuming one cache entry (for itself).

I haven't considered the idea of inserting trees for other
reasons. Seems like a valuable experiment.

Thanks,
-Stolee
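
For readers following the thread, the rule stated in the last reply (a
sparse-directory entry exists only when every path below the directory is
marked SKIP_WORKTREE) can be illustrated with a small Python model. This is
a sketch under stated assumptions, not Git's implementation: the `Entry`
type and the `parent_dirs` and `collapse` helpers are hypothetical names
invented here, and the model leaves out the tree object ID that a real
sparse-directory entry references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for an index entry (not Git's cache_entry)."""
    path: str              # sparse-directory paths end with "/"
    skip_worktree: bool
    is_dir: bool = False


def parent_dirs(path):
    """Yield every ancestor directory of path, topmost first: "a/", "a/b/"."""
    parts = path.split("/")[:-1]
    for i in range(1, len(parts) + 1):
        yield "/".join(parts[:i]) + "/"


def collapse(entries):
    """Replace fully skipped directories with a single directory entry."""
    # A directory qualifies only if *every* file below it is SKIP_WORKTREE.
    all_skipped = {}
    for e in entries:
        for d in parent_dirs(e.path):
            all_skipped[d] = all_skipped.get(d, True) and e.skip_worktree

    result, emitted = [], set()
    for e in entries:
        # Pick the topmost qualifying ancestor, if there is one.
        top = next((d for d in parent_dirs(e.path) if all_skipped[d]), None)
        if top is None:
            result.append(e)                 # keep the file entry as-is
        elif top not in emitted:
            emitted.add(top)                 # one entry per collapsed directory
            result.append(Entry(top, True, is_dir=True))
    return result


full_index = [
    Entry("README", False),
    Entry("docs/a.txt", True),
    Entry("docs/img/b.png", True),
    Entry("src/main.c", False),
    Entry("src/vendor/x.c", True),
    Entry("src/vendor/y.c", True),
]
sparse_index = collapse(full_index)
print([e.path for e in sparse_index])
```

In this model `src/` is not collapsed because `src/main.c` is populated,
while `src/vendor/` and `docs/` each shrink to one entry, which is where
the O(`HEAD`) to O(Populated) reduction comes from.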

Patch

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index d363a71c37ec..cc548eaa0e97 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@  Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..aa116406a016
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,173 @@ 
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is more expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticeable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
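
The Phase I guard described in the design document (expand every
sparse-directory entry before a consumer that expects only files iterates
the index) can be sketched with a small Python model. Everything here is an
assumption made for illustration: `TREES` is a hypothetical stand-in for
the object database, and `expand_dir`, `ensure_full_index_model` and
`read_index_model` are invented names that only mimic the roles the
document gives to `ensure_full_index()` and `command_requires_full_index`.
It is not Git's C implementation.

```python
# Hypothetical object database: tree path -> child names.
# A child name ending in "/" is a subtree; anything else is a blob.
TREES = {
    "docs/": ["a.txt", "img/"],
    "docs/img/": ["b.png"],
    "src/vendor/": ["x.c", "y.c"],
}


def expand_dir(path, trees):
    """Recursively list every file path below the tree stored at `path`."""
    files = []
    for name in trees[path]:
        child = path + name
        if name.endswith("/"):
            files.extend(expand_dir(child, trees))   # descend into subtree
        else:
            files.append(child)
    return files


def ensure_full_index_model(index, trees):
    """Replace each sparse-directory entry with SKIP_WORKTREE file entries."""
    full = []
    for path, skip_worktree in index:
        if path.endswith("/"):
            # Files outside the cone are never populated, so they are skipped.
            full.extend((p, True) for p in expand_dir(path, trees))
        else:
            full.append((path, skip_worktree))
    return full


def read_index_model(index, trees, command_requires_full_index=True):
    """Expand by default; a sparse-aware command opts out of the expansion."""
    if command_requires_full_index:
        return ensure_full_index_model(index, trees)
    return list(index)


sparse = [
    ("README", False),
    ("docs/", True),
    ("src/main.c", False),
    ("src/vendor/", True),
]
print(len(read_index_model(sparse, TREES)))          # expanded view
print(len(read_index_model(sparse, TREES, False)))   # sparse-aware view
```

The default of expanding mirrors the document's plan of keeping the setting
on and disabling it one builtin at a time: a command that has not been
audited sees the same entries it would see with a full index, at the cost
of walking the trees.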