[6/8] index-format: update preamble to cached tree extension

Message ID fb9d5468184c4cbb3d80569f685743b9a5b45c8e.1609356414.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series Cleanups around index operations

Commit Message

Derrick Stolee Dec. 30, 2020, 7:26 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

I had difficulty in my efforts to learn about the cached tree extension
based on the documentation and code because I had an incorrect
assumption about how it behaved. This might be due to some ambiguity in
the documentation, so this change modifies the beginning of the cached
tree format by expanding the description of the feature.

My hope is that this documentation clarifies a few things:

1. There is an in-memory recursive tree structure that is constructed
   from the extension data. This structure has a few differences, such
   as where the name is stored.

2. What does it mean for an entry to be invalid?

3. When exactly are "new" trees created?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt | 36 ++++++++++++++++++++----
 1 file changed, 30 insertions(+), 6 deletions(-)

Comments

Elijah Newren Dec. 30, 2020, 8 p.m. UTC | #1
On Wed, Dec 30, 2020 at 11:26 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> I had difficulty in my efforts to learn about the cached tree extension
> based on the documentation and code because I had an incorrect
> assumption about how it behaved. This might be due to some ambiguity in
> the documentation, so this change modifies the beginning of the cached
> tree format by expanding the description of the feature.
>
> My hope is that this documentation clarifies a few things:
>
> 1. There is an in-memory recursive tree structure that is constructed
>    from the extension data. This structure has a few differences, such
>    as where the name is stored.
>
> 2. What does it mean for an entry to be invalid?
>
> 3. When exactly are "new" trees created?
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt | 36 ++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index 69edf46c031..c614e136e24 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -138,12 +138,36 @@ Git index format
>
>  === Cached tree
>
> -  Cached tree extension contains pre-computed hashes for trees that can
> -  be derived from the index. It helps speed up tree object generation
> -  from index for a new commit.
> -
> -  When a path is updated in index, the path must be invalidated and
> -  removed from tree cache.
> +  Since the index does not record entries for directories, the cache
> +  entries cannot describe tree objects that already exist in the object
> +  database for regions of the index that are unchanged from an existing
> +  commit. The cached tree extension stores a recursive tree structure that
> +  describes the trees that already exist and completely match sections of
> +  the cache entries. This speeds up tree object generation from the index
> +  for a new commit by only computing the trees that are "new" to that
> +  commit.
> +
> +  The recursive tree structure uses nodes that store a number of cache
> +  entries, a list of subnodes, and an object ID (OID). The OID references
> +  the existing tree for that node, if it is known to exist. The subnodes
> +  correspond to subdirectories that themselves have cached tree nodes. The
> +  number of cache entries corresponds to the number of cache entries in
> +  the index that describe paths within that tree's directory.
> +
> +  Note that the path for a given tree is part of the parent node in-memory
> +  but is part of the child in the file format. The root tree has an empty
> +  string for its name and its name does not exist in-memory.
> +
> +  When a path is updated in index, Git invalidates all nodes of the
> +  recursive cached tree corresponding to the parent directories of that
> +  path. We store these tree nodes as being "invalid" by using "-1" as the
> +  number of cache entries. To create trees corresponding to the current
> +  index, Git only walks the invalid tree nodes and uses the cached OIDs
> +  for the valid trees to construct new trees. In this way, Git only
> +  constructs trees on the order of the number of changed paths (and their
> +  depth in the working directory). This comes at a cost of tracking the
> +  full directory structure in the cached tree extension, but this is
> +  generally smaller than the full cache entry list in the index.

Ooh, I really like it; this probably would have helped me.  However,
we'll need to get someone else to take a look at this, because I don't
know enough to say whether any part of it is incorrect, misleading, or
incomplete or whether it's all good.  My knowledge in the area is
limited to moving a function from merge-recursive.c to cache-tree.c in
commit 724dd767b2 ("cache-tree: share code between functions writing
an index as a tree", 2019-08-17), but I seem to recall that I had to
rely on Junio's reviews and guidance to make the minor adaptations
found in that commit.
Patch

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index 69edf46c031..c614e136e24 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -138,12 +138,36 @@  Git index format
 
 === Cached tree
 
-  Cached tree extension contains pre-computed hashes for trees that can
-  be derived from the index. It helps speed up tree object generation
-  from index for a new commit.
-
-  When a path is updated in index, the path must be invalidated and
-  removed from tree cache.
+  Since the index does not record entries for directories, the cache
+  entries cannot describe tree objects that already exist in the object
+  database for regions of the index that are unchanged from an existing
+  commit. The cached tree extension stores a recursive tree structure that
+  describes the trees that already exist and completely match sections of
+  the cache entries. This speeds up tree object generation from the index
+  for a new commit by only computing the trees that are "new" to that
+  commit.
+
+  The recursive tree structure uses nodes that store a number of cache
+  entries, a list of subnodes, and an object ID (OID). The OID references
+  the existing tree for that node, if it is known to exist. The subnodes
+  correspond to subdirectories that themselves have cached tree nodes. The
+  number of cache entries corresponds to the number of cache entries in
+  the index that describe paths within that tree's directory.
+
+  Note that the path for a given tree is part of the parent node in-memory
+  but is part of the child in the file format. The root tree has an empty
+  string for its name and its name does not exist in-memory.
+
+  When a path is updated in index, Git invalidates all nodes of the
+  recursive cached tree corresponding to the parent directories of that
+  path. We store these tree nodes as being "invalid" by using "-1" as the
+  number of cache entries. To create trees corresponding to the current
+  index, Git only walks the invalid tree nodes and uses the cached OIDs
+  for the valid trees to construct new trees. In this way, Git only
+  constructs trees on the order of the number of changed paths (and their
+  depth in the working directory). This comes at a cost of tracking the
+  full directory structure in the cached tree extension, but this is
+  generally smaller than the full cache entry list in the index.
 
   The signature for this extension is { 'T', 'R', 'E', 'E' }.
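
To make the node structure and the invalidation rule described in the patch
concrete, here is a minimal Python model. It is only a sketch under stated
assumptions, not Git's implementation: the names CachedTreeNode,
invalidate_path and trees_to_rebuild are hypothetical, the OIDs are
placeholder strings, and the entry counts are made up for the example. The
subdirectory name lives in the parent's dictionary, which mirrors the
in-memory layout described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

INVALID = -1  # entry count that marks a node as invalid


@dataclass
class CachedTreeNode:
    """Hypothetical in-memory node; not Git's actual struct."""
    entry_count: int = INVALID
    oid: Optional[str] = None  # placeholder for a tree OID, None if unknown
    # Subdirectory name -> child node: the parent holds the child's name.
    subtrees: Dict[str, "CachedTreeNode"] = field(default_factory=dict)


def invalidate_path(root: CachedTreeNode, path: str) -> None:
    """Mark the node of every parent directory of `path` as invalid."""
    node = root
    directories = path.split("/")[:-1]  # drop the file name itself
    while True:
        node.entry_count = INVALID
        node.oid = None  # an invalid node carries no usable OID
        if not directories:
            return
        child = node.subtrees.get(directories.pop(0))
        if child is None:
            return  # no cached node for this directory, nothing more to mark
        node = child


def trees_to_rebuild(node: CachedTreeNode) -> int:
    """Count the nodes whose tree would have to be recomputed."""
    if node.entry_count != INVALID:
        return 0  # valid node: its cached OID can be reused as-is
    return 1 + sum(trees_to_rebuild(c) for c in node.subtrees.values())


# Example layout (assumed): 5 index entries in total, 3 of them under src/
# (1 of those under src/lib/) and 1 under docs/.
root = CachedTreeNode(5, "oid-root", {
    "src": CachedTreeNode(3, "oid-src", {"lib": CachedTreeNode(1, "oid-lib")}),
    "docs": CachedTreeNode(1, "oid-docs"),
})
invalidate_path(root, "src/lib/util.c")
print(trees_to_rebuild(root))  # prints 3: root, src and src/lib
```

In this model, updating src/lib/util.c invalidates the root, src and src/lib
nodes, while docs keeps its entry count and OID. That is the sense in which
the amount of tree construction follows the number of changed paths and
their depth rather than the size of the index.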