Message ID | 75b51483d3c7088d0cfae36544966672374c50f9.1609729758.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Cleanups around index operations | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Subject: Re: [PATCH v2 6/9] index-format: update preamble to cached tree extension By the way, the name of the extension is "cache tree". git grep -i 'cached[- ]tree' ':!po/' reveals there are a handful of mistakes already present but their number is dwarfed when we check: git grep -i 'cache tree' ':!po/' > I had difficulty in my efforts to learn about the cached tree extension > based on the documentation and code because I had an incorrect > assumption about how it behaved. This might be due to some ambiguity in > the documentation, so this change modifies the beginning of the cached > tree format by expanding the description of the feature. > > My hope is that this documentation clarifies a few things: > > 1. There is an in-memory recursive tree structure that is constructed > from the extension data. This structure has a few differences, such > as where the name is stored. > > 2. What does it mean for an entry to be invalid? > > 3. When exactly are "new" trees created? Thanks. Describing them is a worthy goal. > + Since the index does not record entries for directories, the cache > + entries cannot describe tree objects that already exist in the object > + database for regions of the index that are unchanged from an existing > + commit. The cached tree extension stores a recursive tree structure that > + describes the trees that already exist and completely match sections of > + the cache entries. This speeds up tree object generation from the index > + for a new commit by only computing the trees that are "new" to that > + commit. The original motivation was the above one. A cache of tree objects that correspond to unmodified part of the directory structure helps writing out a new tree out of modified index. We later found out that we rather often compare the index against the tree of HEAD (think: "git status"), and diff-lib.c::diff_cache() does take advantage of the fact that an entire directory can be skipped if the tree object taken from the HEAD side exactly matches the tree recorded for the subdirectory in the cache tree extension. > + The recursive tree structure uses nodes that store a number of cache > + entries, a list of subnodes, and an object ID (OID). The OID references > + the exising tree for that node, if it is known to exist. The subnodes > + correspond to subdirectories that themselves have cached tree nodes. The > + number of cache entries corresponds to the number of cache entries in > + the index that describe paths within that tree's directory. OK. > + Note that the path for a given tree is part of the parent node in-memory Sorry, I am not sure if I follow. The top-level in-core cache_tree object records the number of entries, tree object name for the entire tree (if valid), and cache_tree_sub structures, one for each subdirectory. Each of the cache_tree_sub structure describes the "child" directory, including the path to it. > + but is part of the child in the file format. The root tree has an empty > + string for its name and its name does not exist in-memory. It's more like we could have consistently used cache_tree_sub instances to represent each and every level (i.e. I consider that cache_tree_sub is what represents a directory, with cache_tree being a record of just one aspect of it) including the root of the hierarchy, but because there wasn't much point in giving a name to the root level, I cheated and avoided wasting a cache_tree_sub for it. So from that point of view, the path belongs to the node in each level in both in-core and on-disk representations. > + When a path is updated in index, Git invalidates all nodes of the > + recurisive cached tree corresponding to the parent directories of that > + path. We store these tree nodes as being "invalid" by using "-1" as the > + number of cache entries. Correct. > + To create trees corresponding to the current > + index, Git only walks the invalid tree nodes and uses the cached OIDs > + for the valid trees to construct new trees. I wonder if the above is sufficiently clear, or "Git only has to walk the spans of index entries that corresponds to the invalid trees, while reusing the ..." is too long and detailed. > + In this way, Git only > + constructs trees on the order of the number of changed paths (and their > + depth in the working directory). This comes at a cost of tracking the > + full directory structure in the cached tree extension, but this is > + generally smaller than the full cache entry list in the index. > > The signature for this extension is { 'T', 'R', 'E', 'E' }. Looks good. Thanks for working on this.
On 1/6/2021 9:10 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> Subject: Re: [PATCH v2 6/9] index-format: update preamble to cached tree extension > > By the way, the name of the extension is "cache tree". > > git grep -i 'cached[- ]tree' ':!po/' > > reveals there are a handful of mistakes already present but their > number is dwarfed when we check: > > git grep -i 'cache tree' ':!po/' I will fix my own additions and add a patch that fixes these mistakes. >> + Since the index does not record entries for directories, the cache >> + entries cannot describe tree objects that already exist in the object >> + database for regions of the index that are unchanged from an existing >> + commit. The cached tree extension stores a recursive tree structure that >> + describes the trees that already exist and completely match sections of >> + the cache entries. This speeds up tree object generation from the index >> + for a new commit by only computing the trees that are "new" to that >> + commit. > > The original motivation was the above one. A cache of tree objects > that correspond to unmodified part of the directory structure helps > writing out a new tree out of modified index. > > We later found out that we rather often compare the index against > the tree of HEAD (think: "git status"), and diff-lib.c::diff_cache() > does take advantage of the fact that an entire directory can be > skipped if the tree object taken from the HEAD side exactly matches > the tree recorded for the subdirectory in the cache tree extension. I need to read more about this. traverse_by_cache_tree() seems to be a good place to start. Thanks. >> + The recursive tree structure uses nodes that store a number of cache >> + entries, a list of subnodes, and an object ID (OID). The OID references >> + the exising tree for that node, if it is known to exist. The subnodes >> + correspond to subdirectories that themselves have cached tree nodes. The >> + number of cache entries corresponds to the number of cache entries in >> + the index that describe paths within that tree's directory. s/exising/existing/ > > OK. > >> + Note that the path for a given tree is part of the parent node in-memory > > Sorry, I am not sure if I follow. The top-level in-core cache_tree > object records the number of entries, tree object name for the > entire tree (if valid), and cache_tree_sub structures, one for each > subdirectory. Each of the cache_tree_sub structure describes the > "child" directory, including the path to it. > >> + but is part of the child in the file format. The root tree has an empty >> + string for its name and its name does not exist in-memory. > > It's more like we could have consistently used cache_tree_sub > instances to represent each and every level (i.e. I consider that > cache_tree_sub is what represents a directory, with cache_tree being > a record of just one aspect of it) including the root of the > hierarchy, but because there wasn't much point in giving a name to > the root level, I cheated and avoided wasting a cache_tree_sub for > it. So from that point of view, the path belongs to the node in > each level in both in-core and on-disk representations. That's a good point. I'll retract my statement here. >> + When a path is updated in index, Git invalidates all nodes of the >> + recurisive cached tree corresponding to the parent directories of that >> + path. We store these tree nodes as being "invalid" by using "-1" as the >> + number of cache entries. > > Correct. Making note of my s/recurisive/recursive/ typo here. >> + To create trees corresponding to the current >> + index, Git only walks the invalid tree nodes and uses the cached OIDs >> + for the valid trees to construct new trees. > > I wonder if the above is sufficiently clear, or "Git only has to > walk the spans of index entries that corresponds to the invalid > trees, while reusing the ..." is too long and detailed. I will try to simplify. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> We later found out that we rather often compare the index against >> the tree of HEAD (think: "git status"), and diff-lib.c::diff_cache() >> does take advantage of the fact that an entire directory can be >> skipped if the tree object taken from the HEAD side exactly matches >> the tree recorded for the subdirectory in the cache tree extension. > > I need to read more about this. traverse_by_cache_tree() seems to > be a good place to start. Thanks. Ah, that one I forgot about. What I had in mind was a different optimization that is far more aggressive in unpack-trees.c::unpack_callback(); look for a comment that begins with "Everything under the name matches". The block allows us to skip an entire subdirectory hierarchy.
Derrick Stolee <stolee@gmail.com> writes: >> We later found out that we rather often compare the index against >> the tree of HEAD (think: "git status"), and diff-lib.c::diff_cache() >> does take advantage of the fact that an entire directory can be >> skipped if the tree object taken from the HEAD side exactly matches >> the tree recorded for the subdirectory in the cache tree extension. > > I need to read more about this. traverse_by_cache_tree() seems to > be a good place to start. Thanks. Ah, that one I forgot about. What I had in mind was a different optimization that is far more aggressive in unpack-trees.c::unpack_callback(); look for a comment that begins with "Everything under the name matches". The block allows us to skip an entire subdirectory hierarchy.
diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index 69edf46c031..c614e136e24 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -138,12 +138,36 @@ Git index format === Cached tree - Cached tree extension contains pre-computed hashes for trees that can - be derived from the index. It helps speed up tree object generation - from index for a new commit. - - When a path is updated in index, the path must be invalidated and - removed from tree cache. + Since the index does not record entries for directories, the cache + entries cannot describe tree objects that already exist in the object + database for regions of the index that are unchanged from an existing + commit. The cached tree extension stores a recursive tree structure that + describes the trees that already exist and completely match sections of + the cache entries. This speeds up tree object generation from the index + for a new commit by only computing the trees that are "new" to that + commit. + + The recursive tree structure uses nodes that store a number of cache + entries, a list of subnodes, and an object ID (OID). The OID references + the exising tree for that node, if it is known to exist. The subnodes + correspond to subdirectories that themselves have cached tree nodes. The + number of cache entries corresponds to the number of cache entries in + the index that describe paths within that tree's directory. + + Note that the path for a given tree is part of the parent node in-memory + but is part of the child in the file format. The root tree has an empty + string for its name and its name does not exist in-memory. + + When a path is updated in index, Git invalidates all nodes of the + recurisive cached tree corresponding to the parent directories of that + path. We store these tree nodes as being "invalid" by using "-1" as the + number of cache entries. To create trees corresponding to the current + index, Git only walks the invalid tree nodes and uses the cached OIDs + for the valid trees to construct new trees. In this way, Git only + constructs trees on the order of the number of changed paths (and their + depth in the working directory). This comes at a cost of tracking the + full directory structure in the cached tree extension, but this is + generally smaller than the full cache entry list in the index. The signature for this extension is { 'T', 'R', 'E', 'E' }.