Message ID | 20210423031659.2362659-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cache-tree: avoid needless promisor object fetch | expand |
Hi Jonathan, On Thu, 22 Apr 2021, Jonathan Tan wrote: > In update_one() (used only by cache_tree_update()), there is an object > existence check that, if it fails, will automatically trigger a lazy > fetch in a partial clone. But the fetch is not necessary - the object is > not actually being used. I find it curious, though, that the `ce_missing_ok` variable is defined thusly (sadly, the context of your diff is too small to show it): ce_missing_ok = mode == S_IFGITLINK || missing_ok || (has_promisor_remote() && ce_skip_worktree(ce)); Which means that the `has_object_file()` function is only called if the entry is not marked with the `skip-worktree` bit, i.e. if it is _not_ excluded from the sparse checkout. Wouldn't that mean that the object _should_ be there? I guess what I am saying is that while the commit message focuses on the "What?" of the patch, I would love to hear more about the "Why?". And maybe the "When?" as in: when does this actually matter? And since the bug was critical enough for you to spend time on crafting it, maybe it would make sense to add a regression test to ensure that this bug does not creep in again? > > Replace that check with two checks: an object existence check that does > not fetch, and then a check that that object is a promisor object. This essentially repeats what the diff says, but it might make more sense to explain why the post-image of this diff is more correct (and maybe discuss performance implications). > > Doing this avoids multiple lazy fetches when merging two trees in a > partial clone, as noticed at $DAYJOB. Ah. But where are those trees fetched, then? Maybe lead with the description of the bug? > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > Another alternative is to think about whether the object existence check > here is needed in the first place. > > There might also be other places we can make a similar change in > update_one(), but I limited myself to what's needed to solve the > specific case we discovered at $DAYJOB. I only see another `has_object_file()` call site at the very beginning, and I think this needs to fetch. Or maybe it is more efficient to construct the cache tree from scratch than fetch it? There is also `cache_tree_fully_valid_1()`, where I think the same handling could potentially make sense. (Or, if you target `seen`, `cache_tree_fully_valid()`. Ciao, Johannes > --- > cache-tree.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/cache-tree.c b/cache-tree.c > index add1f07713..6728722597 100644 > --- a/cache-tree.c > +++ b/cache-tree.c > @@ -6,6 +6,7 @@ > #include "object-store.h" > #include "replace-object.h" > #include "promisor-remote.h" > +#include "packfile.h" > > #ifndef DEBUG_CACHE_TREE > #define DEBUG_CACHE_TREE 0 > @@ -362,7 +363,9 @@ static int update_one(struct cache_tree *it, > (has_promisor_remote() && > ce_skip_worktree(ce)); > if (is_null_oid(oid) || > - (!ce_missing_ok && !has_object_file(oid))) { > + (!ce_missing_ok && > + !has_object_file_with_flags(oid, OBJECT_INFO_SKIP_FETCH_OBJECT) && > + !is_promisor_object(oid))) { > strbuf_release(&buffer); > if (expected_missing) > return -1; > -- > 2.31.1.498.g6c1eba8ee3d-goog > >
> > In update_one() (used only by cache_tree_update()), there is an object > > existence check that, if it fails, will automatically trigger a lazy > > fetch in a partial clone. But the fetch is not necessary - the object is > > not actually being used. > > I find it curious, though, that the `ce_missing_ok` variable is defined > thusly (sadly, the context of your diff is too small to show it): > > ce_missing_ok = mode == S_IFGITLINK || missing_ok || > (has_promisor_remote() && > ce_skip_worktree(ce)); > > Which means that the `has_object_file()` function is only called if the > entry is not marked with the `skip-worktree` bit, i.e. if it is _not_ > excluded from the sparse checkout. > > Wouldn't that mean that the object _should_ be there? In a partial clone, probably not? > I guess what I am saying is that while the commit message focuses on the > "What?" of the patch, I would love to hear more about the "Why?". And > maybe the "When?" as in: when does this actually matter? In this case, that's something I'd like help in figuring out too. Normally this code path (unpack_trees()) prefetches everything through a call to check_updates(), but the update flag is somehow not set so there is no prefetching happening. > And since the bug was critical enough for you to spend time on crafting > it, maybe it would make sense to add a regression test to ensure that this > bug does not creep in again? OK. > > Replace that check with two checks: an object existence check that does > > not fetch, and then a check that that object is a promisor object. > > This essentially repeats what the diff says, but it might make more sense > to explain why the post-image of this diff is more correct (and maybe > discuss performance implications). OK - I think this is the "why" and "when" you described above. > > Doing this avoids multiple lazy fetches when merging two trees in a > > partial clone, as noticed at $DAYJOB. > > Ah. But where are those trees fetched, then? > > Maybe lead with the description of the bug? This was a partial clone excluding blobs only. I'll update the commit message to mention this detail. > > Another alternative is to think about whether the object existence check > > here is needed in the first place. > > > > There might also be other places we can make a similar change in > > update_one(), but I limited myself to what's needed to solve the > > specific case we discovered at $DAYJOB. > > I only see another `has_object_file()` call site at the very beginning, > and I think this needs to fetch. Or maybe it is more efficient to > construct the cache tree from scratch than fetch it? Good point - if we can construct it, we probably shouldn't fetch it. > There is also `cache_tree_fully_valid_1()`, where I think the same > handling could potentially make sense. (Or, if you target `seen`, > `cache_tree_fully_valid()`. True.
diff --git a/cache-tree.c b/cache-tree.c index add1f07713..6728722597 100644 --- a/cache-tree.c +++ b/cache-tree.c @@ -6,6 +6,7 @@ #include "object-store.h" #include "replace-object.h" #include "promisor-remote.h" +#include "packfile.h" #ifndef DEBUG_CACHE_TREE #define DEBUG_CACHE_TREE 0 @@ -362,7 +363,9 @@ static int update_one(struct cache_tree *it, (has_promisor_remote() && ce_skip_worktree(ce)); if (is_null_oid(oid) || - (!ce_missing_ok && !has_object_file(oid))) { + (!ce_missing_ok && + !has_object_file_with_flags(oid, OBJECT_INFO_SKIP_FETCH_OBJECT) && + !is_promisor_object(oid))) { strbuf_release(&buffer); if (expected_missing) return -1;
In update_one() (used only by cache_tree_update()), there is an object existence check that, if it fails, will automatically trigger a lazy fetch in a partial clone. But the fetch is not necessary - the object is not actually being used. Replace that check with two checks: an object existence check that does not fetch, and then a check that that object is a promisor object. Doing this avoids multiple lazy fetches when merging two trees in a partial clone, as noticed at $DAYJOB. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Another alternative is to think about whether the object existence check here is needed in the first place. There might also be other places we can make a similar change in update_one(), but I limited myself to what's needed to solve the specific case we discovered at $DAYJOB. --- cache-tree.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)