Message ID | 300f53b8e39fa1dd55f65924d20f8abd22cbbfc9.1733170252.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Performance improvements for repacking non-promisor objects | expand |
On Mon, Dec 02, 2024 at 12:18:39PM -0800, Jonathan Tan wrote: > As a follow-up to the parent of this commit, it was found that not > checking for the existence of blobs linked from trees sped up the fetch > from 24m47.815s to 2m2.127s. Teach Git to do that. > > The benefit of doing this is as above (fetch speedup), but the drawback > is that if the packfile to be indexed references a local blob directly > (that is, not through a local tree), that local blob is in danger of > being garbage collected. Such a situation may arise if we push local > commits, including one with a change to a blob in the root tree, > and then the server incorporates them into its main branch through a > "rebase" or "squash" merge strategy, and then we fetch the new main > branch from the server. Okay, so we know that we are basically doing the wrong thing with the optimization, but by skipping blobs we can get a significant speedup and the failure mode is that we will re-fetch the object in a later step. And because we think the situation is rare it shouldn't be a huge issue in practice. > This situation has not been observed yet - we have only noticed missing > commits, not missing trees or blobs. (In fact, if it were believed that > only missing commits are problematic, one could argue that we should > also exclude trees during the outgoing link check; but it is safer to > include them.) > > Due to the rarity of the situation (it has not been observed to happen > in real life), and because the "penalty" in such a situation is merely > to refetch the missing blob when it's needed, the tradeoff seems > worth it. So is this a one-off event that may happen once per blob, or would we eventually evict the refetched blob and run into the same situation repeatedly? > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > index 8e7d14c17e..58d24540dc 100644 > --- a/builtin/index-pack.c > +++ b/builtin/index-pack.c > @@ -830,8 +830,10 @@ static void do_record_outgoing_links(struct object *obj) > * verified, so do not print any here. > */ > return; > - while (tree_entry_gently(&desc, &entry)) > - record_outgoing_link(&entry.oid); > + while (tree_entry_gently(&desc, &entry)) { > + if (S_ISDIR(entry.mode)) > + record_outgoing_link(&entry.oid); > + } Without the context of the commit message this code snippet likely would not make any sense to a reader. The "correct" logic would be to record all objects, regardless of whether they are an object ID or not. But we explicitly choose not to as a tradeoff between performance and correctness. All to say that we should have a comment here that explains what is going on. Patrick
Patrick Steinhardt <ps@pks.im> writes: > > This situation has not been observed yet - we have only noticed missing > > commits, not missing trees or blobs. (In fact, if it were believed that > > only missing commits are problematic, one could argue that we should > > also exclude trees during the outgoing link check; but it is safer to > > include them.) > > > > Due to the rarity of the situation (it has not been observed to happen > > in real life), and because the "penalty" in such a situation is merely > > to refetch the missing blob when it's needed, the tradeoff seems > > worth it. > > So is this a one-off event that may happen once per blob, or would we > eventually evict the refetched blob and run into the same situation > repeatedly? One-off, since when refetched, the blob is in a promisor pack (and thus won't be GC-ed). I've added this to the code comment that I added following your suggestion below. > > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > > index 8e7d14c17e..58d24540dc 100644 > > --- a/builtin/index-pack.c > > +++ b/builtin/index-pack.c > > @@ -830,8 +830,10 @@ static void do_record_outgoing_links(struct object *obj) > > * verified, so do not print any here. > > */ > > return; > > - while (tree_entry_gently(&desc, &entry)) > > - record_outgoing_link(&entry.oid); > > + while (tree_entry_gently(&desc, &entry)) { > > + if (S_ISDIR(entry.mode)) > > + record_outgoing_link(&entry.oid); > > + } > > Without the context of the commit message this code snippet likely would > not make any sense to a reader. The "correct" logic would be to record > all objects, regardless of whether they are an object ID or not. But we > explicitly choose not to as a tradeoff between performance and > correctness. > > All to say that we should have a comment here that explains what is > going on. > > Patrick Makes sense. I had to move almost the entirety of the commit message into a code comment - I don't think putting merely a part here would be enough context.
Patrick Steinhardt <ps@pks.im> writes: >> The benefit of doing this is as above (fetch speedup), but the drawback >> is that if the packfile to be indexed references a local blob directly >> (that is, not through a local tree), that local blob is in danger of >> being garbage collected. Such a situation may arise if we push local >> commits, including one with a change to a blob in the root tree, >> and then the server incorporates them into its main branch through a >> "rebase" or "squash" merge strategy, and then we fetch the new main >> branch from the server. > > Okay, so we know that we are basically doing the wrong thing with the > optimization, but by skipping blobs we can get a significant speedup and > the failure mode is that we will re-fetch the object in a later step. > And because we think the situation is rare it shouldn't be a huge issue > in practice. That is how I read it, but the description may want to make the pros and cons more explicit. One of the reasons why the users choose to use lazy clone is to avoid grabbing large blobs until they need them, so I am not sure how they feel about having to discard and then fetch again after creating a potentially large blob. As long as it does not happen repeatedly for the same blob, it probably is OK. > Without the context of the commit message this code snippet likely would > not make any sense to a reader. The "correct" logic would be to record > all objects, regardless of whether they are an object ID or not. But we > explicitly choose not to as a tradeoff between performance and > correctness. > > All to say that we should have a comment here that explains what is > going on. Thanks for reviewing and commenting.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 8e7d14c17e..58d24540dc 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -830,8 +830,10 @@ static void do_record_outgoing_links(struct object *obj) * verified, so do not print any here. */ return; - while (tree_entry_gently(&desc, &entry)) - record_outgoing_link(&entry.oid); + while (tree_entry_gently(&desc, &entry)) { + if (S_ISDIR(entry.mode)) + record_outgoing_link(&entry.oid); + } } else if (obj->type == OBJ_COMMIT) { struct commit *commit = (struct commit *) obj; struct commit_list *parents = commit->parents;
As a follow-up to the parent of this commit, it was found that not checking for the existence of blobs linked from trees sped up the fetch from 24m47.815s to 2m2.127s. Teach Git to do that. The benefit of doing this is as above (fetch speedup), but the drawback is that if the packfile to be indexed references a local blob directly (that is, not through a local tree), that local blob is in danger of being garbage collected. Such a situation may arise if we push local commits, including one with a change to a blob in the root tree, and then the server incorporates them into its main branch through a "rebase" or "squash" merge strategy, and then we fetch the new main branch from the server. This situation has not been observed yet - we have only noticed missing commits, not missing trees or blobs. (In fact, if it were believed that only missing commits are problematic, one could argue that we should also exclude trees during the outgoing link check; but it is safer to include them.) Due to the rarity of the situation (it has not been observed to happen in real life), and because the "penalty" in such a situation is merely to refetch the missing blob when it's needed, the tradeoff seems worth it. (Blobs may also be linked from tag objects, but it is impossible to know the type of an object linked from a tag object without looking it up in the object database, so the code for that is untouched.) Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/index-pack.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)