diff mbox series

[2/3] index-pack: no blobs during outgoing link check

Message ID 300f53b8e39fa1dd55f65924d20f8abd22cbbfc9.1733170252.git.jonathantanmy@google.com (mailing list archive)
State Superseded
Headers show
Series Performance improvements for repacking non-promisor objects | expand

Commit Message

Jonathan Tan Dec. 2, 2024, 8:18 p.m. UTC
As a follow-up to the parent of this commit, it was found that not
checking for the existence of blobs linked from trees sped up the fetch
from 24m47.815s to 2m2.127s. Teach Git to do that.

The benefit of doing this is as above (fetch speedup), but the drawback
is that if the packfile to be indexed references a local blob directly
(that is, not through a local tree), that local blob is in danger of
being garbage collected. Such a situation may arise if we push local
commits, including one with a change to a blob in the root tree,
and then the server incorporates them into its main branch through a
"rebase" or "squash" merge strategy, and then we fetch the new main
branch from the server.

This situation has not been observed yet - we have only noticed missing
commits, not missing trees or blobs. (In fact, if it were believed that
only missing commits are problematic, one could argue that we should
also exclude trees during the outgoing link check; but it is safer to
include them.)

Due to the rarity of the situation (it has not been observed to happen
in real life), and because the "penalty" in such a situation is merely
to refetch the missing blob when it's needed, the tradeoff seems
worth it.

(Blobs may also be linked from tag objects, but it is impossible to know
the type of an object linked from a tag object without looking it up in
the object database, so the code for that is untouched.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/index-pack.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Patrick Steinhardt Dec. 3, 2024, 6 a.m. UTC | #1
On Mon, Dec 02, 2024 at 12:18:39PM -0800, Jonathan Tan wrote:
> As a follow-up to the parent of this commit, it was found that not
> checking for the existence of blobs linked from trees sped up the fetch
> from 24m47.815s to 2m2.127s. Teach Git to do that.
> 
> The benefit of doing this is as above (fetch speedup), but the drawback
> is that if the packfile to be indexed references a local blob directly
> (that is, not through a local tree), that local blob is in danger of
> being garbage collected. Such a situation may arise if we push local
> commits, including one with a change to a blob in the root tree,
> and then the server incorporates them into its main branch through a
> "rebase" or "squash" merge strategy, and then we fetch the new main
> branch from the server.

Okay, so we know that we are basically doing the wrong thing with the
optimization, but by skipping blobs we can get a significant speedup and
the failure mode is that we will re-fetch the object in a later step.
And because we think the situation is rare it shouldn't be a huge issue
in practice.

> This situation has not been observed yet - we have only noticed missing
> commits, not missing trees or blobs. (In fact, if it were believed that
> only missing commits are problematic, one could argue that we should
> also exclude trees during the outgoing link check; but it is safer to
> include them.)
> 
> Due to the rarity of the situation (it has not been observed to happen
> in real life), and because the "penalty" in such a situation is merely
> to refetch the missing blob when it's needed, the tradeoff seems
> worth it.

So is this a one-off event that may happen once per blob, or would we
eventually evict the refetched blob and run into the same situation
repeatedly?

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 8e7d14c17e..58d24540dc 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -830,8 +830,10 @@ static void do_record_outgoing_links(struct object *obj)
>  			 * verified, so do not print any here.
>  			 */
>  			return;
> -		while (tree_entry_gently(&desc, &entry))
> -			record_outgoing_link(&entry.oid);
> +		while (tree_entry_gently(&desc, &entry)) {
> +			if (S_ISDIR(entry.mode))
> +				record_outgoing_link(&entry.oid);
> +		}

Without the context of the commit message this code snippet likely would
not make any sense to a reader. The "correct" logic would be to record
all objects, regardless of whether they are an object ID or not. But we
explicitly choose not to as a tradeoff between performance and
correctness.

All to say that we should have a comment here that explains what is
going on.

Patrick
Jonathan Tan Dec. 3, 2024, 9:40 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:
> > This situation has not been observed yet - we have only noticed missing
> > commits, not missing trees or blobs. (In fact, if it were believed that
> > only missing commits are problematic, one could argue that we should
> > also exclude trees during the outgoing link check; but it is safer to
> > include them.)
> > 
> > Due to the rarity of the situation (it has not been observed to happen
> > in real life), and because the "penalty" in such a situation is merely
> > to refetch the missing blob when it's needed, the tradeoff seems
> > worth it.
> 
> So is this a one-off event that may happen once per blob, or would we
> eventually evict the refetched blob and run into the same situation
> repeatedly?

One-off, since when refetched, the blob is in a promisor pack (and
thus won't be GC-ed). I've added this to the code comment that I added
following your suggestion below.

> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> > index 8e7d14c17e..58d24540dc 100644
> > --- a/builtin/index-pack.c
> > +++ b/builtin/index-pack.c
> > @@ -830,8 +830,10 @@ static void do_record_outgoing_links(struct object *obj)
> >  			 * verified, so do not print any here.
> >  			 */
> >  			return;
> > -		while (tree_entry_gently(&desc, &entry))
> > -			record_outgoing_link(&entry.oid);
> > +		while (tree_entry_gently(&desc, &entry)) {
> > +			if (S_ISDIR(entry.mode))
> > +				record_outgoing_link(&entry.oid);
> > +		}
> 
> Without the context of the commit message this code snippet likely would
> not make any sense to a reader. The "correct" logic would be to record
> all objects, regardless of whether they are an object ID or not. But we
> explicitly choose not to as a tradeoff between performance and
> correctness.
> 
> All to say that we should have a comment here that explains what is
> going on.
> 
> Patrick

Makes sense. I had to move almost the entirety of the commit message
into a code comment - I don't think putting merely a part here would be
enough context.
Junio C Hamano Dec. 3, 2024, 10:16 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

>> The benefit of doing this is as above (fetch speedup), but the drawback
>> is that if the packfile to be indexed references a local blob directly
>> (that is, not through a local tree), that local blob is in danger of
>> being garbage collected. Such a situation may arise if we push local
>> commits, including one with a change to a blob in the root tree,
>> and then the server incorporates them into its main branch through a
>> "rebase" or "squash" merge strategy, and then we fetch the new main
>> branch from the server.
>
> Okay, so we know that we are basically doing the wrong thing with the
> optimization, but by skipping blobs we can get a significant speedup and
> the failure mode is that we will re-fetch the object in a later step.
> And because we think the situation is rare it shouldn't be a huge issue
> in practice.

That is how I read it, but the description may want to make the pros
and cons more explicit.  One of the reasons why the users choose to
use lazy clone is to avoid grabbing large blobs until they need
them, so I am not sure how they feel about having to discard and
then fetch again after creating a potentially large blob.  As long
as it does not happen repeatedly for the same blob, it probably is
OK.

> Without the context of the commit message this code snippet likely would
> not make any sense to a reader. The "correct" logic would be to record
> all objects, regardless of whether they are an object ID or not. But we
> explicitly choose not to as a tradeoff between performance and
> correctness.
>
> All to say that we should have a comment here that explains what is
> going on.

Thanks for reviewing and commenting.
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8e7d14c17e..58d24540dc 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -830,8 +830,10 @@  static void do_record_outgoing_links(struct object *obj)
 			 * verified, so do not print any here.
 			 */
 			return;
-		while (tree_entry_gently(&desc, &entry))
-			record_outgoing_link(&entry.oid);
+		while (tree_entry_gently(&desc, &entry)) {
+			if (S_ISDIR(entry.mode))
+				record_outgoing_link(&entry.oid);
+		}
 	} else if (obj->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *) obj;
 		struct commit_list *parents = commit->parents;