diff mbox series

builtin/fetch: iterate symrefs instead of all when checking dangling refs

Message ID pull.1812.git.git.1728962878717.gitgitgadget@gmail.com (mailing list archive)
State New
Headers show
Series builtin/fetch: iterate symrefs instead of all when checking dangling refs | expand

Commit Message

Liu Zhongbo Oct. 15, 2024, 3:27 a.m. UTC
From: Liu Zhongbo <liuzhongbo.6666@bytedance.com>

refs_warn_dangling_symref() traverse all references to check if there are
any dangling symbolic references. The complexity is
O(number of deleted references * total number of references).
It will take a lot of time if there are tens of thousands of branches in
monorepo.

So I first identified all the symbolic references, and then only traverse
in these references. The complexity is
O (number of deleted references * number of symbolic references).

Due to the infrequent use of symbolic references, there will be significant
performance improvements here. In my case, the prune_refs() time has been
reduced from 20 seconds to 4 seconds.

Signed-off-by: Liu Zhongbo <liuzhongbo.6666@bytedance.com>
---
    builtin/fetch: iterate symrefs instead of all refs

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1812%2Flzb6666%2Fspeed_up_prune_refs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1812/lzb6666/speed_up_prune_refs-v1
Pull-Request: https://github.com/git/git/pull/1812

 builtin/fetch.c |  7 +++++--
 refs.c          | 35 ++++++++++++++++++++++++++---------
 refs.h          |  4 +++-
 3 files changed, 34 insertions(+), 12 deletions(-)


base-commit: ef8ce8f3d4344fd3af049c17eeba5cd20d98b69f

Comments

Taylor Blau Oct. 15, 2024, 7:08 p.m. UTC | #1
On Tue, Oct 15, 2024 at 03:27:58AM +0000, Liu Zhongbo via GitGitGadget wrote:
> From: Liu Zhongbo <liuzhongbo.6666@bytedance.com>
>
> refs_warn_dangling_symref() traverse all references to check if there are
> any dangling symbolic references. The complexity is
> O(number of deleted references * total number of references).
> It will take a lot of time if there are tens of thousands of branches in
> monorepo.
>
> So I first identified all the symbolic references, and then only traverse
> in these references. The complexity is
> O (number of deleted references * number of symbolic references).
>
> Due to the infrequent use of symbolic references, there will be significant
> performance improvements here. In my case, the prune_refs() time has been
> reduced from 20 seconds to 4 seconds.
>
> Signed-off-by: Liu Zhongbo <liuzhongbo.6666@bytedance.com>
> ---
>     builtin/fetch: iterate symrefs instead of all refs
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1812%2Flzb6666%2Fspeed_up_prune_refs-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1812/lzb6666/speed_up_prune_refs-v1
> Pull-Request: https://github.com/git/git/pull/1812
>
>  builtin/fetch.c |  7 +++++--
>  refs.c          | 35 ++++++++++++++++++++++++++---------
>  refs.h          |  4 +++-
>  3 files changed, 34 insertions(+), 12 deletions(-)
>
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 80a64d0d269..ec4be60cfeb 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1412,15 +1412,18 @@ static int prune_refs(struct display_state *display_state,
>
>  	if (verbosity >= 0) {
>  		int summary_width = transport_summary_width(stale_refs);
> +	    struct string_list symrefs = STRING_LIST_INIT_NODUP;
> +	    refs_get_symrefs(get_main_ref_store(the_repository), &symrefs);

Without reading the contents of the patch below, there appear to be some
style issues here. It seems that the indentation here is incorrect, as
well as below throughout the patch.

Please make sure that the indentation is consistent with the rest of the
project's conventions which can be found in
Documentation/CodingGuidelines before re-submitting.

Thanks,
Taylor
Patrick Steinhardt Oct. 16, 2024, 7:13 a.m. UTC | #2
On Tue, Oct 15, 2024 at 03:27:58AM +0000, Liu Zhongbo via GitGitGadget wrote:
> From: Liu Zhongbo <liuzhongbo.6666@bytedance.com>
> 
> refs_warn_dangling_symref() traverse all references to check if there are
> any dangling symbolic references. The complexity is
> O(number of deleted references * total number of references).
> It will take a lot of time if there are tens of thousands of branches in
> monorepo.
> 
> So I first identified all the symbolic references, and then only traverse
> in these references. The complexity is
> O (number of deleted references * number of symbolic references).

Okay. I'm a bit curious here, mostly because I would have thought that
it should be able to make this O(number of deleted refs * logn(existing
refs)): for every deleted ref, you should only have to look up whether
its target exists or not, which typically is O(logn existing refs).

But let's read on.

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 80a64d0d269..ec4be60cfeb 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1412,15 +1412,18 @@ static int prune_refs(struct display_state *display_state,
>  
>  	if (verbosity >= 0) {
>  		int summary_width = transport_summary_width(stale_refs);
> +	    struct string_list symrefs = STRING_LIST_INIT_NODUP;
> +	    refs_get_symrefs(get_main_ref_store(the_repository), &symrefs);
>  
>  		for (ref = stale_refs; ref; ref = ref->next) {
>  			display_ref_update(display_state, '-', _("[deleted]"), NULL,
>  					   _("(none)"), ref->name,
>  					   &ref->new_oid, &ref->old_oid,
>  					   summary_width);
> -			refs_warn_dangling_symref(get_main_ref_store(the_repository),
> -						  stderr, dangling_msg, ref->name);
> +	        refs_warn_dangling_symref(get_main_ref_store(the_repository), stderr,
> +				      dangling_msg, ref->name, &symrefs);
>  		}
> +	    string_list_clear(&symrefs, 0);
>  	}
>  
>  cleanup:

Okay, so here we're now iterating through the refs which we are about to
delete. For every such ref, we call `refs_warn_dangling_symref()`, which
iterates through all references in the repository to check whether any
of them resolves to the passed ref.

That feels inefficient indeed, and I agree that reading symrefs once is
going to be way more efficient. But open-coding part of the logic here
does not make much sense to me, as the function itself should know to do
it efficiently.

Can we instead refactor the code to use `refs_warn_dangling_symrefs()`,
which does all of this in a single iteration over all refs? That'd
remove the need for us to do all of the above as we now only iterate a
single time through all refs, wouldn't it?

That still isn't quite O(number of deleted refs * logn existing refs) in
theory. It's rather O(existing refs * logn deleted refs) as before
because we have to also look up the currently iterader refname in
`struct warn_if_dangling_data`, which is using a binary search in the
sorted string list. But I think this should still be way more efficient
compared to the current solution, where we iterate through all refs
multiple times.

Patrick
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 80a64d0d269..ec4be60cfeb 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1412,15 +1412,18 @@  static int prune_refs(struct display_state *display_state,
 
 	if (verbosity >= 0) {
 		int summary_width = transport_summary_width(stale_refs);
+	    struct string_list symrefs = STRING_LIST_INIT_NODUP;
+	    refs_get_symrefs(get_main_ref_store(the_repository), &symrefs);
 
 		for (ref = stale_refs; ref; ref = ref->next) {
 			display_ref_update(display_state, '-', _("[deleted]"), NULL,
 					   _("(none)"), ref->name,
 					   &ref->new_oid, &ref->old_oid,
 					   summary_width);
-			refs_warn_dangling_symref(get_main_ref_store(the_repository),
-						  stderr, dangling_msg, ref->name);
+	        refs_warn_dangling_symref(get_main_ref_store(the_repository), stderr,
+				      dangling_msg, ref->name, &symrefs);
 		}
+	    string_list_clear(&symrefs, 0);
 	}
 
 cleanup:
diff --git a/refs.c b/refs.c
index 5f729ed4124..8dd480a7a91 100644
--- a/refs.c
+++ b/refs.c
@@ -463,16 +463,33 @@  static int warn_if_dangling_symref(const char *refname, const char *referent UNU
 	return 0;
 }
 
-void refs_warn_dangling_symref(struct ref_store *refs, FILE *fp,
-			       const char *msg_fmt, const char *refname)
+static int append_symref(const char *refname, const char *referent UNUSED,
+			 const struct object_id *oid UNUSED,
+			 int flags, void *cb_data) {
+    struct string_list *d = cb_data;
+    if ((flags & REF_ISSYMREF)){
+        string_list_append(d, refname);
+    }
+    return 0;
+}
+
+void refs_get_symrefs(struct ref_store *refs, struct string_list *refnames)
 {
-	struct warn_if_dangling_data data = {
-		.refs = refs,
-		.fp = fp,
-		.refname = refname,
-		.msg_fmt = msg_fmt,
-	};
-	refs_for_each_rawref(refs, warn_if_dangling_symref, &data);
+	refs_for_each_rawref(refs, append_symref, refnames);
+}
+
+void refs_warn_dangling_symref(struct ref_store *refs, FILE *fp,
+                               const char *msg_fmt, const char *refname, struct string_list *symrefs) {
+    const char *resolves_to;
+    struct string_list_item *symref;
+    for_each_string_list_item(symref, symrefs) {
+        resolves_to = refs_resolve_ref_unsafe(refs, symref->string,
+                                              0, NULL, NULL);
+        if (resolves_to && strcmp(resolves_to, refname) == 0) {
+            fprintf(fp, msg_fmt, symref->string);
+            fputc('\n', fp);
+        }
+    }
 }
 
 void refs_warn_dangling_symrefs(struct ref_store *refs, FILE *fp,
diff --git a/refs.h b/refs.h
index 108dfc93b34..d3b65564561 100644
--- a/refs.h
+++ b/refs.h
@@ -394,8 +394,10 @@  static inline const char *has_glob_specials(const char *pattern)
 	return strpbrk(pattern, "?*[");
 }
 
+void refs_get_symrefs(struct ref_store *refs, struct string_list *refnames);
+
 void refs_warn_dangling_symref(struct ref_store *refs, FILE *fp,
-			       const char *msg_fmt, const char *refname);
+			       const char *msg_fmt, const char *refname, struct string_list *symrefs);
 void refs_warn_dangling_symrefs(struct ref_store *refs, FILE *fp,
 				const char *msg_fmt, const struct string_list *refnames);