diff mbox series

[GSoC] merge-recursive: optimize string_list construction

Message ID 20250211194334.20710-1-meetsoni3017@gmail.com (mailing list archive)
State New
Headers show
Series [GSoC] merge-recursive: optimize string_list construction | expand

Commit Message

Meet Soni Feb. 11, 2025, 7:43 p.m. UTC
Avoid O(n^2) complexity when building a sorted `string_list` by
constructing it unsorted and sorting it afterward, reducing the
complexity to O(n log n).

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
---
 merge-recursive.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)


base-commit: 9520f7d9985d8879bddd157309928fc0679c8e92

Comments

Elijah Newren Feb. 11, 2025, 8:59 p.m. UTC | #1
On Tue, Feb 11, 2025 at 11:43 AM Meet Soni <meetsoni3017@gmail.com> wrote:
>
> Avoid O(n^2) complexity when building a sorted `string_list` by
> constructing it unsorted and sorting it afterward, reducing the
> complexity to O(n log n).

I'm tempted to say merge-recursive.[ch] is nearly dead and planned for
removal, so there's not much value in messing with it, but...it's not
dead yet, so I guess this is worthwhile.

> Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
> ---
>  merge-recursive.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/merge-recursive.c b/merge-recursive.c
> index 5dfaf32b2c..c43b79e6ef 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -2757,24 +2757,18 @@ static int process_renames(struct merge_options *opt,
>         struct string_list b_by_dst = STRING_LIST_INIT_NODUP;
>         const struct rename *sre;
>
> -       /*
> -        * FIXME: As string-list.h notes, it's O(n^2) to build a sorted
> -        * string_list one-by-one, but O(n log n) to build it unsorted and
> -        * then sort it.  Note that as we build the list, we do not need to
> -        * check if the existing destination path is already in the list,
> -        * because the structure of diffcore_rename guarantees we won't
> -        * have duplicates.
> -        */
>         for (i = 0; i < a_renames->nr; i++) {
>                 sre = a_renames->items[i].util;
> -               string_list_insert(&a_by_dst, sre->pair->two->path)->util
> +               string_list_append(&a_by_dst, sre->pair->two->path)->util
>                         = (void *)sre;
>         }
>         for (i = 0; i < b_renames->nr; i++) {
>                 sre = b_renames->items[i].util;
> -               string_list_insert(&b_by_dst, sre->pair->two->path)->util
> +               string_list_append(&b_by_dst, sre->pair->two->path)->util
>                         = (void *)sre;
>         }
> +       string_list_sort(&a_by_dst);
> +       string_list_sort(&b_by_dst);

If the original source had duplicates, this would change behavior (the
insert function checks for duplicates while append does not).
Granted, the comment above the block points out why there aren't
duplicates, but will that be obvious to future readers now that you've
removed the whole comment?

Also, are the sources already sorted?  If so, we can avoid the manual
sort calls at the end, and drop this from O(n log n) to O(n).  Digging
through the code...it appears these are setup in get_renames() and are
sorted but by pair->one->path rather than pair->two->path, so we do
need the sorts here.

Of course, get_renames() itself utilizes string_list_insert() rather
than string_list_append. and there are a number of other
string_list_insert calls in the code (though some of the others might
be hard to restructure) -- perhaps the first line of your commit
message should have a "in process_renames" qualifier, since it's only
addressing one case?

Anyway, other than perhaps tweaking the first line of the commit
message, and not removing the whole comment, the patch looks good to
me.
diff mbox series

Patch

diff --git a/merge-recursive.c b/merge-recursive.c
index 5dfaf32b2c..c43b79e6ef 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -2757,24 +2757,18 @@  static int process_renames(struct merge_options *opt,
 	struct string_list b_by_dst = STRING_LIST_INIT_NODUP;
 	const struct rename *sre;
 
-	/*
-	 * FIXME: As string-list.h notes, it's O(n^2) to build a sorted
-	 * string_list one-by-one, but O(n log n) to build it unsorted and
-	 * then sort it.  Note that as we build the list, we do not need to
-	 * check if the existing destination path is already in the list,
-	 * because the structure of diffcore_rename guarantees we won't
-	 * have duplicates.
-	 */
 	for (i = 0; i < a_renames->nr; i++) {
 		sre = a_renames->items[i].util;
-		string_list_insert(&a_by_dst, sre->pair->two->path)->util
+		string_list_append(&a_by_dst, sre->pair->two->path)->util
 			= (void *)sre;
 	}
 	for (i = 0; i < b_renames->nr; i++) {
 		sre = b_renames->items[i].util;
-		string_list_insert(&b_by_dst, sre->pair->two->path)->util
+		string_list_append(&b_by_dst, sre->pair->two->path)->util
 			= (void *)sre;
 	}
+	string_list_sort(&a_by_dst);
+	string_list_sort(&b_by_dst);
 
 	for (i = 0, j = 0; i < a_renames->nr || j < b_renames->nr;) {
 		struct string_list *renames1, *renames2Dst;