diff mbox series

[4/9] submodule.c: sort changed_submodule_names before searching it

Message ID 20180911234951.14129-5-sbeller@google.com (mailing list archive)
State New, archived
Headers show
Series fetch: make sure submodule oids are fetched | expand

Commit Message

Stefan Beller Sept. 11, 2018, 11:49 p.m. UTC
We can string_list_insert() to maintain sorted-ness of the
list as we find new items, or we can string_list_append() to
build an unsorted list and sort it at the end just once.

To pick which one is more appropriate, we notice the fact
that we discover new items more or less in the already
sorted order.  That makes "append then sort" more
appropriate.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 submodule.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Sept. 12, 2018, 6:18 p.m. UTC | #1
Stefan Beller <sbeller@google.com> writes:

> We can string_list_insert() to maintain sorted-ness of the
> list as we find new items, or we can string_list_append() to
> build an unsorted list and sort it at the end just once.
>
> To pick which one is more appropriate, we notice the fact
> that we discover new items more or less in the already
> sorted order.  That makes "append then sort" more
> appropriate.

Sorry, but I still do not get the math you are implying in the
second paragraph.  Are you saying that append-then-sort is efficient
when items being appended is already sorted?  That depends on the
sorting algorithm used, so the logic is incomplete unless you say
"given that we use X for sorting,...", I think.

Do we really discover new items in sorted order, by the way?  In a
single diff invocation made inside collect_changed_submodules() for
one commit in the superproject's history, we will grab changed paths
in the pathname order (i.e. sorted); if the superproject's tip commit
touches the submodules at paths A and Z, we will discover these two
paths in sorted order.

But because we are walking the superproject's history to collect all
paths that have been affected in that function, and repeatedly
calling diff as we discover commit in the superproject's history, I
am not sure how well the resulting set of paths would be sorted.

The tip commit in superproject's history may have modified the
submodule at path X, the parent of that commit may have touched the
submodule at path M, and its parent may have touched the submodule
at path A.  Don't we end up grabbing these paths in that discoverd
order, i.e. X, M and A?

I still think changing it from "insert as we find an item, keeping
the list sorted" to "append all and then sort before we start
looking things up from the result" makes sense, but I do not think
the "we find things in sorted order" is either true, or it would
affect the choice between the two.  A justification to choose the
latter I can think of that makes sense is that we don't have to pay
cost to keep the list sorted while building it because we do not do
any look-up while building the list.

> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>  submodule.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/submodule.c b/submodule.c
> index d29dfa3d1f5..c6eff7699f3 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1256,7 +1256,7 @@ static int get_next_submodule(struct child_process *cp,
>  		case RECURSE_SUBMODULES_DEFAULT:
>  		case RECURSE_SUBMODULES_ON_DEMAND:
>  			if (!submodule ||
> -			    !unsorted_string_list_lookup(
> +			    !string_list_lookup(
>  					&changed_submodule_names,
>  					submodule->name))
>  				continue;
> @@ -1350,6 +1350,7 @@ int fetch_populated_submodules(struct repository *r,
>  	/* default value, "--submodule-prefix" and its value are added later */
>  
>  	calculate_changed_submodule_paths();
> +	string_list_sort(&changed_submodule_names);
>  	run_processes_parallel(max_parallel_jobs,
>  			       get_next_submodule,
>  			       fetch_start_failure,
Stefan Beller Sept. 12, 2018, 7:06 p.m. UTC | #2
On Wed, Sep 12, 2018 at 11:18 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Stefan Beller <sbeller@google.com> writes:
>
> > We can string_list_insert() to maintain sorted-ness of the
> > list as we find new items, or we can string_list_append() to
> > build an unsorted list and sort it at the end just once.
> >
> > To pick which one is more appropriate, we notice the fact
> > that we discover new items more or less in the already
> > sorted order.  That makes "append then sort" more
> > appropriate.
>
> Sorry, but I still do not get the math you are implying in the
> second paragraph.  Are you saying that append-then-sort is efficient
> when items being appended is already sorted?  That depends on the
> sorting algorithm used, so the logic is incomplete unless you say
> "given that we use X for sorting,...", I think.
>
> Do we really discover new items in sorted order, by the way?  In a
> single diff invocation made inside collect_changed_submodules() for
> one commit in the superproject's history, we will grab changed paths
> in the pathname order (i.e. sorted); if the superproject's tip commit
> touches the submodules at paths A and Z, we will discover these two
> paths in sorted order.
>
> But because we are walking the superproject's history to collect all
> paths that have been affected in that function, and repeatedly
> calling diff as we discover commit in the superproject's history, I
> am not sure how well the resulting set of paths would be sorted.
>
> The tip commit in superproject's history may have modified the
> submodule at path X, the parent of that commit may have touched the
> submodule at path M, and its parent may have touched the submodule
> at path A.  Don't we end up grabbing these paths in that discoverd
> order, i.e. X, M and A?

That is true.

>
> I still think changing it from "insert as we find an item, keeping
> the list sorted" to "append all and then sort before we start
> looking things up from the result" makes sense, but I do not think
> the "we find things in sorted order" is either true, or it would
> affect the choice between the two.  A justification to choose the
> latter I can think of that makes sense is that we don't have to pay
> cost to keep the list sorted while building it because we do not do
> any look-up while building the list.

ok.

Thanks,
Stefan
diff mbox series

Patch

diff --git a/submodule.c b/submodule.c
index d29dfa3d1f5..c6eff7699f3 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1256,7 +1256,7 @@  static int get_next_submodule(struct child_process *cp,
 		case RECURSE_SUBMODULES_DEFAULT:
 		case RECURSE_SUBMODULES_ON_DEMAND:
 			if (!submodule ||
-			    !unsorted_string_list_lookup(
+			    !string_list_lookup(
 					&changed_submodule_names,
 					submodule->name))
 				continue;
@@ -1350,6 +1350,7 @@  int fetch_populated_submodules(struct repository *r,
 	/* default value, "--submodule-prefix" and its value are added later */
 
 	calculate_changed_submodule_paths();
+	string_list_sort(&changed_submodule_names);
 	run_processes_parallel(max_parallel_jobs,
 			       get_next_submodule,
 			       fetch_start_failure,