fetch: add "--parallel", which fetches all remotes in parallel

Message ID 20190717015903.4384-1-palmer@sifive.com (mailing list archive)
State New, archived
Series fetch: add "--parallel", which fetches all remotes in parallel

Commit Message

Palmer Dabbelt July 17, 2019, 1:59 a.m. UTC
I have a handful of repositories with a bunch of remotes, and this week
I'm on a high-bandwidth, high-latency connection to those remotes.  This
results in "fetch --all" being painfully slow: 3-4 seconds per remote,
almost all of it at near-zero link utilization (presumably doing an SSH
handshake), for a total of a minute or so.

This patch fixes the issue in the simplest way I could come up with: it
adds a "--parallel" argument to fetch that selects when to block on the
fetch children.  This results in every fetch child running in parallel,
which provides per-remote parallelism.  My "fetch --all" times go down
from ~60 seconds to ~5 seconds, which is great!

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>

---

This is more of an RFC than an actual patch, for three reasons:

* I'm not sure if it's safe to access the .git database from multiple
  processes at the same time.  This is my first time messing around with
  git to this degree, so I figured it's better to just ask than to try
  to figure it out.
* As I was writing the documentation I found "fetch --jobs".  It seems
  like I should use that instead of adding a new argument, but I wasn't
  sure.  If I do this, I'd add some other config option, because
  submodule.jobs seems like a bad name for this.
* This implementation is recursive and I'm a bit worried about stack
  usage.  The recursive implementation doesn't make nearly as much
  sense with --jobs anyway, so I'd probably just do it properly as a
  loop.

It's at the point where I've spent about as long writing the code as
I'll save waiting on "fetch --all" over the rest of my life, and given
the unknowns I thought it would be best to just send out the patch and
see if it has any legs.

If this seems reasonable then I'll try to find some time to finish it.
Specifically, I'd like to:

* Use "--jobs" instead of "--parallel", to avoid the extra argument.
* Write some sort of test case.
---
 Documentation/fetch-options.txt |  3 +++
 builtin/fetch.c                 | 45 +++++++++++++++++++++++----------
 2 files changed, 34 insertions(+), 14 deletions(-)

Comments

Junio C Hamano July 18, 2019, 5:45 p.m. UTC | #1
Palmer Dabbelt <palmer@sifive.com> writes:

> * I'm not sure if it's safe to access the .git database from multiple
>   processes at the same time.

It is supposed to be safe, and you are supposed to keep it that way ;-)

> * As I was writing the documentation I found "fetch --jobs".  It seems
>   like I should use that instead of adding a new argument, but I wasn't
>   sure.

Why not?  What makes you feel it is a bad idea to follow that
pattern?

Ah, "--jobs" is already taken, and right now it is too tied to the
fetches that happen in submodules, which arguably was a design mistake.

    -j::
    --jobs=<n>::
            Number of parallel children to be used for fetching
            submodules.  Each will fetch from different submodules,
            such that fetching many submodules will be faster. By
            default submodules will be fetched one at a time.

The simplest endgame would be to replace "submodule" with
"repository" in the above description, perhaps like

	Number of parallel jobs to be used for fetching from
	multiple repositories (both fetching with "--multiple" from
	multiple repositories, and also fetching updated contents
	for submodules).  By default, fetching from multiple
	repositories and submodules is done one at a time.

and nobody would have complained if the system had been like that from
the beginning.  Existing users, however, may want extra flexibility, and
would complain loudly if we did the above, in which case we may
have to

 - introduce --fetch-jobs=<n> for what you are adding;

 - introduce --submodule-fetch-jobs=<n> as a synonym for existing
   --jobs=<n> and deprecate the current use of --jobs=<n>;

 - eventually repurpose --jobs=<n> as a short-hand to give both
   --fetch-jobs and --submodule-fetch-jobs at the same time.

> +static int parallel = 0;

Don't explicitly initialize a file-scope static with "= 0;"; instead,
let BSS take care of it.
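
That is, just write:

        static int parallel;    /* file-scope statics are zero-initialized */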

>  static int git_fetch_config(const char *k, const char *v, void *cb)
>  {
> @@ -178,6 +179,8 @@ static struct option builtin_fetch_options[] = {
>  			TRANSPORT_FAMILY_IPV6),
>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>  			N_("report that we have only objects reachable from this object")),
> +	OPT_BOOL(0, "parallel", &parallel,
> +		 N_("fetch in parallel from each remote")),
>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
>  		 N_("run 'gc --auto' after fetching")),
> @@ -1456,12 +1459,15 @@ static void add_options_to_argv(struct argv_array *argv)
>  
>  }
>  
> -static int fetch_multiple(struct string_list *list)
> +static int fetch_multiple(struct string_list *list, int i)
>  {
> -	int i, result = 0;

'i' is a perfectly good name for a local variable that is used for
loop control purposes, but it makes a horrible name for a parameter.

The existing 'list' is not any better either---we already know it is a
list from its type; the name should say what the list is about, what
it represents.  But having a horribly named parameter already is not
a good reason to make the code even worse.

And as you said, recursion makes the code structure harder to follow
here.  Keeping an array of --jobs=<n> cmd structures, looping to fill
them by starting children, and calling wait() to reap whichever of the
started ones exits first so that its slot can be refilled, would be
easier to follow if done in a loop, I think.
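
For illustration only, here is the rough shape of that loop, sketched
with raw POSIX calls rather than git's run-command API (the function
name and every detail below are made up, not taken from builtin/fetch.c):

#include <sys/wait.h>
#include <unistd.h>

static int fetch_all_slots(char *const remotes[], int nr, int jobs)
{
        pid_t pid[jobs];        /* one slot per allowed job; assumes jobs >= 1 */
        int next = 0, running = 0, result = 0, status, i;
        pid_t done;

        while (next < nr || running > 0) {
                /* fill every open slot with a new "git fetch" child */
                while (running < jobs && next < nr) {
                        pid_t p = fork();
                        if (p < 0)
                                return 1;       /* real code would reap the rest */
                        if (!p) {
                                execlp("git", "git", "fetch", "--append",
                                       remotes[next], (char *)NULL);
                                _exit(127);     /* exec failed */
                        }
                        pid[running++] = p;
                        next++;
                }
                /* reap whichever child exits first, opening up its slot */
                done = wait(&status);
                for (i = 0; i < running; i++) {
                        if (pid[i] == done) {
                                pid[i] = pid[--running];
                                break;
                        }
                }
                if (!WIFEXITED(status) || WEXITSTATUS(status))
                        result = 1;
        }
        return result;
}
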
Johannes Schindelin July 19, 2019, 1:23 p.m. UTC | #2
Hi Junio,

On Thu, 18 Jul 2019, Junio C Hamano wrote:

> Palmer Dabbelt <palmer@sifive.com> writes:
>
> > * As I was writing the documentation I found "fetch --jobs".  It seems
> >   like I should use that instead of adding a new argument, but I wasn't
> >   sure.
>
> Why not?  What makes you feel it is a bad idea to follow that
> pattern?
>
> Ah, "--jobs" is already taken, and right now it is too tied to the
> fetches that happen in submodules, which arguably was a design mistake.
>
>     -j::
>     --jobs=<n>::
>             Number of parallel children to be used for fetching
>             submodules.  Each will fetch from different submodules,
>             such that fetching many submodules will be faster. By
>             default submodules will be fetched one at a time.
>
> The simplest endgame would be to replace "submodule" with
> "repository" in the above description, perhaps like
>
> 	Number of parallel jobs to be used for fetching from
> 	multiple repositories (both fetching with "--multiple" from
> 	multiple repositories, and also fetching updated contents
> 	for submodules).  By default, fetching from multiple
> 	repositories and submodules is done one at a time.
>
> and nobody would have complained if the system had been like that from
> the beginning.  Existing users, however, may want extra flexibility, and
> would complain loudly if we did the above, in which case we may
> have to
>
>  - introduce --fetch-jobs=<n> for what you are adding;
>
>  - introduce --submodule-fetch-jobs=<n> as a synonym for existing
>    --jobs=<n> and deprecate the current use of --jobs=<n>;
>
>  - eventually repurpose --jobs=<n> as a short-hand to give both
>    --fetch-jobs and --submodule-fetch-jobs at the same time.

Given that the relevant code looks like this:

        if (remote) {
                if (filter_options.choice || repository_format_partial_clone)
                        fetch_one_setup_partial(remote);
                result = fetch_one(remote, argc, argv, prune_tags_ok);
        } else {
                if (filter_options.choice)
                        die(_("--filter can only be used with the remote "
                              "configured in extensions.partialclone"));
                /* TODO should this also die if we have a previous partial-clone? */
                result = fetch_multiple(&list);
        }

        if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
                struct argv_array options = ARGV_ARRAY_INIT;

                add_options_to_argv(&options);
                result = fetch_populated_submodules(the_repository,
                                                    &options,
                                                    submodule_prefix,
                                                    recurse_submodules,
                                                    recurse_submodules_default,
                                                    verbosity < 0,
                                                    max_children);
                argv_array_clear(&options);
        }

i.e. the `fetch_multiple()` call is _strictly_ before the call to
`fetch_populated_submodules()`, I would contend that this level of
separation does not serve anybody but a fan of the complicator's gloves.

You would also find yourself in quite a pickle if you wanted to explain
to a user why those `--fetch-jobs` and `--submodule-fetch-jobs` are
separate options _and_ why they repeat the command name in the option
name. And if it is hard to explain, there is usually a better design
choice in the first place.

In other words, I would be much more in favor of `--jobs` _also_
extending to the `fetch_multiple()` call, not just the
`fetch_populated_submodules()` call.

> > @@ -1456,12 +1459,15 @@ static void add_options_to_argv(struct argv_array *argv)
> >
> >  }
> >
> > -static int fetch_multiple(struct string_list *list)
> > +static int fetch_multiple(struct string_list *list, int i)
> >  {
> > -	int i, result = 0;
>
> 'i' is a perfectly good name for a local variable that is used for
> loop control purposes, but it makes a horrible name for a parameter.
>
> The existing 'list' is not any better either---we already know it is a
> list from its type; the name should say what the list is about, what
> it represents.  But having a horribly named parameter already is not
> a good reason to make the code even worse.
>
> And as you said, recursion makes the code structure harder to follow
> here.  Keeping an array of --jobs=<n> cmd structures, looping to fill
> them by starting children, and calling wait() to reap whichever of the
> started ones exits first so that its slot can be refilled, would be
> easier to follow if done in a loop, I think.

I have to admit that I'd _much_ rather see the strategy of
`fetch_populated_submodules()` emulated, where it uses the
`run_processes_parallel_tr2()` function to perform the actual fetches.

That buys us a lot of advantages:

- the recursion problems mentioned in the original mail will _not_ be
  an issue,
- the progress output won't be garbled,
- we can automatically restrict the maximum number of parallel fetches.

It should be super easy to get started with this, simply by moving the
`list` and the `i` variables into a `struct` that is then passed to
`run_processes_parallel_tr2()`, then implementing the three callbacks.

Those callbacks can take inspiration from `submodule.c`'s
`get_next_submodule()`, `fetch_start_failure()`, and `fetch_finish()`
functions, but rest assured that the `fetch_multiple()` ones will look
_a lot_ simpler.
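
For concreteness, a rough sketch of the shape those callbacks could
take (the struct and function names here are invented for illustration;
only `run_processes_parallel_tr2()` and the callback signatures come
from `run-command.h`):

struct parallel_fetch_state {
        const char **argv;              /* shared "fetch --append ..." arguments */
        struct string_list *remotes;    /* the 'list' from fetch_multiple() */
        int next, result;               /* the 'i', plus an accumulated error flag */
};

static int fetch_next_remote(struct child_process *cp, struct strbuf *out,
                             void *cb, void **task_cb)
{
        struct parallel_fetch_state *state = cb;
        const char *remote;

        if (state->next >= state->remotes->nr)
                return 0;       /* no more remotes; stop spawning children */
        remote = state->remotes->items[state->next++].string;
        *task_cb = (void *)remote;

        argv_array_pushv(&cp->args, state->argv);
        argv_array_push(&cp->args, remote);
        cp->git_cmd = 1;
        return 1;
}

static int fetch_failed_to_start(struct strbuf *out, void *cb, void *task_cb)
{
        struct parallel_fetch_state *state = cb;
        state->result = error(_("Could not fetch %s"), (const char *)task_cb);
        return 0;       /* carry on with the remaining remotes */
}

static int fetch_finished(int retvalue, struct strbuf *out,
                          void *cb, void *task_cb)
{
        struct parallel_fetch_state *state = cb;
        if (retvalue)
                state->result = error(_("Could not fetch %s"),
                                      (const char *)task_cb);
        return 0;
}

With that, fetch_multiple() shrinks to filling in the struct and calling
something like

        run_processes_parallel_tr2(max_children, fetch_next_remote,
                                   fetch_failed_to_start, fetch_finished,
                                   &state, "fetch", "parallel/fetch");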

Ciao,
Dscho

P.S.: It would probably also make sense to extend this contribution by a
second patch that teaches `git remote update` a `--jobs` option.
Junio C Hamano July 19, 2019, 3:12 p.m. UTC | #3
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I have to admit that I'd _much_ rather see the strategy of
> `fetch_populated_submodules()` emulated, where it uses the
> `run_processes_parallel_tr2()` function to perform the actual fetches.

Good. Thanks for chiming in.

Patch

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 3c9b4f9e0951..99666f40e405 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -92,6 +92,9 @@  ifndef::git-pull[]
 	Run `git gc --auto` at the end to perform garbage collection
 	if needed. This is enabled by default.
 
+--parallel::
+	Fetch all remotes in parallel.
+
 -p::
 --prune::
 	Before fetching, remove any remote-tracking references that no
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 53ce99d2bbc4..2dfdeea3b3ec 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -71,6 +71,7 @@  static struct refspec refmap = REFSPEC_INIT_FETCH;
 static struct list_objects_filter_options filter_options;
 static struct string_list server_options = STRING_LIST_INIT_DUP;
 static struct string_list negotiation_tip = STRING_LIST_INIT_NODUP;
+static int parallel = 0;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -178,6 +179,8 @@  static struct option builtin_fetch_options[] = {
 			TRANSPORT_FAMILY_IPV6),
 	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
 			N_("report that we have only objects reachable from this object")),
+	OPT_BOOL(0, "parallel", &parallel,
+		 N_("fetch in parallel from each remote")),
 	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
 	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
 		 N_("run 'gc --auto' after fetching")),
@@ -1456,12 +1459,15 @@  static void add_options_to_argv(struct argv_array *argv)
 
 }
 
-static int fetch_multiple(struct string_list *list)
+static int fetch_multiple(struct string_list *list, int i)
 {
-	int i, result = 0;
+	int result = 0;
 	struct argv_array argv = ARGV_ARRAY_INIT;
+	const char *name = list->items[i].string;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	int code;
 
-	if (!append && !dry_run) {
+	if (i == 0 && !append && !dry_run) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1469,20 +1475,31 @@  static int fetch_multiple(struct string_list *list)
 
 	argv_array_pushl(&argv, "fetch", "--append", "--no-auto-gc", NULL);
 	add_options_to_argv(&argv);
+	argv_array_push(&argv, name);
 
-	for (i = 0; i < list->nr; i++) {
-		const char *name = list->items[i].string;
-		argv_array_push(&argv, name);
-		if (verbosity >= 0)
-			printf(_("Fetching %s\n"), name);
-		if (run_command_v_opt(argv.argv, RUN_GIT_CMD)) {
-			error(_("Could not fetch %s"), name);
-			result = 1;
-		}
-		argv_array_pop(&argv);
+	if (verbosity >= 0)
+		printf(_("Fetching %s\n"), name);
+
+	cmd.argv = argv.argv;
+	cmd.git_cmd = 1;
+	code = start_command(&cmd);
+
+	if (!code && !parallel)
+		code = finish_command(&cmd);
+
+	if (i+1 < list->nr)
+		result |= fetch_multiple(list, i+1);
+
+	if (!code && parallel)
+		code |= finish_command(&cmd);
+
+	if (code) {
+		error(_("Could not fetch %s"), name);
+		result |= 1;
 	}
 
 	argv_array_clear(&argv);
+
 	return result;
 }
 
@@ -1696,7 +1713,7 @@  int cmd_fetch(int argc, const char **argv, const char *prefix)
 			die(_("--filter can only be used with the remote "
 			      "configured in extensions.partialclone"));
 		/* TODO should this also die if we have a previous partial-clone? */
-		result = fetch_multiple(&list);
+		result = fetch_multiple(&list, 0);
 	}
 
 	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {