Message ID | 20230302220251.1474923-6-calvinwan@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | submodule: parallelize diff | expand |
On Thu, Mar 02 2023, Calvin Wan wrote: Some of this is stuff I probably should have noted in earlier rounds, sorry, but then again the diff-churn in those made it harder to review, now that that's mostly out of the way (yay!) .... > +submodule.diffJobs:: > + Specifies how many submodules are diffed at the same time. A > + positive integer allows up to that number of submodules diffed > + in parallel. A value of 0 will give some reasonable default. > + If unset, it defaults to 1. The diff operation is used by many Nit: Maybe start a new paragraph as of "The diff..."? > + other git commands such as add, merge, diff, status, stash and > + more. Note that the expensive part of the diff operation is Nit: Maybe change 'add', 'merge' etc. to linkgit:git-add[1], or quote them? > + reading the index from cache or memory. Therefore multiple jobs With how much we conflate "the cache" and "index" saying "the index from cache" might be especially confusing. I think we can just skip " from cache or memory" here. > static int match_stat_with_submodule(struct diff_options *diffopt, > const struct cache_entry *ce, > struct stat *st, unsigned ce_option, > - unsigned *dirty_submodule) > + unsigned *dirty_submodule, int *defer_submodule_status, Nit: The other one is an "unsigned", shouldn't "defer_submodule_status" also be (more on this below). > + unsigned *ignore_untracked) > { > int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option); > + int defer = 0; > + > if (S_ISGITLINK(ce->ce_mode)) { > struct diff_flags orig_flags = diffopt->flags; > if (!diffopt->flags.override_submodule_config) > set_diffopt_flags_from_submodule_config(diffopt, ce->name); The meaty functional change here looks *much* better, thanks! I.e. this is pretty much what I suggested in https://lore.kernel.org/git/230208.861qn01s4g.gmgdl@evledraar.gmail.com/ > - if (diffopt->flags.ignore_submodules) > + if (diffopt->flags.ignore_submodules) { Not worth a re-roll in itself, but FWIW I think this change would be marginally easier to follow with *a* preceding refactoring change, but per the above & https://lore.kernel.org/git/230209.867cwrzk1l.gmgdl@evledraar.gmail.com/ I just didn't think v7's 6/7 (https://lore.kernel.org/git/20230207181706.363453-7-calvinwan@google.com/) was what we needed there. I.e. in this case a leading change that would add these braces would make this a bit easier to read... > changed = 0; > - else if (!diffopt->flags.ignore_dirty_submodules && ...ditto this line, which would stay the same. > - (!changed || diffopt->flags.dirty_submodules)) > - *dirty_submodule = is_submodule_modified(ce->name, > - diffopt->flags.ignore_untracked_in_submodules); Here you are incorrectly changing the indentation of this away from our usual coding style, which... > + } else if (!diffopt->flags.ignore_dirty_submodules && > + (!changed || diffopt->flags.dirty_submodules)) { > + if (defer_submodule_status && *defer_submodule_status) { Hrm, if if I remove that "&& *defer_submodule_status" all of our tests pass, the only two callers of this function are one where this is NULL, and where it's non-NULL but pre-initilized to 1, and the caller will check if it's then flipped to 0. > + defer = 1; > + *ignore_untracked = diffopt->flags.ignore_untracked_in_submodules; > + } else { > + *dirty_submodule = is_submodule_modified(ce->name, > + diffopt->flags.ignore_untracked_in_submodules); ...needlessly inflates the diff here, at least under -w and move detection, as we correctly detect the "*dirty_submodule" line as the same, but the "diffopt->flags" line also has a re-indentation change unrelated to adding the "else" scope. > + } > + } > diffopt->flags = orig_flags; > } > + > + if (defer_submodule_status) > + *defer_submodule_status = defer; Having read this whole thing to the end again I think this on top would be much simpler (if I'm right about it being functionally equivalent), and would address some of the above: diff --git a/diff-lib.c b/diff-lib.c index 7fe6ced9501..d5c823f512a 100644 --- a/diff-lib.c +++ b/diff-lib.c @@ -78,7 +78,6 @@ static int match_stat_with_submodule(struct diff_options *diffopt, unsigned *ignore_untracked) { int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option); - int defer = 0; if (S_ISGITLINK(ce->ce_mode)) { struct diff_flags orig_flags = diffopt->flags; @@ -88,8 +87,8 @@ static int match_stat_with_submodule(struct diff_options *diffopt, changed = 0; } else if (!diffopt->flags.ignore_dirty_submodules && (!changed || diffopt->flags.dirty_submodules)) { - if (defer_submodule_status && *defer_submodule_status) { - defer = 1; + if (defer_submodule_status) { + *defer_submodule_status = 1; *ignore_untracked = diffopt->flags.ignore_untracked_in_submodules; } else { *dirty_submodule = is_submodule_modified(ce->name, @@ -99,8 +98,6 @@ static int match_stat_with_submodule(struct diff_options *diffopt, diffopt->flags = orig_flags; } - if (defer_submodule_status) - *defer_submodule_status = defer; return changed; } @@ -153,7 +150,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) unsigned int newmode; struct cache_entry *ce = istate->cache[i]; int changed; - int defer_submodule_status = 1; + int defer_submodule_status = 0; if (diff_can_quit_early(&revs->diffopt)) break; We could also just leave it, but I for one found it a bit hard to follow that this interface seems to be a tri-state (NULL, set to 0, set to 1), but really it's dual-state, i.e. NULL or a "tell me to defer this" bit. > return changed; > } > > @@ -124,6 +140,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) > ? CE_MATCH_RACY_IS_DIRTY : 0); > uint64_t start = getnanotime(); > struct index_state *istate = revs->diffopt.repo->index; > + struct string_list submodules = STRING_LIST_INIT_NODUP; > > diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/"); > > @@ -136,7 +153,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) > unsigned int newmode; > struct cache_entry *ce = istate->cache[i]; > int changed; > - unsigned dirty_submodule = 0; > + int defer_submodule_status = 1; Hrm, having suggested the diff above I just noticed this now, I ended up inverting this, but found the "defer_submodule_status" name a bit odd, can't we just keep "unsigned dirty_submodule"? (that would also address the change from "unsigned" to "int" noted above, which is seeminly unnecessary). But maybe I'm missing a subtlety here, and we should have "deferred status" as apposed to "dirty submodule", but in any case the new one looks like it doesn't need negative values. > + } > + if (submodules.nr) { > + unsigned long parallel_jobs; > + struct string_list_item *item; > + > + if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) > + parallel_jobs = 1; > + else if (!parallel_jobs) > + parallel_jobs = online_cpus(); Given that online_cpus() returns int the "unsigned long" is slightly odd here, but it's because git_config_get_ulong() exist, but we have no git_config_get_uint(), so this is OK (but could be cleaned up as some #leftoverbits). > + if (get_submodules_status(&submodules, parallel_jobs)) > + die(_("submodule status failed")); Here we're adding get_submodules_status(), and returning the actual error code from "status", but then ignoring it here, and returning 128 for any non-zero. I think this would be better as either: code = get_submodules_status(...); die_message(...) exit(code); Or to just have the function itself return !!status, i.e. a "ok" or "not ok". Admittedly a nit, but I have spent quite a bit of time chasing down various exit-code losses in the submodule code, and it would be nice if we just carry the code up, or more explicitly ignore it, but don't add code that seems to care about it, but really doesn't. I also changed this "die" to a "BUG" and our tests passed, so we have no tests for when "status" failed, will such a thing even happen in practice? > + for_each_string_list_item(item, &submodules) { > + struct submodule_status_util *util = item->util; > + > + record_file_diff(&revs->diffopt, util->newmode, > + util->dirty_submodule, util->changed, > + istate, util->ce); > + } > } > + string_list_clear(&submodules, 1); > diffcore_std(&revs->diffopt); > diff_flush(&revs->diffopt); > trace_performance_since(start, "diff-files"); > @@ -322,7 +379,7 @@ static int get_stat_data(const struct index_state *istate, > return -1; > } > changed = match_stat_with_submodule(diffopt, ce, &st, > - 0, dirty_submodule); > + 0, dirty_submodule, NULL, NULL); > if (changed) { > mode = ce_mode_from_stat(ce, st.st_mode); > oid = null_oid(); > diff --git a/submodule.c b/submodule.c > index 426074cebb..6f6e150a3f 100644 > --- a/submodule.c > +++ b/submodule.c > @@ -1373,6 +1373,13 @@ int submodule_touches_in_range(struct repository *r, > return ret; > } > > +struct submodule_parallel_status { > + size_t index_count; > + int result; > + > + struct string_list *submodule_names; > +}; Hrm, actually reading a bit more I think part of my comments above are incorrect, i.e. this "result" seems like an exit code, but really in the guts of the API we're ignoring the actual code we get, and just setting this to 1. Per the above I think it might be OK to ignore the exit code (or not), but I really wish we did this more explicitly, e.g. if you want to ignore it call this something like "failed", not "result", and make it an "unsigned int failed:1" to firmly indicate that it's a boolean at the API level. > +struct status_task { > + const char *path; I think we should call this "ce_path", but more on that below. > + struct strbuf out; > + int ignore_untracked; Continued type mismatch commentary: Elsewhere in this diff this is "unsigned", and this compiles for me if I make it "unsigned int ignore_untracked:1", so let's set it to such a flag instead? > +static int status_finish(int retvalue, struct strbuf *err, > + void *cb, void *task_cb) > +{ > + struct submodule_parallel_status *sps = cb; > + struct status_task *task = task_cb; > + struct string_list_item *it = > + string_list_lookup(sps->submodule_names, task->path); > + struct submodule_status_util *util = it->util; > + struct string_list list = STRING_LIST_INIT_DUP; > + struct string_list_item *item; > + > + if (retvalue) { > + sps->result = 1; > + strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path); > + } > + > + string_list_split(&list, task->out.buf, '\n', -1); I think I noted in some earlier round that taking a string and splitting it by \n was a bit wasteful in the test code, but this uses the same pattern. Maybe it's not a performance concern here either, but won't we potentially have to parse some very large statuses here? Aside from that, I haven't tried or reviewed this bit in detail, but this seems to be making things harder than they need to be. Why are we buffering up all of the output into "out" here, only to split it by "\n" later on, and then consider each line as a status line? Shouldn't we be allocating this string_list to begin with, and append to it in the "status_on_stderr_output" callback instead? > + for_each_string_list_item(item, &list) { > + if (parse_status_porcelain(item->string, > + strlen(item->string), > + &util->dirty_submodule, > + util->ignore_untracked)) OK, this seemingly buggy bit of error handling seems to actually be OK on further review, because we'll BUG() out in the function if it fails, so the non-zero return here just means "we're done here". > + break; > + } Style: drop the braces here, as this is just a for/if/body with a single body line. > +int get_submodules_status(struct string_list *submodules, > + int max_parallel_jobs) > +{ > + struct submodule_parallel_status sps = { > + .submodule_names = submodules, > + }; > + const struct run_process_parallel_opts opts = { > + .tr2_category = "submodule", > + .tr2_label = "parallel/status", > + > + .processes = max_parallel_jobs, > + > + .get_next_task = get_next_submodule_status, > + .start_failure = status_start_failure, > + .on_stderr_output = status_on_stderr_output, > + .task_finished = status_finish, > + .data = &sps, > + }; > + > + string_list_sort(sps.submodule_names); > + run_processes_parallel(&opts); > + > + return sps.result; All OK, except as noted above the "result" here is just "did we fail?". > +} > + > int submodule_uses_gitfile(const char *path) > { > struct child_process cp = CHILD_PROCESS_INIT; > diff --git a/submodule.h b/submodule.h > index b52a4ff1e7..08d278a414 100644 > --- a/submodule.h > +++ b/submodule.h > @@ -41,6 +41,13 @@ struct submodule_update_strategy { > .type = SM_UPDATE_UNSPECIFIED, \ > } > > +struct submodule_status_util { > + int changed, ignore_untracked; > + unsigned dirty_submodule, newmode; > + struct cache_entry *ce; > + const char *path; Re "ce_path" above: What's the point of adding a "path" here if we already have "ce"? You just seem to assign "path" to "ce->name" always. I tried this fix-up on top & it worked: diff --git a/diff-lib.c b/diff-lib.c index d5c823f512a..39d8179f0ed 100644 --- a/diff-lib.c +++ b/diff-lib.c @@ -294,7 +294,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option) .ignore_untracked = ignore_untracked, .newmode = newmode, .ce = ce, - .path = ce->name, }; struct string_list_item *item; diff --git a/submodule.c b/submodule.c index 3eba00f1533..c220d85815a 100644 --- a/submodule.c +++ b/submodule.c @@ -2002,11 +2002,11 @@ get_status_task_from_index(struct submodule_parallel_status *sps, struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util; struct status_task *task; - if (!verify_submodule_git_directory(util->path)) + if (!verify_submodule_git_directory(util->ce->name)) continue; task = xmalloc(sizeof(*task)); - task->path = util->path; + task->path = util->ce->name; task->ignore_untracked = util->ignore_untracked; strbuf_init(&task->out, 0); sps->index_count++; diff --git a/submodule.h b/submodule.h index 3b6abca05cd..3427c495573 100644 --- a/submodule.h +++ b/submodule.h @@ -45,7 +45,6 @@ struct submodule_status_util { int changed, ignore_untracked; unsigned dirty_submodule, newmode; struct cache_entry *ce; - const char *path; }; int is_gitmodules_unmerged(struct index_state *istate); I'd be all for actually narrowing the scope of data we get in general, i.e. do we need all of the "ce" members? I didn't check, but doing this just seems like needless duplication. > @@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r, > int command_line_option, > int default_option, > int quiet, int max_parallel_jobs); > +int get_submodules_status(struct string_list *submodules, > + int max_parallel_jobs); It would be nice to get some API docs for the new function, re its "result" behavior etc. noted above > unsigned is_submodule_modified(const char *path, int ignore_untracked); > int submodule_uses_gitfile(const char *path); > > diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh > index 40164ae07d..1c747cc325 100755 > --- a/t/t4027-diff-submodule.sh > +++ b/t/t4027-diff-submodule.sh > @@ -34,6 +34,25 @@ test_expect_success setup ' > subtip=$3 subprev=$2 > ' > > +test_expect_success 'diff in superproject with submodules respects parallel settings' ' > + test_when_finished "rm -f trace.out" && > + ( > + GIT_TRACE=$(pwd)/trace.out git diff && > + grep "1 tasks" trace.out && > + >trace.out && > + > + git config submodule.diffJobs 8 && > + GIT_TRACE=$(pwd)/trace.out git diff && > + grep "8 tasks" trace.out && > + >trace.out && > + > + GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff && > + grep "preparing to run up to [0-9]* tasks" trace.out && > + ! grep "up to 0 tasks" trace.out && > + >trace.out > + ) > +' > + > test_expect_success 'git diff --raw HEAD' ' > hexsz=$(test_oid hexsz) && > git diff --raw --abbrev=$hexsz HEAD >actual && > @@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' ' > test_cmp expect.body actual.body > ' > > +test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' ' > + ( > + cd sub && > + git reset --hard && > + echo >>world > + ) && > + git -c submodule.diffJobs=8 diff HEAD >actual && > + sed -e "1,/^@@/d" actual >actual.body && > + expect_from_to >expect.body $subtip $subprev-dirty && > + test_cmp expect.body actual.body > +' > + > test_expect_success 'git diff HEAD with dirty submodule (index)' ' > ( > cd sub && > diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh > index d050091345..7da64e4c4c 100755 > --- a/t/t7506-status-submodule.sh > +++ b/t/t7506-status-submodule.sh > @@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' ' > EOF > ' > > +test_expect_success 'status in superproject with submodules respects parallel settings' ' > + test_when_finished "rm -f trace.out" && > + ( > + GIT_TRACE=$(pwd)/trace.out git status && > + grep "1 tasks" trace.out && > + >trace.out && > + > + git config submodule.diffJobs 8 && > + GIT_TRACE=$(pwd)/trace.out git status && > + grep "8 tasks" trace.out && > + >trace.out && > + > + GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status && > + grep "preparing to run up to [0-9]* tasks" trace.out && > + ! grep "up to 0 tasks" trace.out && > + >trace.out > + ) > +' > + > +test_expect_success 'status in superproject with submodules (parallel)' ' > + git -C super status --porcelain >output && > + git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel && > + diff output output_parallel Shouldn't this be a "test_cmp" instead of "diff", and use "actual" and "expect" instead of "output" and "output_parallel"? I'd also rename the test to something like "output with submodule.diffJobs=N equals submodule.diffJobs=1". Except is that even correct? Don't we need to set submodule.diffJobs=1 explicitly so it doesn't default to online_cpus() here? Maybe I missed an earlier config setup...
On Thu, Mar 02 2023, Calvin Wan wrote: > + if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) > + parallel_jobs = 1; Something I missed when eyeballing this in my just-sent review, here we have a "revs->repo" already, so let's not fall back on "the_repository", but use it. I think you want this as a fix-up: diff --git a/diff-lib.c b/diff-lib.c index 925d64ff58c..ec8a0f98085 100644 --- a/diff-lib.c +++ b/diff-lib.c @@ -312,7 +312,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option) unsigned long parallel_jobs; struct string_list_item *item; - if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) + if (repo_config_get_ulong(revs->repo, "submodule.diffjobs", + ¶llel_jobs)) parallel_jobs = 1; else if (!parallel_jobs) parallel_jobs = online_cpus();
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > On Thu, Mar 02 2023, Calvin Wan wrote: > >> + if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) >> + parallel_jobs = 1; > > Something I missed when eyeballing this in my just-sent review, here we > have a "revs->repo" already, so let's not fall back on "the_repository", > but use it. I think you want this as a fix-up: > > diff --git a/diff-lib.c b/diff-lib.c > index 925d64ff58c..ec8a0f98085 100644 > --- a/diff-lib.c > +++ b/diff-lib.c > @@ -312,7 +312,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option) > unsigned long parallel_jobs; > struct string_list_item *item; > > - if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) > + if (repo_config_get_ulong(revs->repo, "submodule.diffjobs", > + ¶llel_jobs)) > parallel_jobs = 1; > else if (!parallel_jobs) > parallel_jobs = online_cpus(); Good eyes. Thanks for a careful review.
I haven't verified if the code in this version is correct or not, as I found it a bit difficult to follow through the churn. After reading this series again, I've established a better mental model of the code, and I think there are some renames and documentation changes we can make to make this clearer. Unfortunately, I think the biggest clarification would be _yet_ another refactor, and I'm not sure if we actually want to bear so much churn. I might do this refactor locally to see if it really is _much_ cleaner or not. If anyone has thoughts on the refactor, do chime in. Calvin Wan <calvinwan@google.com> writes: > diff --git a/diff-lib.c b/diff-lib.c > index 744ae98a69..7fe6ced950 100644 > --- a/diff-lib.c > +++ b/diff-lib.c > @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en > * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES > * option is set, the caller does not only want to know if a submodule is > * modified at all but wants to know all the conditions that are met (new > - * commits, untracked content and/or modified content). > + * commits, untracked content and/or modified content). If > + * defer_submodule_status bit is set, dirty_submodule will be left to the > + * caller to set. defer_submodule_status can also be set to 0 in this > + * function if there is no need to check if the submodule is modified. > */ > static int match_stat_with_submodule(struct diff_options *diffopt, > const struct cache_entry *ce, > struct stat *st, unsigned ce_option, > - unsigned *dirty_submodule) > + unsigned *dirty_submodule, int *defer_submodule_status, > + unsigned *ignore_untracked) > { > int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option); > + int defer = 0; > + > if (S_ISGITLINK(ce->ce_mode)) { > struct diff_flags orig_flags = diffopt->flags; > if (!diffopt->flags.override_submodule_config) > set_diffopt_flags_from_submodule_config(diffopt, ce->name); > - if (diffopt->flags.ignore_submodules) > + if (diffopt->flags.ignore_submodules) { > changed = 0; > - else if (!diffopt->flags.ignore_dirty_submodules && > - (!changed || diffopt->flags.dirty_submodules)) > - *dirty_submodule = is_submodule_modified(ce->name, > - diffopt->flags.ignore_untracked_in_submodules); > + } else if (!diffopt->flags.ignore_dirty_submodules && > + (!changed || diffopt->flags.dirty_submodules)) { > + if (defer_submodule_status && *defer_submodule_status) { > + defer = 1; > + *ignore_untracked = diffopt->flags.ignore_untracked_in_submodules; > + } else { > + *dirty_submodule = is_submodule_modified(ce->name, > + diffopt->flags.ignore_untracked_in_submodules); > + } > + } > diffopt->flags = orig_flags; > } > + > + if (defer_submodule_status) > + *defer_submodule_status = defer; The crux of this patch is that we are replacing some serial operation with a parallel operation. The replacement happens here, where we are replacing is_submodule_modified() by 'deferring' it. So to verify if the parallel implementation is correct, we should compare the "setup" and "finish" steps in is_submodule_modified() and get_submodules_status(). Eyeballing it, it looks correct, especially because we made sure to refactor out the shared logic in previous patches. To reflect this, I think it would be clearer to rename get_submodules_status() to something similar (e.g. are_submodules_modified_parallel()), with an explicit comment saying that it is meant to be a parallel implementation of is_submodule_modified(). Except, I told a little white lie in the previous paragraph, because get_submodules_status() isn't _just_ a parallel implementation of is_submodule_modified()... > @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option) > } > > changed = match_stat_with_submodule(&revs->diffopt, ce, &st, > - ce_option, &dirty_submodule); > + ce_option, NULL, > + &defer_submodule_status, > + &ignore_untracked); > newmode = ce_mode_from_stat(ce, st.st_mode); > + if (defer_submodule_status) { > + struct submodule_status_util tmp = { > + .changed = changed, > + .dirty_submodule = 0, > + .ignore_untracked = ignore_untracked, > + .newmode = newmode, > + .ce = ce, > + .path = ce->name, > + }; > + struct string_list_item *item; > + > + item = string_list_append(&submodules, ce->name); > + item->util = xmalloc(sizeof(tmp)); > + memcpy(item->util, &tmp, sizeof(tmp)); > + continue; > + } because get_submodules_status() doesn't just contain the results of the parallel processes, it is _also_ shuttling "changed" and "ignore_untracked" from match_stat_with_submodule(), as well as .newmode, .ce and .path from run_diff_files() (basically everything except .dirty_submodule)... > } > > - record_file_diff(&revs->diffopt, newmode, dirty_submodule, > - changed, istate, ce); > + if (!defer_submodule_status) > + record_file_diff(&revs->diffopt, newmode, 0, > + changed,istate, ce); > + } > + if (submodules.nr) { > + unsigned long parallel_jobs; > + struct string_list_item *item; > + > + if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) > + parallel_jobs = 1; > + else if (!parallel_jobs) > + parallel_jobs = online_cpus(); > + > + if (get_submodules_status(&submodules, parallel_jobs)) > + die(_("submodule status failed")); > + for_each_string_list_item(item, &submodules) { > + struct submodule_status_util *util = item->util; > + > + record_file_diff(&revs->diffopt, util->newmode, > + util->dirty_submodule, util->changed, > + istate, util->ce); > + } so that we can pass all of this back into record_file_diff(). The only member that is changed by the parallel process is .dirty_submodule, which is exactly what we would expect from a parallel version of is_submodule_modified(). If we don't want to do a bigger refactor, I think we should also add comments to members of "struct submodule_status_util" to document where they come from and what they are used for. The rest of the comments are refactor-related. It would be good if we could avoid mixing unrelated information sources in "struct submodule_status_util", since a) this makes it very tightly coupled to run_diff_files() and b) it causes us to repeat ourselves in the same function (.changed = changed, record_file_diff()). The only reason why the code looks this way right now is that match_stat_with_submodule() sets defer_submodule_status based on whether or not we should ignore the submodule, and this eventually tells get_submodule_status() what submodules it needs to care about. But, deciding whether to spawn a subprocess for which submodule is exactly what the .get_next_task member is for. > diff --git a/submodule.c b/submodule.c > index 426074cebb..6f6e150a3f 100644 > --- a/submodule.c > +++ b/submodule.c > @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked) > return dirty_submodule; > } > > +static struct status_task * > +get_status_task_from_index(struct submodule_parallel_status *sps, > + struct strbuf *err) > +{ > + for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) { > + struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util; > + struct status_task *task; > + > + if (!verify_submodule_git_directory(util->path)) > + continue; So right here, we could use the "check if this submodule should be ignored" logic form match_stat_with_submodule() to decide whether or not to spawn the subprocess. IOW, I am advocating for get_submodules_status() to be a parallel version of match_stat_with_submodule() (not a parallel version of is_submodule_modified() that shuttles extra information). Another sign that this refactor is a good idea is that it lets us simplify _existing_ submodule logic in run_diff_files(). Prior to this patch, we have: unsigned dirty_submodule = 0; ... changed = match_stat_with_submodule(&revs->diffopt, ce, &st, ce_option, NULL, &defer_submodule_status, &ignore_untracked); // If submodule was deferred, shuttle a bunch of information // If not, call record_file_diff() but the body of match_stat_with_submodule() is just ie_match_stat() + some additional submodule logic. Post refactor, this would look something like: struct string_list submodules; ... // For any submodule, just append it to a list and let the // parallel thing take care of it. if (S_ISGITLINK(ce->ce_mode) { // Probably pass .newmode and .ce to the util too... string_list_append(submodules, ce->name); } else { changed = ie_match_stat(foo, bar, baz); record_file_diff(); } ... if (submodules.nr) { parallel_match_stat_with_submodule_wip_name(&submodules); for_each_string_list_item(item, &submodules) { record_file_diff(&item); } } Which I think is easier to follow, since we won't need defer_submodule_status any more, and we don't shuttle information from match_stat_with_submodule(). Though I'm a bit unhappy that it's still pretty coupled to run_diff_files() (it still has to shuttle .newmode, .ce). Also, I don't think this refactor lets us avoid the refactors we did in the previous patches. > + > + task = xmalloc(sizeof(*task)); > + task->path = util->path; > + task->ignore_untracked = util->ignore_untracked; > + strbuf_init(&task->out, 0); > + sps->index_count++; > + return task; > + } > + return NULL; > +} > + > +static int get_next_submodule_status(struct child_process *cp, > + struct strbuf *err, void *data, > + void **task_cb) > +{ > + struct submodule_parallel_status *sps = data; > + struct status_task *task = get_status_task_from_index(sps, err); As an aside, I think we can inline get_status_task_from_index(). I suspect this pattern was copied from get_next_submodule(), which gets fetch tasks from two different places (hence _from_index and _from_changed), but here I don't think we will ever get status tasks from more than one place.
Glen Choo <chooglen@google.com> writes: > It would be good if we could avoid mixing unrelated information sources > in "struct submodule_status_util", since a) this makes it very tightly > coupled to run_diff_files() and b) it causes us to repeat ourselves in > the same function (.changed = changed, record_file_diff()). > > The only reason why the code looks this way right now is that > match_stat_with_submodule() sets defer_submodule_status based on whether > or not we should ignore the submodule, and this eventually tells > get_submodule_status() what submodules it needs to care about. But, > deciding whether to spawn a subprocess for which submodule is exactly > what the .get_next_task member is for. > >> diff --git a/submodule.c b/submodule.c >> index 426074cebb..6f6e150a3f 100644 >> --- a/submodule.c >> +++ b/submodule.c >> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked) >> return dirty_submodule; >> } >> >> +static struct status_task * >> +get_status_task_from_index(struct submodule_parallel_status *sps, >> + struct strbuf *err) >> +{ >> + for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) { >> + struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util; >> + struct status_task *task; >> + >> + if (!verify_submodule_git_directory(util->path)) >> + continue; > > So right here, we could use the "check if this submodule should be > ignored" logic form match_stat_with_submodule() to decide whether or not > to spawn the subprocess. IOW, I am advocating for > get_submodules_status() to be a parallel version of > match_stat_with_submodule() (not a parallel version of > is_submodule_modified() that shuttles extra information). It turns out to be quite difficult to implement a parallel match_stat_with_submodule(): a) we can't remove it because it still has another caller b) its internals are quite hard to refactor: one conditional arm depends on "changed", which is set by calling ie_match_stat(), which in turn requires the "struct stat" to have already been lstat()-ed... So even though this series adds a lot, it is just about as minimally invasive as possible. I suspect that there are some possible cleanups down the line, e.g. is_submodule_modified() is rightfully only called by diff-lib.c , so I think it should be a static function there. And once we move that, we can make our parallel function static, and then we don't have to worry about tight coupling to run_diff_files(). To keep the range-diff manageable, that can be left for a future cleanup though.
diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt index 6490527b45..3209eb8117 100644 --- a/Documentation/config/submodule.txt +++ b/Documentation/config/submodule.txt @@ -93,6 +93,18 @@ submodule.fetchJobs:: in parallel. A value of 0 will give some reasonable default. If unset, it defaults to 1. +submodule.diffJobs:: + Specifies how many submodules are diffed at the same time. A + positive integer allows up to that number of submodules diffed + in parallel. A value of 0 will give some reasonable default. + If unset, it defaults to 1. The diff operation is used by many + other git commands such as add, merge, diff, status, stash and + more. Note that the expensive part of the diff operation is + reading the index from cache or memory. Therefore multiple jobs + may be detrimental to performance if your hardware does not + support parallel reads or if the number of jobs greatly exceeds + the amount of supported reads. + submodule.alternateLocation:: Specifies how the submodules obtain alternates when submodules are cloned. Possible values are `no`, `superproject`. diff --git a/diff-lib.c b/diff-lib.c index 744ae98a69..7fe6ced950 100644 --- a/diff-lib.c +++ b/diff-lib.c @@ -14,6 +14,7 @@ #include "dir.h" #include "fsmonitor.h" #include "commit-reach.h" +#include "config.h" /* * diff-files @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES * option is set, the caller does not only want to know if a submodule is * modified at all but wants to know all the conditions that are met (new - * commits, untracked content and/or modified content). + * commits, untracked content and/or modified content). If + * defer_submodule_status bit is set, dirty_submodule will be left to the + * caller to set. defer_submodule_status can also be set to 0 in this + * function if there is no need to check if the submodule is modified. */ static int match_stat_with_submodule(struct diff_options *diffopt, const struct cache_entry *ce, struct stat *st, unsigned ce_option, - unsigned *dirty_submodule) + unsigned *dirty_submodule, int *defer_submodule_status, + unsigned *ignore_untracked) { int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option); + int defer = 0; + if (S_ISGITLINK(ce->ce_mode)) { struct diff_flags orig_flags = diffopt->flags; if (!diffopt->flags.override_submodule_config) set_diffopt_flags_from_submodule_config(diffopt, ce->name); - if (diffopt->flags.ignore_submodules) + if (diffopt->flags.ignore_submodules) { changed = 0; - else if (!diffopt->flags.ignore_dirty_submodules && - (!changed || diffopt->flags.dirty_submodules)) - *dirty_submodule = is_submodule_modified(ce->name, - diffopt->flags.ignore_untracked_in_submodules); + } else if (!diffopt->flags.ignore_dirty_submodules && + (!changed || diffopt->flags.dirty_submodules)) { + if (defer_submodule_status && *defer_submodule_status) { + defer = 1; + *ignore_untracked = diffopt->flags.ignore_untracked_in_submodules; + } else { + *dirty_submodule = is_submodule_modified(ce->name, + diffopt->flags.ignore_untracked_in_submodules); + } + } diffopt->flags = orig_flags; } + + if (defer_submodule_status) + *defer_submodule_status = defer; return changed; } @@ -124,6 +140,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) ? CE_MATCH_RACY_IS_DIRTY : 0); uint64_t start = getnanotime(); struct index_state *istate = revs->diffopt.repo->index; + struct string_list submodules = STRING_LIST_INIT_NODUP; diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/"); @@ -136,7 +153,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) unsigned int newmode; struct cache_entry *ce = istate->cache[i]; int changed; - unsigned dirty_submodule = 0; + int defer_submodule_status = 1; if (diff_can_quit_early(&revs->diffopt)) break; @@ -247,6 +264,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option) newmode = ce->ce_mode; } else { struct stat st; + unsigned ignore_untracked = 0; changed = check_removed(istate, ce, &st); if (changed) { @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option) } changed = match_stat_with_submodule(&revs->diffopt, ce, &st, - ce_option, &dirty_submodule); + ce_option, NULL, + &defer_submodule_status, + &ignore_untracked); newmode = ce_mode_from_stat(ce, st.st_mode); + if (defer_submodule_status) { + struct submodule_status_util tmp = { + .changed = changed, + .dirty_submodule = 0, + .ignore_untracked = ignore_untracked, + .newmode = newmode, + .ce = ce, + .path = ce->name, + }; + struct string_list_item *item; + + item = string_list_append(&submodules, ce->name); + item->util = xmalloc(sizeof(tmp)); + memcpy(item->util, &tmp, sizeof(tmp)); + continue; + } } - record_file_diff(&revs->diffopt, newmode, dirty_submodule, - changed, istate, ce); + if (!defer_submodule_status) + record_file_diff(&revs->diffopt, newmode, 0, + changed,istate, ce); + } + if (submodules.nr) { + unsigned long parallel_jobs; + struct string_list_item *item; + + if (git_config_get_ulong("submodule.diffjobs", ¶llel_jobs)) + parallel_jobs = 1; + else if (!parallel_jobs) + parallel_jobs = online_cpus(); + + if (get_submodules_status(&submodules, parallel_jobs)) + die(_("submodule status failed")); + for_each_string_list_item(item, &submodules) { + struct submodule_status_util *util = item->util; + + record_file_diff(&revs->diffopt, util->newmode, + util->dirty_submodule, util->changed, + istate, util->ce); + } } + string_list_clear(&submodules, 1); diffcore_std(&revs->diffopt); diff_flush(&revs->diffopt); trace_performance_since(start, "diff-files"); @@ -322,7 +379,7 @@ static int get_stat_data(const struct index_state *istate, return -1; } changed = match_stat_with_submodule(diffopt, ce, &st, - 0, dirty_submodule); + 0, dirty_submodule, NULL, NULL); if (changed) { mode = ce_mode_from_stat(ce, st.st_mode); oid = null_oid(); diff --git a/submodule.c b/submodule.c index 426074cebb..6f6e150a3f 100644 --- a/submodule.c +++ b/submodule.c @@ -1373,6 +1373,13 @@ int submodule_touches_in_range(struct repository *r, return ret; } +struct submodule_parallel_status { + size_t index_count; + int result; + + struct string_list *submodule_names; +}; + struct submodule_parallel_fetch { /* * The index of the last index entry processed by @@ -1455,6 +1462,12 @@ struct fetch_task { struct oid_array *commits; /* Ensure these commits are fetched */ }; +struct status_task { + const char *path; + struct strbuf out; + int ignore_untracked; +}; + /** * When a submodule is not defined in .gitmodules, we cannot access it * via the regular submodule-config. Create a fake submodule, which we can @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked) return dirty_submodule; } +static struct status_task * +get_status_task_from_index(struct submodule_parallel_status *sps, + struct strbuf *err) +{ + for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) { + struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util; + struct status_task *task; + + if (!verify_submodule_git_directory(util->path)) + continue; + + task = xmalloc(sizeof(*task)); + task->path = util->path; + task->ignore_untracked = util->ignore_untracked; + strbuf_init(&task->out, 0); + sps->index_count++; + return task; + } + return NULL; +} + +static int get_next_submodule_status(struct child_process *cp, + struct strbuf *err, void *data, + void **task_cb) +{ + struct submodule_parallel_status *sps = data; + struct status_task *task = get_status_task_from_index(sps, err); + + if (!task) + return 0; + + child_process_init(cp); + prepare_submodule_repo_env_in_gitdir(&cp->env); + prepare_status_porcelain(cp, task->path, task->ignore_untracked); + *task_cb = task; + return 1; +} + +static int status_start_failure(struct strbuf *err, + void *cb, void *task_cb) +{ + struct submodule_parallel_status *sps = cb; + struct status_task *task = task_cb; + + sps->result = 1; + strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path); + return 0; +} + +static void status_on_stderr_output(struct strbuf *out, + size_t offset, + void *cb, void *task_cb) +{ + struct status_task *task = task_cb; + + strbuf_add(&task->out, out->buf + offset, out->len - offset); + strbuf_setlen(out, offset); +} + +static int status_finish(int retvalue, struct strbuf *err, + void *cb, void *task_cb) +{ + struct submodule_parallel_status *sps = cb; + struct status_task *task = task_cb; + struct string_list_item *it = + string_list_lookup(sps->submodule_names, task->path); + struct submodule_status_util *util = it->util; + struct string_list list = STRING_LIST_INIT_DUP; + struct string_list_item *item; + + if (retvalue) { + sps->result = 1; + strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path); + } + + string_list_split(&list, task->out.buf, '\n', -1); + for_each_string_list_item(item, &list) { + if (parse_status_porcelain(item->string, + strlen(item->string), + &util->dirty_submodule, + util->ignore_untracked)) + break; + } + string_list_clear(&list, 0); + strbuf_release(&task->out); + free(task); + + return 0; +} + +int get_submodules_status(struct string_list *submodules, + int max_parallel_jobs) +{ + struct submodule_parallel_status sps = { + .submodule_names = submodules, + }; + const struct run_process_parallel_opts opts = { + .tr2_category = "submodule", + .tr2_label = "parallel/status", + + .processes = max_parallel_jobs, + + .get_next_task = get_next_submodule_status, + .start_failure = status_start_failure, + .on_stderr_output = status_on_stderr_output, + .task_finished = status_finish, + .data = &sps, + }; + + string_list_sort(sps.submodule_names); + run_processes_parallel(&opts); + + return sps.result; +} + int submodule_uses_gitfile(const char *path) { struct child_process cp = CHILD_PROCESS_INIT; diff --git a/submodule.h b/submodule.h index b52a4ff1e7..08d278a414 100644 --- a/submodule.h +++ b/submodule.h @@ -41,6 +41,13 @@ struct submodule_update_strategy { .type = SM_UPDATE_UNSPECIFIED, \ } +struct submodule_status_util { + int changed, ignore_untracked; + unsigned dirty_submodule, newmode; + struct cache_entry *ce; + const char *path; +}; + int is_gitmodules_unmerged(struct index_state *istate); int is_writing_gitmodules_ok(void); int is_staging_gitmodules_ok(struct index_state *istate); @@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r, int command_line_option, int default_option, int quiet, int max_parallel_jobs); +int get_submodules_status(struct string_list *submodules, + int max_parallel_jobs); unsigned is_submodule_modified(const char *path, int ignore_untracked); int submodule_uses_gitfile(const char *path); diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh index 40164ae07d..1c747cc325 100755 --- a/t/t4027-diff-submodule.sh +++ b/t/t4027-diff-submodule.sh @@ -34,6 +34,25 @@ test_expect_success setup ' subtip=$3 subprev=$2 ' +test_expect_success 'diff in superproject with submodules respects parallel settings' ' + test_when_finished "rm -f trace.out" && + ( + GIT_TRACE=$(pwd)/trace.out git diff && + grep "1 tasks" trace.out && + >trace.out && + + git config submodule.diffJobs 8 && + GIT_TRACE=$(pwd)/trace.out git diff && + grep "8 tasks" trace.out && + >trace.out && + + GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff && + grep "preparing to run up to [0-9]* tasks" trace.out && + ! grep "up to 0 tasks" trace.out && + >trace.out + ) +' + test_expect_success 'git diff --raw HEAD' ' hexsz=$(test_oid hexsz) && git diff --raw --abbrev=$hexsz HEAD >actual && @@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' ' test_cmp expect.body actual.body ' +test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' ' + ( + cd sub && + git reset --hard && + echo >>world + ) && + git -c submodule.diffJobs=8 diff HEAD >actual && + sed -e "1,/^@@/d" actual >actual.body && + expect_from_to >expect.body $subtip $subprev-dirty && + test_cmp expect.body actual.body +' + test_expect_success 'git diff HEAD with dirty submodule (index)' ' ( cd sub && diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh index d050091345..7da64e4c4c 100755 --- a/t/t7506-status-submodule.sh +++ b/t/t7506-status-submodule.sh @@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' ' EOF ' +test_expect_success 'status in superproject with submodules respects parallel settings' ' + test_when_finished "rm -f trace.out" && + ( + GIT_TRACE=$(pwd)/trace.out git status && + grep "1 tasks" trace.out && + >trace.out && + + git config submodule.diffJobs 8 && + GIT_TRACE=$(pwd)/trace.out git status && + grep "8 tasks" trace.out && + >trace.out && + + GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status && + grep "preparing to run up to [0-9]* tasks" trace.out && + ! grep "up to 0 tasks" trace.out && + >trace.out + ) +' + +test_expect_success 'status in superproject with submodules (parallel)' ' + git -C super status --porcelain >output && + git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel && + diff output output_parallel +' + test_done
During the iteration of the index entries in run_diff_files, whenever a submodule is found and needs its status checked, a subprocess is spawned for it. Instead of spawning the subprocess immediately and waiting for its completion to continue, hold onto all submodules and relevant information in a list. Then use that list to create tasks for run_processes_parallel. Subprocess output is passed to status_on_stderr_output which stores it to be parsed on completion of the subprocess. Add config option submodule.diffJobs to set the maximum number of parallel jobs. The option defaults to 1 if unset. If set to 0, the number of jobs is set to online_cpus(). Since run_diff_files is called from many different commands, I chose to grab the config option in the function rather than adding variables to every git command and then figuring out how to pass them all in. Signed-off-by: Calvin Wan <calvinwan@google.com> --- Documentation/config/submodule.txt | 12 +++ diff-lib.c | 81 +++++++++++++++--- submodule.c | 128 +++++++++++++++++++++++++++++ submodule.h | 9 ++ t/t4027-diff-submodule.sh | 31 +++++++ t/t7506-status-submodule.sh | 25 ++++++ 6 files changed, 274 insertions(+), 12 deletions(-)