Message ID | 20200909004939.1942347-2-emilyshaffer@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | propose config-based hooks | expand |
For this review, I'll just concern myself with overall design and structure. For this patch, overall I think it's better if there's a clear distinction between what we are implementing now and what we are implementing later. > +[[motivation]] > +== Motivation > + > +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as > +the only source of hooks to execute, in a way which is friendly to users with > +multiple repos which have similar needs. I don't understand what "first-class citizen" here means - probably better to just omit that phrase and describe the new way of doing hooks. > +[[config-schema-hook]] > +==== `hook` > + > +Primarily contains subsections for each hook event. These order of these > +subsections defines the hook command execution order The execution order is defined by the order of a multivalue config variable, I think, not the order of subsections? Besides, I believe that there's one subsection per hook event (e.g. hook."pre-commit"), not one subsection per command. > ; hook commands can be > +specified by setting the value directly to the command if no additional > +configuration is needed, or by setting the value as the name of a `hookcmd`. If > +Git does not find a `hookcmd` whose subsection matches the value of the given > +command string, Git will try to execute the string directly. Hooks are executed > +by passing the resolved command string to the shell. [snip] > Hook event subsections can > +also contain per-hook-event settings. If this is not yet implemented, maybe list under "future work". > + > +Also contains top-level hook execution settings, for example, > +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`. (These settings are > +described more in <<library,Library>>.) I think it's clearer if you list this under "future work" - I didn't see any implementation of this. > +[hook "pre-commit"] > + command = perl-linter > + command = /usr/bin/git-secrets --pre-commit > + > +[hook "pre-applypatch"] > + command = perl-linter > + error = ignore Is "error" implemented? > + > +[hook] > + runHookDir = interactive Same question for "runHookDir". > +[[config-schema-hookcmd]] > +==== `hookcmd` > + > +Defines a hook command and its attributes, which will be used when a hook event > +occurs. Unqualified attributes are assumed to apply to this hook during all hook > +events, but event-specific attributes can also be supplied. The example runs > +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which > +include this config, the hook command will be skipped for all events to which > +it's normally subscribed _except_ `pre-commit`. > + > +---- > +[hookcmd "perl-linter"] > + command = /usr/bin/lint-it --language=perl > + skip = true > + pre-commit-skip = false > +---- And the skips. (And several more below which I will skip.) > +If the caller wants to do something more complicated, the hook library can also > +provide a callback API: > + > +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`* Is there a use case that would need such a function? > +[[migration]] > +=== Migration path > + > +[[stage-0]] > +==== Stage 0 > + > +Hooks are called by running `run-command.h:find_hook()` with the hookname and > +executing the result. The hook library and builtin do not exist. Hooks only > +exist as specially named scripts within `.git/hooks/`. > + > +[[stage-1]] > +==== Stage 1 > + > +`git hook list --porcelain <hook-event>` is implemented. Users can replace their > +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s > +output. Modifier commands like `git hook add` and `git hook edit` can be > +implemented around this time as well. This seems to contradict patch 8, which teaches Git to use the configs directly without any change to .git/hooks/<hook-event> (at least for certain commit-related hooks). > +[[future-work]] > +== Future work > + > +[[execution-ordering]] > +=== Execution ordering > + > +We may find that config order is insufficient for some users; for example, > +config order makes it difficult to add a new hook to the system or global config > +which runs at the end of the hook list. A new ordering schema should be: > + > +1) Specified by a `hook.order` config, so that users will not unexpectedly see > +their order change; > + > +2) Either dependency or numerically based. > + > +Dependency-based ordering is prone to classic linked-list problems, like a > +cycles and handling of missing dependencies. But, it paves the way for enabling > +parallelization if some tasks truly depend on others. > + > +Numerical ordering makes it tricky for Git to generate suggested ordering > +numbers for each command, but is easy to determine a definitive order. With this schema, and with the "skip" behavior described above (but not implemented in this patch set), rudimentary ordering can already be done; because a hook is removed and reinserted whenever it appears in the config, even a hook X in the system config can be made to run after a hook Y in the worktree config by adding Y then X in the worktree config, and if we want to disable X instead, we can just add "skip" to X.
On Wed, Sep 23, 2020 at 03:59:10PM -0700, Jonathan Tan wrote: > > For this review, I'll just concern myself with overall design and > structure. Thanks - the design doc is now slightly old, so it's nice to have some fresh eyes on it. > > For this patch, overall I think it's better if there's a clear > distinction between what we are implementing now and what we are > implementing later. I took a light hand when I checked for this - the topic isn't complete yet, and there's some work in the design doc which I want to include in this topic, but which hasn't been sent around (or written) yet. > > > +[[motivation]] > > +== Motivation > > + > > +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as > > +the only source of hooks to execute, in a way which is friendly to users with > > +multiple repos which have similar needs. > > I don't understand what "first-class citizen" here means - probably > better to just omit that phrase and describe the new way of doing hooks. Sure. > > > +[[config-schema-hook]] > > +==== `hook` > > + > > +Primarily contains subsections for each hook event. These order of these > > +subsections defines the hook command execution order > > The execution order is defined by the order of a multivalue config > variable, I think, not the order of subsections? Besides, I believe that > there's one subsection per hook event (e.g. hook."pre-commit"), not one > subsection per command. Ok. Have changed to "The order of variables in these subsections defines..." > > > ; hook commands can be > > +specified by setting the value directly to the command if no additional > > +configuration is needed, or by setting the value as the name of a `hookcmd`. If > > +Git does not find a `hookcmd` whose subsection matches the value of the given > > +command string, Git will try to execute the string directly. Hooks are executed > > +by passing the resolved command string to the shell. > > [snip] > > > Hook event subsections can > > +also contain per-hook-event settings. > > If this is not yet implemented, maybe list under "future work". Good idea. Done. > > > + > > +Also contains top-level hook execution settings, for example, > > +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`. (These settings are > > +described more in <<library,Library>>.) > > I think it's clearer if you list this under "future work" - I didn't see > any implementation of this. Yeah, this is out of sync with what the implementation ended up looking like; disableAll might still be a useful thing to include in the initial feature topic, so I won't remove it, but warnHookDir is not necessary. > > > +[hook "pre-commit"] > > + command = perl-linter > > + command = /usr/bin/git-secrets --pre-commit > > + > > +[hook "pre-applypatch"] > > + command = perl-linter > > + error = ignore > > Is "error" implemented? No, have marked it with a comment. > > > + > > +[hook] > > + runHookDir = interactive > > Same question for "runHookDir". It is in the reroll I'm trying to get out this week :) > > > +[[config-schema-hookcmd]] > > +==== `hookcmd` > > + > > +Defines a hook command and its attributes, which will be used when a hook event > > +occurs. Unqualified attributes are assumed to apply to this hook during all hook > > +events, but event-specific attributes can also be supplied. The example runs > > +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which > > +include this config, the hook command will be skipped for all events to which > > +it's normally subscribed _except_ `pre-commit`. > > + > > +---- > > +[hookcmd "perl-linter"] > > + command = /usr/bin/lint-it --language=perl > > + skip = true > > + pre-commit-skip = false > > +---- > > And the skips. (And several more below which I will skip.) Again, this is in the reroll I'm working on. > > > +If the caller wants to do something more complicated, the hook library can also > > +provide a callback API: > > + > > +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`* > > Is there a use case that would need such a function? I'm not sure yet - but I'm not quite ready to cut it from the design doc, until I finish migrating the existing hooks and know that it's not needed. At that point it'll make sense to move it into the future work section. > > > +[[migration]] > > +=== Migration path > > + > > +[[stage-0]] > > +==== Stage 0 > > + > > +Hooks are called by running `run-command.h:find_hook()` with the hookname and > > +executing the result. The hook library and builtin do not exist. Hooks only > > +exist as specially named scripts within `.git/hooks/`. > > + > > +[[stage-1]] > > +==== Stage 1 > > + > > +`git hook list --porcelain <hook-event>` is implemented. Users can replace their > > +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s > > +output. Modifier commands like `git hook add` and `git hook edit` can be > > +implemented around this time as well. > > This seems to contradict patch 8, which teaches Git to use the configs > directly without any change to .git/hooks/<hook-event> (at least for > certain commit-related hooks). Yeah, I think this needs to be rephrased; at this point locally I've completely removed the --porcelain patch - I'm pretty sure it needs to be a format string instead. > > > +[[future-work]] > > +== Future work > > + > > +[[execution-ordering]] > > +=== Execution ordering > > + > > +We may find that config order is insufficient for some users; for example, > > +config order makes it difficult to add a new hook to the system or global config > > +which runs at the end of the hook list. A new ordering schema should be: > > + > > +1) Specified by a `hook.order` config, so that users will not unexpectedly see > > +their order change; > > + > > +2) Either dependency or numerically based. > > + > > +Dependency-based ordering is prone to classic linked-list problems, like a > > +cycles and handling of missing dependencies. But, it paves the way for enabling > > +parallelization if some tasks truly depend on others. > > + > > +Numerical ordering makes it tricky for Git to generate suggested ordering > > +numbers for each command, but is easy to determine a definitive order. > > With this schema, and with the "skip" behavior described above (but not > implemented in this patch set), rudimentary ordering can already be > done; because a hook is removed and reinserted whenever it appears in > the config, even a hook X in the system config can be made to run after a > hook Y in the worktree config by adding Y then X in the worktree config, > and if we want to disable X instead, we can just add "skip" to X. Yep, that's why reordering is in the future work section :) The problem with config ordering is like such: if I want everyone in my enterprise to run 'git-secrets --prepare-commit-msg' as the very last prepare-commit-msg hook, but I can only ship them an /etc/gitconfig, then the best I can do is email my users and ask them to run 'git config hook.prepare-commit-msg.command git-secrets-prepare-commit-msg' in every new repo and include a 'hookcmd.git-secrets-prepare-commit-msg.command' config in the /etc/gitconfig I ship. (I mention git-secrets here because it's possible other hooks could have introduced credential secrets into my user's commit message after git-secrets already ran.) - Emily
On Wed, Sep 09 2020, Emily Shaffer wrote: First, thanks a lot for working on this. As you may have found I've done some small amount of actual work in this area before, but mostly just blathered about it on the ML. > Begin a design document for config-based hooks, managed via git-hook. > Focus on an overview of the implementation and motivation for design > decisions. Briefly discuss the alternatives considered before this > point. Also, attempt to redefine terms to fit into a multihook world. > [...] > +[[status-quo]] > +=== Status quo > + > +Today users can implement multihooks themselves by using a "trampoline script" > +as their hook, and pointing that script to a directory or list of other scripts > +they wish to run. ...or by setting core.hooksPath in their local/global/system config. Granted it doesn't cover the malicious hook injection case you're also trying to solve, but does address e.g. having a git server with a lot of centralized hooks. The "trampoline script" also isn't needed for the common case you mention, you just symlink the .git/hooks directory (as e.g. GitLab does). People usually use a trampoline script for e.g. using GNU parallel or something to execute N hooks. > +[[hook-directories]] > +=== Hook directories > + > +Other contributors have suggested Git learn about the existence of a directory > +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order. ...which seems like an easy thing to add later by having a "hookdir" in addition to "hookcmd", i.e. just specify a glob there instead of a cmd/path. You already use "hookdir" for something else though, so that's a bit confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or perhaps more confusing... > [...] > +[[execution-ordering]] > +=== Execution ordering > + > +We may find that config order is insufficient for some users; for example, > +config order makes it difficult to add a new hook to the system or global config > +which runs at the end of the hook list. A new ordering schema should be: > + > +1) Specified by a `hook.order` config, so that users will not unexpectedly see > +their order change; > + > +2) Either dependency or numerically based. > + > +Dependency-based ordering is prone to classic linked-list problems, like a > +cycles and handling of missing dependencies. But, it paves the way for enabling > +parallelization if some tasks truly depend on others. > > +Numerical ordering makes it tricky for Git to generate suggested ordering > +numbers for each command, but is easy to determine a definitive order. > + > +[[parallelization]] > +=== Parallelization > + > +Users with many hooks might want to run them simultaneously, if the hooks don't > +modify state; if one hook depends on another's output, then users will want to > +specify those dependencies. If we decide to solve this problem, we may want to > +look to modern build systems for inspiration on how to manage dependencies and > +parallel tasks. If you're taking requests it would make me very happy if we had parallelism in this from day one. It's the kind of thing that's hard to do by default once a feature is shipped since people will implicitly depend on it not being there, i.e. we won't know what we're breaking. I think doing it this way is simple, covers most use cases, and solves a lot of the problems you note: 1. Don't use config order to execute hooks, use glob'd name order regardless of origin. I.e. a system-level hook is called "001-first" is executed before a local hook called "999-at-the-end" (or the other way around, i.e. hook origin doesn't matter). 2. We execute hooks parallel in that glob order, i.e. a pthread for-loop that starts the 001-first task first, eventually getting to 999-at-the-end N at a time. I.e. the same as: parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order> This allows for parallelism but guarantees the very useful case of having a global log hook being guaranteed to execute. 3. A hook can define "parallel=no" in its config. We'll then run it while no other hook is running. 4. We don't attempt to do dependencies etc, if you need that sort of complexity you can just make one of the hooks be a hook runner as users do now for the common "make it parallel" case. It's a relatively small change to the code you have already. I.e. the for_each() in run_hooks() would be called N times for each continuous glob'd parallel/non-parallel segment, and hook_list()'s config parsing would learn to spew those out as a list-of-lists. This also gives you a rudimentary implementation of the dependency schema you proposed for free. I.e. a definition of (pseudocode): hookcmd=000-first parallel=no hookcmd=250-middle-abc hookcmd=250-middle-xyz hookcmd=300-gather parallel=no hookcmd=999-the-end Would result in the pseudocode execution of; segments=[[000-first], [250-middle-abc, 250-middle-xyz], [300-gather], [999-the-end]] for each s in segments: ok = run_in_parallel(s) last if !ok # or, depending on "early exit?" config I.e.: * The common case of people adding N hooks won't take sum(N) time. * parallel=no hooks aren't run in parallel with other non-parallel hooks * We support a rudimentary dependency schema as a side-effect, i.e. defining 300-gather as non-parallel allows it to act as the sole "reduce" step in a map/reduce in a "map" step started with the 250-* hooks. > +[[securing-hookdir-hooks]] > +=== Securing hookdir hooks > + > +With the design as written in this doc, it's still possible for a malicious user > +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then > +zip their repo and send it to another user. It may be necessary to teach Git to > +only allow inlined hooks like this if they were configured outside of the local > +scope (in other words, only run hookcmds, and only allow hookcmds to be > +configured in global or system scope); or another approach, like a list of safe > +projects, might be useful. It may also be sufficient (or at least useful) to > +teach a `hook.disableAll` config or similar flag to the Git executable. I think this part of the doc should note a bit of the context in https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/ I.e. even if we get a 100% secure hook implementation we've done practically nothing for overall security, since we'll still run the pager, aliases etc. from that local repo. This is a great step in the right direction, but it behooves us to note that, so some user reading this documentation without context doesn't think inspecting untrusted repositories like that is safe just because they set the right hook settings in their config (once what's being proposed here is implemented).
On Wed, Oct 07, 2020 at 11:23:10AM +0200, Ævar Arnfjörð Bjarmason wrote: > > > On Wed, Sep 09 2020, Emily Shaffer wrote: > > First, thanks a lot for working on this. As you may have found I've done > some small amount of actual work in this area before, but mostly just > blathered about it on the ML. > > > Begin a design document for config-based hooks, managed via git-hook. > > Focus on an overview of the implementation and motivation for design > > decisions. Briefly discuss the alternatives considered before this > > point. Also, attempt to redefine terms to fit into a multihook world. > > [...] > > +[[status-quo]] > > +=== Status quo > > + > > +Today users can implement multihooks themselves by using a "trampoline script" > > +as their hook, and pointing that script to a directory or list of other scripts > > +they wish to run. > > ...or by setting core.hooksPath in their local/global/system > config. Granted it doesn't cover the malicious hook injection case > you're also trying to solve, but does address e.g. having a git server > with a lot of centralized hooks. Aha, setting core.hooksPath in the global/system config had not occurred to me. > > The "trampoline script" also isn't needed for the common case you > mention, you just symlink the .git/hooks directory (as e.g. GitLab > does). People usually use a trampoline script for e.g. using GNU > parallel or something to execute N hooks. Hm, I don't think that's quite true. Symlinking out .git/hooks doesn't give me more than one $HOOKDIR/pre-commit - it just gives me a different one. So if I wanted to run three different hooks, $HOOKDIR/pre-commit would need to do the work of all three, regardless of where $HOOKDIR points. That's what I meant when I said "multihooks" in this section. But I think what you're trying to say is this: the "status quo" section doesn't fully cover the status quo. There are more tricks than I mentioned, e.g. 'git config --global core.hooksPath /home/emily/githook/' to get the same set of hooks to run everywhere. This approach still has some drawbacks - for example, it doesn't allow me to use language-specific linters if I have repos in various languages, without exempting an individual repo from the ~/githook/ by 'git config --local core.hooksPath /home/emily/my-python-thing/.git/hook'. It looks like, then, the "status quo" section needs some rework for the next iteration. > > > > +[[hook-directories]] > > +=== Hook directories > > + > > +Other contributors have suggested Git learn about the existence of a directory > > +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order. > > ...which seems like an easy thing to add later by having a "hookdir" in > addition to "hookcmd", i.e. just specify a glob there instead of a > cmd/path. Hum, interesting! Something like so: [hook.pre-commit] command = last-minute-checks [hookdir.last-minute-checks] dir = /home/emily/last-minute-checks/* And then the hooks library knows to go and run everything in ~/last-minute-checks/. This is easier to keep fresh than: [hook.pre-commit] command = /home/emily/last-minute-checks/c-linter command = /home/emily/last-minute-checks/check-for-debug-prints command = /home/emily/last-minute-checks/check-for-notes ... I actually like the idea of this for folks who might have a small number of hooks they wrote for themselves. I wonder if it's applicable for something like git-secrets, which presumably users would grab with a 'git clone' later. It doesn't seem at odds with the rest of the design - how would you feel about me adding it to the "future work" section at the end? Future work, rather than "Emily will do this in the next couple of rounds", because: - I think nobody already has their hooks in $HOOKDIR/hook/pre-commit.d without a corresponding trampoline in $HOOKDIR/hook/pre-commit; so they could still call that trampoline, for now - I think it might be prone to some bikeshedding - e.g. should we recurse into ~/last-minute-checks/linters/c/? how far? what if some script requires magic options? etc? But as I'm typing those questions out they sound mostly trivial or ridiculous, so maybe my assessment is wrong here. - It sounds like you might be keen to write it, or at the very least, more keen than me - Practically speaking, I am not sure I have time to do it alongside the rest of the series. Again, my bikeshedding assessment could be wrong, and this extra feature could be totally trivial. > You already use "hookdir" for something else though, so that's a bit > confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or > perhaps more confusing... "Hookdir" might be the wrong word to use, too - maybe it's better to mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are similar enough that I think it would be confusing, and "hookcmd" is already getting some side-eye from me for not being a great choice. Some thoughts for "a path to a directory in which multiple scripts for a single hook live": - hookset - hookbatch (ugh, redundant with MS scripting) - hook.pre-commit.all-of = ~/last-minute-checks/ - " " .everything-in = " " ...? I think I named a couple silly ideas for "hookcmd" in another mail. > > > [...] > > +[[execution-ordering]] > > +=== Execution ordering > > + > > +We may find that config order is insufficient for some users; for example, > > +config order makes it difficult to add a new hook to the system or global config > > +which runs at the end of the hook list. A new ordering schema should be: > > + > > +1) Specified by a `hook.order` config, so that users will not unexpectedly see > > +their order change; > > + > > +2) Either dependency or numerically based. > > + > > +Dependency-based ordering is prone to classic linked-list problems, like a > > +cycles and handling of missing dependencies. But, it paves the way for enabling > > +parallelization if some tasks truly depend on others. > > > > +Numerical ordering makes it tricky for Git to generate suggested ordering > > +numbers for each command, but is easy to determine a definitive order. > > + > > +[[parallelization]] > > +=== Parallelization > > + > > +Users with many hooks might want to run them simultaneously, if the hooks don't > > +modify state; if one hook depends on another's output, then users will want to > > +specify those dependencies. If we decide to solve this problem, we may want to > > +look to modern build systems for inspiration on how to manage dependencies and > > +parallel tasks. > > If you're taking requests it would make me very happy if we had > parallelism in this from day one. It's the kind of thing that's hard to > do by default once a feature is shipped since people will implicitly > depend on it not being there, i.e. we won't know what we're breaking. Hm. This might be tricky. Some hooks are inherently not able to be parallelized - for example, hooks which modify a given file, like the commit message draft. In general, based on the handful of hooks I've converted locally, it's hard to check whether a callsite assumes a hook could have modified state. Usually this seems to be done with a call to find_hook() ("was there a hook that might have run?") and then reopening the file. Sometimes a file is reopened unconditionally. Sometimes the find_hook() call is very far away from the run_hook_le() call. The rest, then, which only read a file and say yes or no, probably don't need to have a strict ordering - at least as far as Git is concerned. And I think that's what you're worried about: [hook.theoretical-parallelizable-event] command = check-and-mark-a-file-foo command = check-file-foo-and-do-something-else command = do-something-totally-unrelated On day 1 of this feature, as written, this is safe. But if we aren't careful and we start to parallelize *without* setting up dependency ordering, e.g. 'git config --global hook.parallelize', and turn that on by default without warning anyone, then the author of this config will be unhappy. But as I read further, you're talking about specifically *not* allowing dependency ordering... > > I think doing it this way is simple, covers most use cases, and solves a > lot of the problems you note: > > 1. Don't use config order to execute hooks, use glob'd name order > regardless of origin. I.e. a system-level hook is called "001-first" > is executed before a local hook called "999-at-the-end" (or the other > way around, i.e. hook origin doesn't matter). Can you say a little more about why different ordering schema would matter, if we effectively don't care which jobs are in parallel with which, as you describe? I'm not quite following. > > 2. We execute hooks parallel in that glob order, i.e. a pthread for-loop > that starts the 001-first task first, eventually getting to > 999-at-the-end N at a time. I.e. the same as: > > parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order> > > This allows for parallelism but guarantees the very useful case of > having a global log hook being guaranteed to execute. Ah, I think you're suggesting the glob order specifically to make up for --halt-on-error in this case. > > 3. A hook can define "parallel=no" in its config. We'll then run it > while no other hook is running. > > 4. We don't attempt to do dependencies etc, if you need that sort of > complexity you can just make one of the hooks be a hook runner as > users do now for the common "make it parallel" case. If we aren't attempting any magical ordering, then I don't really see a big difference between glob vs. config order - presumably for most users the effect would be same, e.g. N = $(nproc * hyperthreading), M = (number of scripts I care to run) probably will often result in M < N, so all jobs would run simultaneously anyways. > > It's a relatively small change to the code you have already. I.e. the > for_each() in run_hooks() would be called N times for each continuous > glob'd parallel/non-parallel segment, and hook_list()'s config parsing > would learn to spew those out as a list-of-lists. > > This also gives you a rudimentary implementation of the dependency > schema you proposed for free. I.e. a definition of (pseudocode): > > hookcmd=000-first > parallel=no > > hookcmd=250-middle-abc > hookcmd=250-middle-xyz > > hookcmd=300-gather > parallel=no > > hookcmd=999-the-end > > Would result in the pseudocode execution of; > > segments=[[000-first], > [250-middle-abc, 250-middle-xyz], Hum. This seems to say "folks who started their hooks with the same number agree that their hooks should also run simultaneously" - which sounds like an even harder problem than "how do I know my ordering number isn't the same as someone else's in another config file". Or else I'm misunderstanding your pseudo :) Ah, I see later you mention it directly as a dependency schema. I think this offers the same set of problems I saw trying to use this as an ordering schema, but worse in all the usual ways parallelism provides. It is still impossible for someone writing a global or system config to know where in the dependency chain more local hooks reside. > [300-gather], > [999-the-end]] > for each s in segments: > ok = run_in_parallel(s) > last if !ok # or, depending on "early exit?" config > > I.e.: > > * The common case of people adding N hooks won't take sum(N) time. > > * parallel=no hooks aren't run in parallel with other non-parallel > hooks > > * We support a rudimentary dependency schema as a side-effect, > i.e. defining 300-gather as non-parallel allows it to act as the sole > "reduce" step in a map/reduce in a "map" step started with the 250-* > hooks. As I understand it, the main concerns you have about getting parallelization to happen on day 1 are like so: - keep users from assuming serial execution - avoid a messy schema change to deal with dependencies I see the benefit of the former; I don't like the new schema proposed by the latter. I do see that not turning it on day 1 would prevent us from turning it on by default later, in case users did something silly like assume dependencies. Hrm. I think we could turn on parallelization day 1 by providing an explicitly-parallel API in hook.h (and a similar 'git hook run foo --parallel' flag), and being more careful when converting hooks to call run_hooks_parallel() instead of run_hooks(). That way hooks which will never be parallelizable (e.g. commit-msg) won't get burned later by us trying to be clever. Everyone else who can be parallelized is, in config order, with no dependency management whatsoever. That leaves the door open for us to add dependency management however we want later on, but users can still roll their own with a launcher script today. I know I rambled a lot - I was trying to convince myself :) For now, I'd prefer to add more detail to the "future work" section of the doc and then not touch this problem with a very long pole... ;) Thoughts welcome. > > > +[[securing-hookdir-hooks]] > > +=== Securing hookdir hooks > > + > > +With the design as written in this doc, it's still possible for a malicious user > > +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then > > +zip their repo and send it to another user. It may be necessary to teach Git to > > +only allow inlined hooks like this if they were configured outside of the local > > +scope (in other words, only run hookcmds, and only allow hookcmds to be > > +configured in global or system scope); or another approach, like a list of safe > > +projects, might be useful. It may also be sufficient (or at least useful) to > > +teach a `hook.disableAll` config or similar flag to the Git executable. > > I think this part of the doc should note a bit of the context in > https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/ > > I.e. even if we get a 100% secure hook implementation we've done > practically nothing for overall security, since we'll still run the > pager, aliases etc. from that local repo. > > This is a great step in the right direction, but it behooves us to note > that, so some user reading this documentation without context doesn't > think inspecting untrusted repositories like that is safe just because > they set the right hook settings in their config (once what's being > proposed here is implemented). Yeah, I agree. I'll try to make that clearer in the doc in the next reroll. Very sorry again for having missed this - I think the first weeks of October I was working from my local todo list instead of from the list of replies in mutt. Urk. - Emily
On Thu, Oct 22 2020, Emily Shaffer wrote: > On Wed, Oct 07, 2020 at 11:23:10AM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> >> On Wed, Sep 09 2020, Emily Shaffer wrote: >> >> First, thanks a lot for working on this. As you may have found I've done >> some small amount of actual work in this area before, but mostly just >> blathered about it on the ML. >> >> > Begin a design document for config-based hooks, managed via git-hook. >> > Focus on an overview of the implementation and motivation for design >> > decisions. Briefly discuss the alternatives considered before this >> > point. Also, attempt to redefine terms to fit into a multihook world. >> > [...] >> > +[[status-quo]] >> > +=== Status quo >> > + >> > +Today users can implement multihooks themselves by using a "trampoline script" >> > +as their hook, and pointing that script to a directory or list of other scripts >> > +they wish to run. >> >> ...or by setting core.hooksPath in their local/global/system >> config. Granted it doesn't cover the malicious hook injection case >> you're also trying to solve, but does address e.g. having a git server >> with a lot of centralized hooks. > > Aha, setting core.hooksPath in the global/system config had not occurred > to me. It's a useful hack. >> >> The "trampoline script" also isn't needed for the common case you >> mention, you just symlink the .git/hooks directory (as e.g. GitLab >> does). People usually use a trampoline script for e.g. using GNU >> parallel or something to execute N hooks. > > Hm, I don't think that's quite true. Symlinking out .git/hooks doesn't > give me more than one $HOOKDIR/pre-commit - it just gives me a different > one. So if I wanted to run three different hooks, $HOOKDIR/pre-commit > would need to do the work of all three, regardless of where $HOOKDIR > points. That's what I meant when I said "multihooks" in this section. > > But I think what you're trying to say is this: the "status quo" section > doesn't fully cover the status quo. There are more tricks than I > mentioned, e.g. 'git config --global core.hooksPath > /home/emily/githook/' to get the same set of hooks to run everywhere. > This approach still has some drawbacks - for example, it doesn't allow > me to use language-specific linters if I have repos in various > languages, without exempting an individual repo from the ~/githook/ by > 'git config --local core.hooksPath > /home/emily/my-python-thing/.git/hook'. > > It looks like, then, the "status quo" section needs some rework for the > next iteration. Re-reading your original patch I think I just misread that. I thought you were saying a stub script was needed in the .git to point to a multi-hook script, but I was pointing out that you can just symlink to the multi-hook script (as e.g. GitLab does), but reading it again & this I don't thin you meant that at all. Nevermind. >> >> >> > +[[hook-directories]] >> > +=== Hook directories >> > + >> > +Other contributors have suggested Git learn about the existence of a directory >> > +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order. >> >> ...which seems like an easy thing to add later by having a "hookdir" in >> addition to "hookcmd", i.e. just specify a glob there instead of a >> cmd/path. > > Hum, interesting! Something like so: > > [hook.pre-commit] > command = last-minute-checks > > [hookdir.last-minute-checks] > dir = /home/emily/last-minute-checks/* > > And then the hooks library knows to go and run everything in > ~/last-minute-checks/. This is easier to keep fresh than: > > [hook.pre-commit] > command = /home/emily/last-minute-checks/c-linter > command = /home/emily/last-minute-checks/check-for-debug-prints > command = /home/emily/last-minute-checks/check-for-notes > ... > > I actually like the idea of this for folks who might have a small number > of hooks they wrote for themselves. I wonder if it's applicable for > something like git-secrets, which presumably users would grab with a > 'git clone' later. > > It doesn't seem at odds with the rest of the design - how would you feel > about me adding it to the "future work" section at the end? Future work, > rather than "Emily will do this in the next couple of rounds", because: > - I think nobody already has their hooks in $HOOKDIR/hook/pre-commit.d > without a corresponding trampoline in $HOOKDIR/hook/pre-commit; so > they could still call that trampoline, for now > - I think it might be prone to some bikeshedding - e.g. should we > recurse into ~/last-minute-checks/linters/c/? how far? what if some > script requires magic options? etc? But as I'm typing those questions > out they sound mostly trivial or ridiculous, so maybe my assessment > is wrong here. > - It sounds like you might be keen to write it, or at the very least, > more keen than me > - Practically speaking, I am not sure I have time to do it alongside > the rest of the series. Again, my bikeshedding assessment could be > wrong, and this extra feature could be totally trivial. > >> You already use "hookdir" for something else though, so that's a bit >> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or >> perhaps more confusing... > > "Hookdir" might be the wrong word to use, too - maybe it's better to > mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are > similar enough that I think it would be confusing, and "hookcmd" is > already getting some side-eye from me for not being a great choice. > > Some thoughts for "a path to a directory in which multiple scripts for a > single hook live": > - hookset > - hookbatch (ugh, redundant with MS scripting) > - hook.pre-commit.all-of = ~/last-minute-checks/ > - " " .everything-in = " " > ...? > > I think I named a couple silly ideas for "hookcmd" in another mail. To both of the above: Yeah I'm not saying you need to do the work, just that I think it would be a useful case to bikeshed now since it seems inevitable that we'll get a "find hooks in this dir by glob" once we have this facility. So having a config syntax for that which isn't overly confusing / extensible to that case would be useful, i.e. as the current syntax uses "dir" already. >> >> > [...] >> > +[[execution-ordering]] >> > +=== Execution ordering >> > + >> > +We may find that config order is insufficient for some users; for example, >> > +config order makes it difficult to add a new hook to the system or global config >> > +which runs at the end of the hook list. A new ordering schema should be: >> > + >> > +1) Specified by a `hook.order` config, so that users will not unexpectedly see >> > +their order change; >> > + >> > +2) Either dependency or numerically based. >> > + >> > +Dependency-based ordering is prone to classic linked-list problems, like a >> > +cycles and handling of missing dependencies. But, it paves the way for enabling >> > +parallelization if some tasks truly depend on others. >> > >> > +Numerical ordering makes it tricky for Git to generate suggested ordering >> > +numbers for each command, but is easy to determine a definitive order. >> > + >> > +[[parallelization]] >> > +=== Parallelization >> > + >> > +Users with many hooks might want to run them simultaneously, if the hooks don't >> > +modify state; if one hook depends on another's output, then users will want to >> > +specify those dependencies. If we decide to solve this problem, we may want to >> > +look to modern build systems for inspiration on how to manage dependencies and >> > +parallel tasks. >> >> If you're taking requests it would make me very happy if we had >> parallelism in this from day one. It's the kind of thing that's hard to >> do by default once a feature is shipped since people will implicitly >> depend on it not being there, i.e. we won't know what we're breaking. > > Hm. This might be tricky. > > Some hooks are inherently not able to be parallelized - for example, > hooks which modify a given file, like the commit message draft. In > general, based on the handful of hooks I've converted locally, it's hard > to check whether a callsite assumes a hook could have modified state. > Usually this seems to be done with a call to find_hook() ("was there a > hook that might have run?") and then reopening the file. Sometimes a > file is reopened unconditionally. Sometimes the find_hook() call is > very far away from the run_hook_le() call. > > The rest, then, which only read a file and say yes or no, probably don't > need to have a strict ordering - at least as far as Git is concerned. > And I think that's what you're worried about: > > [hook.theoretical-parallelizable-event] > command = check-and-mark-a-file-foo > command = check-file-foo-and-do-something-else > command = do-something-totally-unrelated > > On day 1 of this feature, as written, this is safe. But if we aren't > careful and we start to parallelize *without* setting up dependency > ordering, e.g. 'git config --global hook.parallelize', and turn that on > by default without warning anyone, then the author of this config will > be unhappy. > > But as I read further, you're talking about specifically *not* allowing > dependency ordering... > >> >> I think doing it this way is simple, covers most use cases, and solves a >> lot of the problems you note: >> >> 1. Don't use config order to execute hooks, use glob'd name order >> regardless of origin. I.e. a system-level hook is called "001-first" >> is executed before a local hook called "999-at-the-end" (or the other >> way around, i.e. hook origin doesn't matter). > > Can you say a little more about why different ordering schema would > matter, if we effectively don't care which jobs are in parallel with > which, as you describe? I'm not quite following. > >> >> 2. We execute hooks parallel in that glob order, i.e. a pthread for-loop >> that starts the 001-first task first, eventually getting to >> 999-at-the-end N at a time. I.e. the same as: >> >> parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order> >> >> This allows for parallelism but guarantees the very useful case of >> having a global log hook being guaranteed to execute. > > Ah, I think you're suggesting the glob order specifically to make up for > --halt-on-error in this case. > >> >> 3. A hook can define "parallel=no" in its config. We'll then run it >> while no other hook is running. >> >> 4. We don't attempt to do dependencies etc, if you need that sort of >> complexity you can just make one of the hooks be a hook runner as >> users do now for the common "make it parallel" case. > > If we aren't attempting any magical ordering, then I don't really see a > big difference between glob vs. config order - presumably for most users > the effect would be same, e.g. N = $(nproc * hyperthreading), M = (number of scripts I > care to run) probably will often result in M < N, so all jobs would run > simultaneously anyways. > >> >> It's a relatively small change to the code you have already. I.e. the >> for_each() in run_hooks() would be called N times for each continuous >> glob'd parallel/non-parallel segment, and hook_list()'s config parsing >> would learn to spew those out as a list-of-lists. >> >> This also gives you a rudimentary implementation of the dependency >> schema you proposed for free. I.e. a definition of (pseudocode): >> >> hookcmd=000-first >> parallel=no >> >> hookcmd=250-middle-abc >> hookcmd=250-middle-xyz >> >> hookcmd=300-gather >> parallel=no >> >> hookcmd=999-the-end >> >> Would result in the pseudocode execution of; >> >> segments=[[000-first], >> [250-middle-abc, 250-middle-xyz], > > Hum. This seems to say "folks who started their hooks with the same > number agree that their hooks should also run simultaneously" - which > sounds like an even harder problem than "how do I know my ordering > number isn't the same as someone else's in another config file". Or else > I'm misunderstanding your pseudo :) The prefix number isn't meaningful in that way, i.e. if you have 10 threads and 5 hooks starting with 250-* they won't all be invoked at the same time. > Ah, I see later you mention it directly as a dependency schema. I think > this offers the same set of problems I saw trying to use this as an > ordering schema, but worse in all the usual ways parallelism provides. > It is still impossible for someone writing a global or system config to > know where in the dependency chain more local hooks reside. > >> [300-gather], >> [999-the-end]] >> for each s in segments: >> ok = run_in_parallel(s) >> last if !ok # or, depending on "early exit?" config >> >> I.e.: >> >> * The common case of people adding N hooks won't take sum(N) time. >> >> * parallel=no hooks aren't run in parallel with other non-parallel >> hooks >> >> * We support a rudimentary dependency schema as a side-effect, >> i.e. defining 300-gather as non-parallel allows it to act as the sole >> "reduce" step in a map/reduce in a "map" step started with the 250-* >> hooks. > > As I understand it, the main concerns you have about getting > parallelization to happen on day 1 are like so: > > - keep users from assuming serial execution > - avoid a messy schema change to deal with dependencies > > I see the benefit of the former; I don't like the new schema proposed by > the latter. I do see that not turning it on day 1 would prevent us from > turning it on by default later, in case users did something silly like > assume dependencies. > > Hrm. > > I think we could turn on parallelization day 1 by providing an > explicitly-parallel API in hook.h (and a similar 'git hook run foo > --parallel' flag), and being more careful when converting hooks to call > run_hooks_parallel() instead of run_hooks(). That way hooks which will > never be parallelizable (e.g. commit-msg) won't get burned later by us > trying to be clever. Everyone else who can be parallelized is, in config > order, with no dependency management whatsoever. That leaves the door > open for us to add dependency management however we want later on, but > users can still roll their own with a launcher script today. > > I know I rambled a lot - I was trying to convince myself :) For now, I'd > prefer to add more detail to the "future work" section of the doc and > then not touch this problem with a very long pole... ;) Thoughts > welcome. I'm replying to much of the above in general here, particularly since much of it was in the form of a question you answered yourself later :) Yes as you point out the reason I'm raising the parallel thing now is "keep users from assuming serial execution", i.e. any implementation that isn't like that from day 1 will need more verbose syntax to opt-in to that. I think parallel is the sane default, although there's a really strong case as you point out with the "commit-msg" hook for treating that on a hook-type basis. E.g. commit-msg (in-place editing of as single file) being non-parallel by default, but e.g. post-commit, pre-applypatch, pre-receive and other "should we proceed?" hooks being parallel. But I'm also raising a general concern with the design of the API / command around this. I don't see the need for having a git hook list/edit/add command at all. We should just keep this simpler and be able to point to "git config --add/--get-regexp" etc. It seems the reason to introduce this command API around it is because you're imagining that git needs to manage hooks whose relative execution order is important, and to later on once this lands aim to implement a much more complex dependency management schema. I just can't imagine a case that needs that where say those 10 hooks need to execute in exact order 1/2/3/4 where the author of that tight coupling wouldn't also desire to roll that all into one script, or at least that it's an obscure enough case that we can just say "do that". Whereas I do think "run a bunch of independent checks, if all pass proceed" is *the* common case, e.g. adding a bunch of pre-receive hooks. If we tell the user we'll treat those as independent programs we can run them in parallel. The vast majority of users will benefit from the default faster execution. The "glob order" case I mentioned is extra complexity on top of that, yes, but I think that concession is sane for the common case of "yes parallel, but I want to always run the always-exit-0 log hook". E.g. I've used this to setup a hook to run push attempts/successes in a hook framework that runs N pre-receive hooks. All that being said I'm open to being convinced, I just don't see what the target user is, and the submitted docs don't really make a case for it. I.e. there's plenty of "what" not "why would someone want this...". >> >> > +[[securing-hookdir-hooks]] >> > +=== Securing hookdir hooks >> > + >> > +With the design as written in this doc, it's still possible for a malicious user >> > +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then >> > +zip their repo and send it to another user. It may be necessary to teach Git to >> > +only allow inlined hooks like this if they were configured outside of the local >> > +scope (in other words, only run hookcmds, and only allow hookcmds to be >> > +configured in global or system scope); or another approach, like a list of safe >> > +projects, might be useful. It may also be sufficient (or at least useful) to >> > +teach a `hook.disableAll` config or similar flag to the Git executable. >> >> I think this part of the doc should note a bit of the context in >> https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/ >> >> I.e. even if we get a 100% secure hook implementation we've done >> practically nothing for overall security, since we'll still run the >> pager, aliases etc. from that local repo. >> >> This is a great step in the right direction, but it behooves us to note >> that, so some user reading this documentation without context doesn't >> think inspecting untrusted repositories like that is safe just because >> they set the right hook settings in their config (once what's being >> proposed here is implemented). > > Yeah, I agree. I'll try to make that clearer in the doc in the next > reroll. > > Very sorry again for having missed this - I think the first weeks of > October I was working from my local todo list instead of from the list > of replies in mutt. Urk. *nod*
On Fri, Oct 23, 2020 at 09:10:24PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> You already use "hookdir" for something else though, so that's a bit > >> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or > >> perhaps more confusing... > > > > "Hookdir" might be the wrong word to use, too - maybe it's better to > > mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are > > similar enough that I think it would be confusing, and "hookcmd" is > > already getting some side-eye from me for not being a great choice. > > > > Some thoughts for "a path to a directory in which multiple scripts for a > > single hook live": > > - hookset > > - hookbatch (ugh, redundant with MS scripting) > > - hook.pre-commit.all-of = ~/last-minute-checks/ > > - " " .everything-in = " " > > ...? > > > > I think I named a couple silly ideas for "hookcmd" in another mail. > > To both of the above: Yeah I'm not saying you need to do the work, just > that I think it would be a useful case to bikeshed now since it seems > inevitable that we'll get a "find hooks in this dir by glob" once we > have this facility. So having a config syntax for that which isn't > overly confusing / extensible to that case would be useful, i.e. as the > current syntax uses "dir" already. Yeah. I'm not sure that it needs to happen right away. Because hook.*.command // hookcommand.*.command gets passed right into run_command()-with-shell, it's possible for a user who's keen to also set `hook.*.command = find -type f /some/path | xargs` in the meantime. And also because it's passed right into run_command()-with-shell, it's hard to do some smart wildcarding on the .command config and try to figure out the right syntax. I'd just as soon see something explicit like the configs I mentioned above, which can be added pretty easily after the fact. I think what you're mostly saying, though, is "Leave some words for glob execution!" and that I can appreciate. > > Hum. This seems to say "folks who started their hooks with the same > > number agree that their hooks should also run simultaneously" - which > > sounds like an even harder problem than "how do I know my ordering > > number isn't the same as someone else's in another config file". Or else > > I'm misunderstanding your pseudo :) > > The prefix number isn't meaningful in that way, i.e. if you have 10 > threads and 5 hooks starting with 250-* they won't all be invoked at the > same time. Ok. I misunderstood, then. > > I know I rambled a lot - I was trying to convince myself :) For now, I'd > > prefer to add more detail to the "future work" section of the doc and > > then not touch this problem with a very long pole... ;) Thoughts > > welcome. > > I'm replying to much of the above in general here, particularly since > much of it was in the form of a question you answered yourself later :) > > Yes as you point out the reason I'm raising the parallel thing now is > "keep users from assuming serial execution", i.e. any implementation > that isn't like that from day 1 will need more verbose syntax to opt-in > to that. > > I think parallel is the sane default, although there's a really strong > case as you point out with the "commit-msg" hook for treating that on a > hook-type basis. E.g. commit-msg (in-place editing of as single file) > being non-parallel by default, but e.g. post-commit, pre-applypatch, > pre-receive and other "should we proceed?" hooks being parallel. Yeah. I think you've sold me. So what I will do is thus: before I send the next reroll (as I'm pretty much done, locally, and hope to be ready for nits next time) I'll take a look in 'git help githooks' and see which ones expect writes to occur. I think there are more than just "commit-msg". I'll add a bit to run_hooks() and a corresponding flag to 'git hook run', plus relevant documentation. I'll also plan to add explicit documentation to 'git help githooks' mentioning parallel vs. serial execution. But I will plan on writing it stupidly - user configurable job number but no dependency checking; and let the user turn off parallel execution for everyone (hook.jobs=1) or for just one hook (hook.pre-commit.parallel = false (?)). Like you and Jonathan N say, we can add more sugar like hookcmd.*.depends later on when we need it. > > But I'm also raising a general concern with the design of the API / > command around this. > > I don't see the need for having a git hook list/edit/add command at > all. We should just keep this simpler and be able to point to "git > config --add/--get-regexp" etc. > > It seems the reason to introduce this command API around it is because > you're imagining that git needs to manage hooks whose relative execution > order is important, and to later on once this lands aim to implement a > much more complex dependency management schema. No, I don't think that's the reason to have list/edit/add. The reason is more for discoverability (if I 'git help git' or 'git^TAB', do I see something handy in the command list that I didn't know about before?) and user friendliness ("I can't remember the right config options to set this up every dang time"). And 'list', I think, is handy for giving users a dry run of what they can expect to see happen (and where to fix them, since it lists the origin). Yes, a user could put it all together from invocations of 'git config', but I personally think it's more useful for Git to tell me what Git is going to do/what Git wants than for my meat brain to try and guess :) > > I just can't imagine a case that needs that where say those 10 hooks > need to execute in exact order 1/2/3/4 where the author of that tight > coupling wouldn't also desire to roll that all into one script, or at > least that it's an obscure enough case that we can just say "do that". > > Whereas I do think "run a bunch of independent checks, if all pass > proceed" is *the* common case, e.g. adding a bunch of pre-receive > hooks. If we tell the user we'll treat those as independent programs we > can run them in parallel. The vast majority of users will benefit from > the default faster execution. > > The "glob order" case I mentioned is extra complexity on top of that, > yes, but I think that concession is sane for the common case of "yes > parallel, but I want to always run the always-exit-0 log > hook". E.g. I've used this to setup a hook to run push > attempts/successes in a hook framework that runs N pre-receive hooks. Reading this, I think I'm still missing something key about what you think glob ordering provides. I'm not following why having the log hook set early requires glob ordering over config ordering (since the config ordering schema allows reordering via replacement), and I'm not following why it's required to halt on failure. > > All that being said I'm open to being convinced, I just don't see what > the target user is, and the submitted docs don't really make a case for > it. I.e. there's plenty of "what" not "why would someone want this...". ACK. I'll try and go over the doc again before I reroll. - Emily
On Thu, Oct 29 2020, Emily Shaffer wrote: > On Fri, Oct 23, 2020 at 09:10:24PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> >> You already use "hookdir" for something else though, so that's a bit >> >> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or >> >> perhaps more confusing... >> > >> > "Hookdir" might be the wrong word to use, too - maybe it's better to >> > mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are >> > similar enough that I think it would be confusing, and "hookcmd" is >> > already getting some side-eye from me for not being a great choice. >> > >> > Some thoughts for "a path to a directory in which multiple scripts for a >> > single hook live": >> > - hookset >> > - hookbatch (ugh, redundant with MS scripting) >> > - hook.pre-commit.all-of = ~/last-minute-checks/ >> > - " " .everything-in = " " >> > ...? >> > >> > I think I named a couple silly ideas for "hookcmd" in another mail. >> >> To both of the above: Yeah I'm not saying you need to do the work, just >> that I think it would be a useful case to bikeshed now since it seems >> inevitable that we'll get a "find hooks in this dir by glob" once we >> have this facility. So having a config syntax for that which isn't >> overly confusing / extensible to that case would be useful, i.e. as the >> current syntax uses "dir" already. > > Yeah. I'm not sure that it needs to happen right away. Because > hook.*.command // hookcommand.*.command gets passed right into > run_command()-with-shell, it's possible for a user who's keen to also > set `hook.*.command = find -type f /some/path | xargs` in the meantime. > And also because it's passed right into run_command()-with-shell, it's > hard to do some smart wildcarding on the .command config and try to > figure out the right syntax. I'd just as soon see something explicit > like the configs I mentioned above, which can be added pretty easily > after the fact. I think what you're mostly saying, though, is "Leave > some words for glob execution!" and that I can appreciate. Yeah, or rather, just now in config key naming think about if the key naming makes sense if it's expanded to support such glob inclusion, which seems like a desired addition. But I won't belabor that point. Just one thing to add: We don't really need to come up with a syntax & semantics for glob inclusion special to this, we'd use the sort of glob patterns "Conditional includes" use, as documented in git-config(1). >> > Hum. This seems to say "folks who started their hooks with the same >> > number agree that their hooks should also run simultaneously" - which >> > sounds like an even harder problem than "how do I know my ordering >> > number isn't the same as someone else's in another config file". Or else >> > I'm misunderstanding your pseudo :) >> >> The prefix number isn't meaningful in that way, i.e. if you have 10 >> threads and 5 hooks starting with 250-* they won't all be invoked at the >> same time. > > Ok. I misunderstood, then. > >> > I know I rambled a lot - I was trying to convince myself :) For now, I'd >> > prefer to add more detail to the "future work" section of the doc and >> > then not touch this problem with a very long pole... ;) Thoughts >> > welcome. >> >> I'm replying to much of the above in general here, particularly since >> much of it was in the form of a question you answered yourself later :) >> >> Yes as you point out the reason I'm raising the parallel thing now is >> "keep users from assuming serial execution", i.e. any implementation >> that isn't like that from day 1 will need more verbose syntax to opt-in >> to that. >> >> I think parallel is the sane default, although there's a really strong >> case as you point out with the "commit-msg" hook for treating that on a >> hook-type basis. E.g. commit-msg (in-place editing of as single file) >> being non-parallel by default, but e.g. post-commit, pre-applypatch, >> pre-receive and other "should we proceed?" hooks being parallel. > > Yeah. I think you've sold me. So what I will do is thus: before I send > the next reroll (as I'm pretty much done, locally, and hope to be ready > for nits next time) I'll take a look in 'git help githooks' and see > which ones expect writes to occur. I think there are more than just > "commit-msg". I'll add a bit to run_hooks() and a corresponding flag to > 'git hook run', plus relevant documentation. I'll also plan to add > explicit documentation to 'git help githooks' mentioning parallel vs. > serial execution. Sounds good. > But I will plan on writing it stupidly - user configurable job number > but no dependency checking; and let the user turn off parallel execution > for everyone (hook.jobs=1) or for just one hook > (hook.pre-commit.parallel = false (?)). Like you and Jonathan N say, we > can add more sugar like hookcmd.*.depends later on when we need it. Yeah, that sounds great. As long as there's parallelism that stuff can always be tweaked later. >> >> But I'm also raising a general concern with the design of the API / >> command around this. >> >> I don't see the need for having a git hook list/edit/add command at >> all. We should just keep this simpler and be able to point to "git >> config --add/--get-regexp" etc. >> >> It seems the reason to introduce this command API around it is because >> you're imagining that git needs to manage hooks whose relative execution >> order is important, and to later on once this lands aim to implement a >> much more complex dependency management schema. > > No, I don't think that's the reason to have list/edit/add. The reason is > more for discoverability (if I 'git help git' or 'git^TAB', do I see > something handy in the command list that I didn't know about before?) > and user friendliness ("I can't remember the right config options to set > this up every dang time"). And 'list', I think, is handy for giving > users a dry run of what they can expect to see happen (and where to fix > them, since it lists the origin). Yes, a user could put it all together > from invocations of 'git config', but I personally think it's more > useful for Git to tell me what Git is going to do/what Git wants than > for my meat brain to try and guess :) Okey, that makes sense & I've got nothing against that, just clarifying since it *looked* like it was the first step in some future addition of complexity around this. It would be nice if the docs for the new command were modified to state that clearly, even to the point of saying "this is really just sugar for this similar git-config invocation". >> >> I just can't imagine a case that needs that where say those 10 hooks >> need to execute in exact order 1/2/3/4 where the author of that tight >> coupling wouldn't also desire to roll that all into one script, or at >> least that it's an obscure enough case that we can just say "do that". >> >> Whereas I do think "run a bunch of independent checks, if all pass >> proceed" is *the* common case, e.g. adding a bunch of pre-receive >> hooks. If we tell the user we'll treat those as independent programs we >> can run them in parallel. The vast majority of users will benefit from >> the default faster execution. >> >> The "glob order" case I mentioned is extra complexity on top of that, >> yes, but I think that concession is sane for the common case of "yes >> parallel, but I want to always run the always-exit-0 log >> hook". E.g. I've used this to setup a hook to run push >> attempts/successes in a hook framework that runs N pre-receive hooks. > > Reading this, I think I'm still missing something key about what you > think glob ordering provides. For context, I feel strongly that we should do parallel by default for implementing something like this, it's great that per the above discussion you're open to that. This "glob ordering" is an entirely separate idea I'm not strongly advocating, there's pros & cons of doing that v.s. config ordering. * Con: less obvious than config order, you write hooks "a c b" in the config and we execute in "a b c" order. * Pro: Sidesteps the issues you noted in "Execution ordering" in the docs you're adding, i.e. now it'll be impossible to execute a repo-local hook before a system-wide one, you can override that with having a local one called "000-something". I.e. now we'd read the config in the normal config order, and thus if there's a system hook there's no way to define a local hook to run first, until we get some sort of override for that. > I'm not following why having the log hook set early requires glob > ordering over config ordering (since the config ordering schema allows > reordering via replacement) > [...] > and I'm not following why it's required to halt on failure. I realize I didn't elaborate on this, there's some past discussion[1][2] about this. I.e. when running N hooks sometimes you'd want to run them all (e.g. to send notifications), but for others such as pre-receive.d guard checks you don't have to run all N, if one check (say one checks commit format validity, another code syntax) fails you'd like to abort early. So halting on failure is just saving CPU, you might have 10 hooks that each take 1 second, no point in making the user wait on all 10 checks for 10 seconds if a failure of any fails the push. But OTOH you have other use-cases where users want to run them all (talked about in the [1][2] discussion above), so it's been anticipated as something we'd grow config for with multi-hook support. The glob ordering allows common cases for things that aren't possible with config-order with such early abort. E.g. consider a server with some common system-wide pre-receive.d hook (e.g. author e-mail envelope check), and a SOX/PCI controlled repository where some compliance thing says all push attempts must be logged. You could then do: /etc/git/hooks/pre-receive.d/email-check /path/to/repo/hooks/pre-receive.d/000-log-push-attempt-to-db /path/to/repo/hooks/pre-receive.d/some-other-check And we'd always run the 000-* hook first, whereas in the current schema you can't do that without editing the system-wide config. >> >> All that being said I'm open to being convinced, I just don't see what >> the target user is, and the submitted docs don't really make a case for >> it. I.e. there's plenty of "what" not "why would someone want this...". > > ACK. I'll try and go over the doc again before I reroll. > > - Emily 1. https://lore.kernel.org/git/87wojjsv9p.fsf@evledraar.gmail.com/ 2. https://public-inbox.org/git/CACBZZX6j6q2DUN_Z-Pnent1u714dVNPFBrL_PiEQyLmCzLUVxg@mail.gmail.com/
diff --git a/Documentation/Makefile b/Documentation/Makefile index 80d1908a44..58d6b3acbe 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -81,6 +81,7 @@ SP_ARTICLES += $(API_DOCS) TECH_DOCS += MyFirstContribution TECH_DOCS += MyFirstObjectWalk TECH_DOCS += SubmittingPatches +TECH_DOCS += technical/config-based-hooks TECH_DOCS += technical/hash-function-transition TECH_DOCS += technical/http-protocol TECH_DOCS += technical/index-format diff --git a/Documentation/technical/config-based-hooks.txt b/Documentation/technical/config-based-hooks.txt new file mode 100644 index 0000000000..c6e762b192 --- /dev/null +++ b/Documentation/technical/config-based-hooks.txt @@ -0,0 +1,354 @@ +Configuration-based hook management +=================================== +:sectanchors: + +[[motivation]] +== Motivation + +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as +the only source of hooks to execute, in a way which is friendly to users with +multiple repos which have similar needs. + +Redefine "hook" as an event rather than a single script, allowing users to +perform unrelated actions on a single event. + +Take a step closer to safety when copying zipped Git repositories from untrusted +users by making it more apparent to users which scripts will be run during +normal Git operations. + +Make it easier for users to discover Git's hook feature and automate their +workflows. + +[[user-interfaces]] +== User interfaces + +[[config-schema]] +=== Config schema + +Hooks can be introduced by editing the configuration manually. There are two new +sections added, `hook` and `hookcmd`. + +[[config-schema-hook]] +==== `hook` + +Primarily contains subsections for each hook event. These order of these +subsections defines the hook command execution order; hook commands can be +specified by setting the value directly to the command if no additional +configuration is needed, or by setting the value as the name of a `hookcmd`. If +Git does not find a `hookcmd` whose subsection matches the value of the given +command string, Git will try to execute the string directly. Hooks are executed +by passing the resolved command string to the shell. Hook event subsections can +also contain per-hook-event settings. + +Also contains top-level hook execution settings, for example, +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`. (These settings are +described more in <<library,Library>>.) + +---- +[hook "pre-commit"] + command = perl-linter + command = /usr/bin/git-secrets --pre-commit + +[hook "pre-applypatch"] + command = perl-linter + error = ignore + +[hook] + runHookDir = interactive +---- + +[[config-schema-hookcmd]] +==== `hookcmd` + +Defines a hook command and its attributes, which will be used when a hook event +occurs. Unqualified attributes are assumed to apply to this hook during all hook +events, but event-specific attributes can also be supplied. The example runs +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which +include this config, the hook command will be skipped for all events to which +it's normally subscribed _except_ `pre-commit`. + +---- +[hookcmd "perl-linter"] + command = /usr/bin/lint-it --language=perl + skip = true + pre-commit-skip = false +---- + +[[command-line-api]] +=== Command-line API + +Users should be able to view, reorder, and create hook commands via the command +line. External tools should be able to view a list of hooks in the correct order +to run. + +*`git hook list <hook-event>`* + +*`git hook list (--system|--global|--local|--worktree)`* + +*`git hook edit <hook-event>`* + +*`git hook add <hook-command> <hook-event> <options...>`* + +[[hook-editor]] +=== Hook editor + +The tool which is presented by `git hook edit <hook-command>`. Ideally, this +tool should be easier to use than manually editing the config, and then produce +a concise config afterwards. It may take a form similar to `git rebase +--interactive`. + +[[implementation]] +== Implementation + +[[library]] +=== Library + +`hook.c` and `hook.h` are responsible for interacting with the config files. In +the case when the code generating a hook event doesn't have special concerns +about how to run the hooks, the hook library will provide a basic API to call +all hooks in config order with an `argv_array` provided by the code which +generates the hook event: + +*`int run_hooks(const char *hookname, struct argv_array *args)`* + +This call includes the hook command provided by `run-command.h:find_hook()`; +eventually, this legacy hook will be gated by a config `hook.runHookDir`. The +config is checked against a number of cases: + +- "no": the legacy hook will not be run +- "interactive": Git will prompt the user before running the legacy hook +- "warn": Git will print a warning to stderr before running the legacy hook +- "yes" (default): Git will silently run the legacy hook + +In case this list is expanded in the future, if a value for `hook.runHookDir` is +given which Git does not recognize, Git should discard that config entry. For +example, if "warn" was specified at system level and "junk" was specified at +global level, Git would resolve the value to "warn"; if the only time the config +was set was to "junk", Git would use the default value of "yes". + +If the caller wants to do something more complicated, the hook library can also +provide a callback API: + +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`* + +Finally, to facilitate the builtin, the library will also provide the following +APIs to interact with the config: + +---- +int set_hook_commands(const char *hookname, struct string_list *commands, + enum config_scope scope); +int set_hookcmd(const char *hookcmd, struct hookcmd options); + +int list_hook_commands(const char *hookname, struct string_list *commands); +int list_hooks_in_scope(enum config_scope scope, struct string_list *commands); +---- + +`struct hookcmd` is expected to grow in size over time as more functionality is +added to hooks; so that other parts of the code don't need to understand the +config schema, `struct hookcmd` should contain logical values instead of string +pairs. + +---- +struct hookcmd { + const char *name; + const char *command; + + /* for illustration only; not planned at present */ + int parallelizable; + const char *hookcmd_before; + const char *hookcmd_after; + enum recovery_action on_fail; +} +---- + +[[builtin]] +=== Builtin + +`builtin/hook.c` is responsible for providing the frontend. It's responsible for +formatting user-provided data and then calling the library API to set the +configs as appropriate. The builtin frontend is not responsible for calling the +config directly, so that other areas of Git can rely on the hook library to +understand the most recent config schema for hooks. + +[[migration]] +=== Migration path + +[[stage-0]] +==== Stage 0 + +Hooks are called by running `run-command.h:find_hook()` with the hookname and +executing the result. The hook library and builtin do not exist. Hooks only +exist as specially named scripts within `.git/hooks/`. + +[[stage-1]] +==== Stage 1 + +`git hook list --porcelain <hook-event>` is implemented. Users can replace their +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s +output. Modifier commands like `git hook add` and `git hook edit` can be +implemented around this time as well. + +[[stage-2]] +==== Stage 2 + +`hook.h:run_hooks()` is taught to include `run-command.h:find_hook()` at the +end; calls to `find_hook()` are replaced with calls to `run_hooks()`. Users can +opt-in to config-based hooks simply by creating some in their config; otherwise +users should remain unaffected by the change. + +[[stage-3]] +==== Stage 3 + +The call to `find_hook()` inside of `run_hooks()` learns to check for a config, +`hook.runHookDir`. Users can opt into managing their hooks completely via the +config this way. + +[[stage-4]] +==== Stage 4 + +`.git/hooks` is removed from the template and the hook directory is considered +deprecated. To avoid breaking older repos, the default of `hook.runHookDir` is +not changed, and `find_hook()` is not removed. + +[[caveats]] +== Caveats + +[[security]] +=== Security and repo config + +Part of the motivation behind this refactor is to mitigate hooks as an attack +vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/] +however, as the design stands, users can still provide hooks in the repo-level +config, which is included when a repo is zipped and sent elsewhere. The +security of the repo-level config is still under discussion; this design +generally assumes the repo-level config is secure, which is not true yet. The +goal is to avoid an overcomplicated design to work around a problem which has +ceased to exist. + +[[ease-of-use]] +=== Ease of use + +The config schema is nontrivial; that's why it's important for the `git hook` +modifier commands to be usable. Contributors with UX expertise are encouraged to +share their suggestions. + +[[alternatives]] +== Alternative approaches + +A previous summary of alternatives exists in the +archives.footnote:[https://lore.kernel.org/git/20191116011125.GG22855@google.com] + +[[status-quo]] +=== Status quo + +Today users can implement multihooks themselves by using a "trampoline script" +as their hook, and pointing that script to a directory or list of other scripts +they wish to run. + +[[hook-directories]] +=== Hook directories + +Other contributors have suggested Git learn about the existence of a directory +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order. + +[[comparison]] +=== Comparison table + +.Comparison of alternatives +|=== +|Feature |Config-based hooks |Hook directories |Status quo + +|Supports multiple hooks +|Natively +|Natively +|With user effort + +|Safer for zipped repos +|A little +|No +|No + +|Previous hooks just work +|If configured +|Yes +|Yes + +|Can install one hook to many repos +|Yes +|No +|No + +|Discoverability +|Better (in `git help git`) +|Same as before +|Same as before + +|Hard to run unexpected hook +|If configured +|No +|No +|=== + +[[future-work]] +== Future work + +[[execution-ordering]] +=== Execution ordering + +We may find that config order is insufficient for some users; for example, +config order makes it difficult to add a new hook to the system or global config +which runs at the end of the hook list. A new ordering schema should be: + +1) Specified by a `hook.order` config, so that users will not unexpectedly see +their order change; + +2) Either dependency or numerically based. + +Dependency-based ordering is prone to classic linked-list problems, like a +cycles and handling of missing dependencies. But, it paves the way for enabling +parallelization if some tasks truly depend on others. + +Numerical ordering makes it tricky for Git to generate suggested ordering +numbers for each command, but is easy to determine a definitive order. + +[[parallelization]] +=== Parallelization + +Users with many hooks might want to run them simultaneously, if the hooks don't +modify state; if one hook depends on another's output, then users will want to +specify those dependencies. If we decide to solve this problem, we may want to +look to modern build systems for inspiration on how to manage dependencies and +parallel tasks. + +[[securing-hookdir-hooks]] +=== Securing hookdir hooks + +With the design as written in this doc, it's still possible for a malicious user +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then +zip their repo and send it to another user. It may be necessary to teach Git to +only allow inlined hooks like this if they were configured outside of the local +scope (in other words, only run hookcmds, and only allow hookcmds to be +configured in global or system scope); or another approach, like a list of safe +projects, might be useful. It may also be sufficient (or at least useful) to +teach a `hook.disableAll` config or similar flag to the Git executable. + +[[submodule-inheritance]] +=== Submodule inheritance + +It's possible some submodules may want to run the identical set of hooks that +their superrepo runs. While a globally-configured hook set is helpful, it's not +a great solution for users who have multiple repos-with-submodules under the +same user. It would be useful for submodules to learn how to run hooks from +their superrepo's config, or inherit that hook setting. + +[[glossary]] +== Glossary + +*hook event* + +A point during Git's execution where user scripts may be run, for example, +_prepare-commit-msg_ or _pre-push_. + +*hook command* + +A user script or executable which will be run on one or more hook events.
Begin a design document for config-based hooks, managed via git-hook. Focus on an overview of the implementation and motivation for design decisions. Briefly discuss the alternatives considered before this point. Also, attempt to redefine terms to fit into a multihook world. Signed-off-by: Emily Shaffer <emilyshaffer@google.com> --- Documentation/Makefile | 1 + .../technical/config-based-hooks.txt | 354 ++++++++++++++++++ 2 files changed, 355 insertions(+) create mode 100644 Documentation/technical/config-based-hooks.txt