[v4,1/9] doc: propose hooks managed by the config

Message ID	20200909004939.1942347-2-emilyshaffer@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=gV3S=CS=vger.kernel.org=git-owner@kernel.org> Date: Tue, 8 Sep 2020 17:49:31 -0700 In-Reply-To: <20200909004939.1942347-1-emilyshaffer@google.com> Message-Id: <20200909004939.1942347-2-emilyshaffer@google.com> Mime-Version: 1.0 References: <20200909004939.1942347-1-emilyshaffer@google.com> Subject: [PATCH v4 1/9] doc: propose hooks managed by the config From: Emily Shaffer <emilyshaffer@google.com> To: git@vger.kernel.org Cc: Emily Shaffer <emilyshaffer@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	propose config-based hooks \| expand [v4,0/9] propose config-based hooks [v4,1/9] doc: propose hooks managed by the config [v4,2/9] hook: scaffolding for git-hook subcommand [v4,3/9] hook: add list command [v4,4/9] hook: add --porcelain to list command [v4,5/9] parse-options: parse into strvec [v4,6/9] hook: add 'run' subcommand [v4,7/9] hook: replace run-command.h:find_hook [v4,8/9] commit: use config-based hooks [v4,9/9] run_commit_hook: take strvec instead of varargs

Emily Shaffer Sept. 9, 2020, 12:49 a.m. UTC

Begin a design document for config-based hooks, managed via git-hook.
Focus on an overview of the implementation and motivation for design
decisions. Briefly discuss the alternatives considered before this
point. Also, attempt to redefine terms to fit into a multihook world.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
 Documentation/Makefile                        |   1 +
 .../technical/config-based-hooks.txt          | 354 ++++++++++++++++++
 2 files changed, 355 insertions(+)
 create mode 100644 Documentation/technical/config-based-hooks.txt

Jonathan Tan Sept. 23, 2020, 10:59 p.m. UTC | #1

For this review, I'll just concern myself with overall design and
structure.

For this patch, overall I think it's better if there's a clear
distinction between what we are implementing now and what we are
implementing later.

> +[[motivation]]
> +== Motivation
> +
> +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as
> +the only source of hooks to execute, in a way which is friendly to users with
> +multiple repos which have similar needs.

I don't understand what "first-class citizen" here means - probably
better to just omit that phrase and describe the new way of doing hooks.

> +[[config-schema-hook]]
> +==== `hook`
> +
> +Primarily contains subsections for each hook event. These order of these
> +subsections defines the hook command execution order

The execution order is defined by the order of a multivalue config
variable, I think, not the order of subsections? Besides, I believe that
there's one subsection per hook event (e.g. hook."pre-commit"), not one
subsection per command.

> ; hook commands can be
> +specified by setting the value directly to the command if no additional
> +configuration is needed, or by setting the value as the name of a `hookcmd`. If
> +Git does not find a `hookcmd` whose subsection matches the value of the given
> +command string, Git will try to execute the string directly. Hooks are executed
> +by passing the resolved command string to the shell.

[snip]

> Hook event subsections can
> +also contain per-hook-event settings.

If this is not yet implemented, maybe list under "future work".

> +
> +Also contains top-level hook execution settings, for example,
> +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`. (These settings are
> +described more in <<library,Library>>.)

I think it's clearer if you list this under "future work" - I didn't see
any implementation of this.

> +[hook "pre-commit"]
> +  command = perl-linter
> +  command = /usr/bin/git-secrets --pre-commit
> +
> +[hook "pre-applypatch"]
> +  command = perl-linter
> +  error = ignore

Is "error" implemented?

> +
> +[hook]
> +  runHookDir = interactive

Same question for "runHookDir".

> +[[config-schema-hookcmd]]
> +==== `hookcmd`
> +
> +Defines a hook command and its attributes, which will be used when a hook event
> +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> +events, but event-specific attributes can also be supplied. The example runs
> +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> +include this config, the hook command will be skipped for all events to which
> +it's normally subscribed _except_ `pre-commit`.
> +
> +----
> +[hookcmd "perl-linter"]
> +  command = /usr/bin/lint-it --language=perl
> +  skip = true
> +  pre-commit-skip = false
> +----

And the skips. (And several more below which I will skip.)

> +If the caller wants to do something more complicated, the hook library can also
> +provide a callback API:
> +
> +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`*

Is there a use case that would need such a function?

> +[[migration]]
> +=== Migration path
> +
> +[[stage-0]]
> +==== Stage 0
> +
> +Hooks are called by running `run-command.h:find_hook()` with the hookname and
> +executing the result. The hook library and builtin do not exist. Hooks only
> +exist as specially named scripts within `.git/hooks/`.
> +
> +[[stage-1]]
> +==== Stage 1
> +
> +`git hook list --porcelain <hook-event>` is implemented. Users can replace their
> +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s
> +output. Modifier commands like `git hook add` and `git hook edit` can be
> +implemented around this time as well.

This seems to contradict patch 8, which teaches Git to use the configs
directly without any change to .git/hooks/<hook-event> (at least for
certain commit-related hooks).

> +[[future-work]]
> +== Future work
> +
> +[[execution-ordering]]
> +=== Execution ordering
> +
> +We may find that config order is insufficient for some users; for example,
> +config order makes it difficult to add a new hook to the system or global config
> +which runs at the end of the hook list. A new ordering schema should be:
> +
> +1) Specified by a `hook.order` config, so that users will not unexpectedly see
> +their order change;
> +
> +2) Either dependency or numerically based.
> +
> +Dependency-based ordering is prone to classic linked-list problems, like a
> +cycles and handling of missing dependencies. But, it paves the way for enabling
> +parallelization if some tasks truly depend on others.
> +
> +Numerical ordering makes it tricky for Git to generate suggested ordering
> +numbers for each command, but is easy to determine a definitive order.

With this schema, and with the "skip" behavior described above (but not
implemented in this patch set), rudimentary ordering can already be
done; because a hook is removed and reinserted whenever it appears in
the config, even a hook X in the system config can be made to run after a
hook Y in the worktree config by adding Y then X in the worktree config,
and if we want to disable X instead, we can just add "skip" to X.

Emily Shaffer Sept. 24, 2020, 9:54 p.m. UTC | #2

On Wed, Sep 23, 2020 at 03:59:10PM -0700, Jonathan Tan wrote:
> 
> For this review, I'll just concern myself with overall design and
> structure.

Thanks - the design doc is now slightly old, so it's nice to have some
fresh eyes on it.

> 
> For this patch, overall I think it's better if there's a clear
> distinction between what we are implementing now and what we are
> implementing later.

I took a light hand when I checked for this - the topic isn't complete
yet, and there's some work in the design doc which I want to include in
this topic, but which hasn't been sent around (or written) yet.

> 
> > +[[motivation]]
> > +== Motivation
> > +
> > +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as
> > +the only source of hooks to execute, in a way which is friendly to users with
> > +multiple repos which have similar needs.
> 
> I don't understand what "first-class citizen" here means - probably
> better to just omit that phrase and describe the new way of doing hooks.

Sure.

> 
> > +[[config-schema-hook]]
> > +==== `hook`
> > +
> > +Primarily contains subsections for each hook event. These order of these
> > +subsections defines the hook command execution order
> 
> The execution order is defined by the order of a multivalue config
> variable, I think, not the order of subsections? Besides, I believe that
> there's one subsection per hook event (e.g. hook."pre-commit"), not one
> subsection per command.

Ok. Have changed to "The order of variables in these subsections
defines..."

> 
> > ; hook commands can be
> > +specified by setting the value directly to the command if no additional
> > +configuration is needed, or by setting the value as the name of a `hookcmd`. If
> > +Git does not find a `hookcmd` whose subsection matches the value of the given
> > +command string, Git will try to execute the string directly. Hooks are executed
> > +by passing the resolved command string to the shell.
> 
> [snip]
> 
> > Hook event subsections can
> > +also contain per-hook-event settings.
> 
> If this is not yet implemented, maybe list under "future work".

Good idea. Done.

> 
> > +
> > +Also contains top-level hook execution settings, for example,
> > +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`. (These settings are
> > +described more in <<library,Library>>.)
> 
> I think it's clearer if you list this under "future work" - I didn't see
> any implementation of this.

Yeah, this is out of sync with what the implementation ended up looking
like; disableAll might still be a useful thing to include in the initial
feature topic, so I won't remove it, but warnHookDir is not necessary.

> 
> > +[hook "pre-commit"]
> > +  command = perl-linter
> > +  command = /usr/bin/git-secrets --pre-commit
> > +
> > +[hook "pre-applypatch"]
> > +  command = perl-linter
> > +  error = ignore
> 
> Is "error" implemented?

No, have marked it with a comment.

> 
> > +
> > +[hook]
> > +  runHookDir = interactive
> 
> Same question for "runHookDir".

It is in the reroll I'm trying to get out this week :)

> 
> > +[[config-schema-hookcmd]]
> > +==== `hookcmd`
> > +
> > +Defines a hook command and its attributes, which will be used when a hook event
> > +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> > +events, but event-specific attributes can also be supplied. The example runs
> > +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> > +include this config, the hook command will be skipped for all events to which
> > +it's normally subscribed _except_ `pre-commit`.
> > +
> > +----
> > +[hookcmd "perl-linter"]
> > +  command = /usr/bin/lint-it --language=perl
> > +  skip = true
> > +  pre-commit-skip = false
> > +----
> 
> And the skips. (And several more below which I will skip.)

Again, this is in the reroll I'm working on.

> 
> > +If the caller wants to do something more complicated, the hook library can also
> > +provide a callback API:
> > +
> > +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`*
> 
> Is there a use case that would need such a function?

I'm not sure yet - but I'm not quite ready to cut it from the design
doc, until I finish migrating the existing hooks and know that it's not
needed. At that point it'll make sense to move it into the future work
section.

> 
> > +[[migration]]
> > +=== Migration path
> > +
> > +[[stage-0]]
> > +==== Stage 0
> > +
> > +Hooks are called by running `run-command.h:find_hook()` with the hookname and
> > +executing the result. The hook library and builtin do not exist. Hooks only
> > +exist as specially named scripts within `.git/hooks/`.
> > +
> > +[[stage-1]]
> > +==== Stage 1
> > +
> > +`git hook list --porcelain <hook-event>` is implemented. Users can replace their
> > +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s
> > +output. Modifier commands like `git hook add` and `git hook edit` can be
> > +implemented around this time as well.
> 
> This seems to contradict patch 8, which teaches Git to use the configs
> directly without any change to .git/hooks/<hook-event> (at least for
> certain commit-related hooks).

Yeah, I think this needs to be rephrased; at this point locally I've
completely removed the --porcelain patch - I'm pretty sure it needs to
be a format string instead.

> 
> > +[[future-work]]
> > +== Future work
> > +
> > +[[execution-ordering]]
> > +=== Execution ordering
> > +
> > +We may find that config order is insufficient for some users; for example,
> > +config order makes it difficult to add a new hook to the system or global config
> > +which runs at the end of the hook list. A new ordering schema should be:
> > +
> > +1) Specified by a `hook.order` config, so that users will not unexpectedly see
> > +their order change;
> > +
> > +2) Either dependency or numerically based.
> > +
> > +Dependency-based ordering is prone to classic linked-list problems, like a
> > +cycles and handling of missing dependencies. But, it paves the way for enabling
> > +parallelization if some tasks truly depend on others.
> > +
> > +Numerical ordering makes it tricky for Git to generate suggested ordering
> > +numbers for each command, but is easy to determine a definitive order.
> 
> With this schema, and with the "skip" behavior described above (but not
> implemented in this patch set), rudimentary ordering can already be
> done; because a hook is removed and reinserted whenever it appears in
> the config, even a hook X in the system config can be made to run after a
> hook Y in the worktree config by adding Y then X in the worktree config,
> and if we want to disable X instead, we can just add "skip" to X.

Yep, that's why reordering is in the future work section :)

The problem with config ordering is like such: if I want everyone in
my enterprise to run 'git-secrets --prepare-commit-msg' as the very last
prepare-commit-msg hook, but I can only ship them an /etc/gitconfig,
then the best I can do is email my users and ask them to run 'git config
hook.prepare-commit-msg.command git-secrets-prepare-commit-msg' in every
new repo and include a 'hookcmd.git-secrets-prepare-commit-msg.command'
config in the /etc/gitconfig I ship. (I mention git-secrets here because
it's possible other hooks could have introduced credential secrets into
my user's commit message after git-secrets already ran.)

 - Emily

Ævar Arnfjörð Bjarmason Oct. 7, 2020, 9:23 a.m. UTC | #3

On Wed, Sep 09 2020, Emily Shaffer wrote:

First, thanks a lot for working on this. As you may have found I've done
some small amount of actual work in this area before, but mostly just
blathered about it on the ML.

> Begin a design document for config-based hooks, managed via git-hook.
> Focus on an overview of the implementation and motivation for design
> decisions. Briefly discuss the alternatives considered before this
> point. Also, attempt to redefine terms to fit into a multihook world.
> [...]
> +[[status-quo]]
> +=== Status quo
> +
> +Today users can implement multihooks themselves by using a "trampoline script"
> +as their hook, and pointing that script to a directory or list of other scripts
> +they wish to run.

...or by setting core.hooksPath in their local/global/system
config. Granted it doesn't cover the malicious hook injection case
you're also trying to solve, but does address e.g. having a git server
with a lot of centralized hooks.

The "trampoline script" also isn't needed for the common case you
mention, you just symlink the .git/hooks directory (as e.g. GitLab
does). People usually use a trampoline script for e.g. using GNU
parallel or something to execute N hooks.

> +[[hook-directories]]
> +=== Hook directories
> +
> +Other contributors have suggested Git learn about the existence of a directory
> +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order.

...which seems like an easy thing to add later by having a "hookdir" in
addition to "hookcmd", i.e. just specify a glob there instead of a
cmd/path.

You already use "hookdir" for something else though, so that's a bit
confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or
perhaps more confusing...

> [...]
> +[[execution-ordering]]
> +=== Execution ordering
> +
> +We may find that config order is insufficient for some users; for example,
> +config order makes it difficult to add a new hook to the system or global config
> +which runs at the end of the hook list. A new ordering schema should be:
> +
> +1) Specified by a `hook.order` config, so that users will not unexpectedly see
> +their order change;
> +
> +2) Either dependency or numerically based.
> +
> +Dependency-based ordering is prone to classic linked-list problems, like a
> +cycles and handling of missing dependencies. But, it paves the way for enabling
> +parallelization if some tasks truly depend on others.
>
> +Numerical ordering makes it tricky for Git to generate suggested ordering
> +numbers for each command, but is easy to determine a definitive order.
> +
> +[[parallelization]]
> +=== Parallelization
> +
> +Users with many hooks might want to run them simultaneously, if the hooks don't
> +modify state; if one hook depends on another's output, then users will want to
> +specify those dependencies. If we decide to solve this problem, we may want to
> +look to modern build systems for inspiration on how to manage dependencies and
> +parallel tasks.

If you're taking requests it would make me very happy if we had
parallelism in this from day one. It's the kind of thing that's hard to
do by default once a feature is shipped since people will implicitly
depend on it not being there, i.e. we won't know what we're breaking.

I think doing it this way is simple, covers most use cases, and solves a
lot of the problems you note:

1. Don't use config order to execute hooks, use glob'd name order
   regardless of origin. I.e. a system-level hook is called "001-first"
   is executed before a local hook called "999-at-the-end" (or the other
   way around, i.e. hook origin doesn't matter).

2. We execute hooks parallel in that glob order, i.e. a pthread for-loop
   that starts the 001-first task first, eventually getting to
   999-at-the-end N at a time. I.e. the same as:

       parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order>

   This allows for parallelism but guarantees the very useful case of
   having a global log hook being guaranteed to execute.

3. A hook can define "parallel=no" in its config. We'll then run it
   while no other hook is running.

4. We don't attempt to do dependencies etc, if you need that sort of
   complexity you can just make one of the hooks be a hook runner as
   users do now for the common "make it parallel" case.

It's a relatively small change to the code you have already. I.e. the
for_each() in run_hooks() would be called N times for each continuous
glob'd parallel/non-parallel segment, and hook_list()'s config parsing
would learn to spew those out as a list-of-lists.

This also gives you a rudimentary implementation of the dependency
schema you proposed for free. I.e. a definition of (pseudocode):

    hookcmd=000-first
    parallel=no

    hookcmd=250-middle-abc
    hookcmd=250-middle-xyz

    hookcmd=300-gather
    parallel=no

    hookcmd=999-the-end

Would result in the pseudocode execution of;

    segments=[[000-first],
              [250-middle-abc, 250-middle-xyz],
              [300-gather],
              [999-the-end]]
    for each s in segments:
        ok = run_in_parallel(s)
        last if !ok # or, depending on "early exit?" config

I.e.:

 * The common case of people adding N hooks won't take sum(N) time.

 * parallel=no hooks aren't run in parallel with other non-parallel
   hooks

 * We support a rudimentary dependency schema as a side-effect,
   i.e. defining 300-gather as non-parallel allows it to act as the sole
   "reduce" step in a map/reduce in a "map" step started with the 250-*
   hooks.

> +[[securing-hookdir-hooks]]
> +=== Securing hookdir hooks
> +
> +With the design as written in this doc, it's still possible for a malicious user
> +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then
> +zip their repo and send it to another user. It may be necessary to teach Git to
> +only allow inlined hooks like this if they were configured outside of the local
> +scope (in other words, only run hookcmds, and only allow hookcmds to be
> +configured in global or system scope); or another approach, like a list of safe
> +projects, might be useful. It may also be sufficient (or at least useful) to
> +teach a `hook.disableAll` config or similar flag to the Git executable.

I think this part of the doc should note a bit of the context in
https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/

I.e. even if we get a 100% secure hook implementation we've done
practically nothing for overall security, since we'll still run the
pager, aliases etc. from that local repo.

This is a great step in the right direction, but it behooves us to note
that, so some user reading this documentation without context doesn't
think inspecting untrusted repositories like that is safe just because
they set the right hook settings in their config (once what's being
proposed here is implemented).

Emily Shaffer Oct. 22, 2020, 12:58 a.m. UTC | #4

On Wed, Oct 07, 2020 at 11:23:10AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Wed, Sep 09 2020, Emily Shaffer wrote:
> 
> First, thanks a lot for working on this. As you may have found I've done
> some small amount of actual work in this area before, but mostly just
> blathered about it on the ML.
> 
> > Begin a design document for config-based hooks, managed via git-hook.
> > Focus on an overview of the implementation and motivation for design
> > decisions. Briefly discuss the alternatives considered before this
> > point. Also, attempt to redefine terms to fit into a multihook world.
> > [...]
> > +[[status-quo]]
> > +=== Status quo
> > +
> > +Today users can implement multihooks themselves by using a "trampoline script"
> > +as their hook, and pointing that script to a directory or list of other scripts
> > +they wish to run.
> 
> ...or by setting core.hooksPath in their local/global/system
> config. Granted it doesn't cover the malicious hook injection case
> you're also trying to solve, but does address e.g. having a git server
> with a lot of centralized hooks.

Aha, setting core.hooksPath in the global/system config had not occurred
to me.

> 
> The "trampoline script" also isn't needed for the common case you
> mention, you just symlink the .git/hooks directory (as e.g. GitLab
> does). People usually use a trampoline script for e.g. using GNU
> parallel or something to execute N hooks.

Hm, I don't think that's quite true. Symlinking out .git/hooks doesn't
give me more than one $HOOKDIR/pre-commit - it just gives me a different
one. So if I wanted to run three different hooks, $HOOKDIR/pre-commit
would need to do the work of all three, regardless of where $HOOKDIR
points. That's what I meant when I said "multihooks" in this section.

But I think what you're trying to say is this: the "status quo" section
doesn't fully cover the status quo. There are more tricks than I
mentioned, e.g. 'git config --global core.hooksPath
/home/emily/githook/' to get the same set of hooks to run everywhere.
This approach still has some drawbacks - for example, it doesn't allow
me to use language-specific linters if I have repos in various
languages, without exempting an individual repo from the ~/githook/ by
'git config --local core.hooksPath
/home/emily/my-python-thing/.git/hook'.

It looks like, then, the "status quo" section needs some rework for the
next iteration.

> 
> 
> > +[[hook-directories]]
> > +=== Hook directories
> > +
> > +Other contributors have suggested Git learn about the existence of a directory
> > +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order.
> 
> ...which seems like an easy thing to add later by having a "hookdir" in
> addition to "hookcmd", i.e. just specify a glob there instead of a
> cmd/path.

Hum, interesting! Something like so:

[hook.pre-commit]
  command = last-minute-checks

[hookdir.last-minute-checks]
  dir = /home/emily/last-minute-checks/*

And then the hooks library knows to go and run everything in
~/last-minute-checks/. This is easier to keep fresh than:

[hook.pre-commit]
  command = /home/emily/last-minute-checks/c-linter
  command = /home/emily/last-minute-checks/check-for-debug-prints
  command = /home/emily/last-minute-checks/check-for-notes
  ...

I actually like the idea of this for folks who might have a small number
of hooks they wrote for themselves. I wonder if it's applicable for
something like git-secrets, which presumably users would grab with a
'git clone' later.

It doesn't seem at odds with the rest of the design - how would you feel
about me adding it to the "future work" section at the end? Future work,
rather than "Emily will do this in the next couple of rounds", because:
 - I think nobody already has their hooks in $HOOKDIR/hook/pre-commit.d
   without a corresponding trampoline in $HOOKDIR/hook/pre-commit; so
   they could still call that trampoline, for now
 - I think it might be prone to some bikeshedding - e.g. should we
   recurse into ~/last-minute-checks/linters/c/? how far? what if some
   script requires magic options? etc? But as I'm typing those questions
   out they sound mostly trivial or ridiculous, so maybe my assessment
   is wrong here.
 - It sounds like you might be keen to write it, or at the very least,
   more keen than me
 - Practically speaking, I am not sure I have time to do it alongside
   the rest of the series. Again, my bikeshedding assessment could be
   wrong, and this extra feature could be totally trivial.

> You already use "hookdir" for something else though, so that's a bit
> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or
> perhaps more confusing...

"Hookdir" might be the wrong word to use, too - maybe it's better to
mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are
similar enough that I think it would be confusing, and "hookcmd" is
already getting some side-eye from me for not being a great choice.

Some thoughts for "a path to a directory in which multiple scripts for a
single hook live":
 - hookset
 - hookbatch (ugh, redundant with MS scripting)
 - hook.pre-commit.all-of = ~/last-minute-checks/
 -  "   "  .everything-in = "   "
...?

I think I named a couple silly ideas for "hookcmd" in another mail.

> 
> > [...]
> > +[[execution-ordering]]
> > +=== Execution ordering
> > +
> > +We may find that config order is insufficient for some users; for example,
> > +config order makes it difficult to add a new hook to the system or global config
> > +which runs at the end of the hook list. A new ordering schema should be:
> > +
> > +1) Specified by a `hook.order` config, so that users will not unexpectedly see
> > +their order change;
> > +
> > +2) Either dependency or numerically based.
> > +
> > +Dependency-based ordering is prone to classic linked-list problems, like a
> > +cycles and handling of missing dependencies. But, it paves the way for enabling
> > +parallelization if some tasks truly depend on others.
> >
> > +Numerical ordering makes it tricky for Git to generate suggested ordering
> > +numbers for each command, but is easy to determine a definitive order.
> > +
> > +[[parallelization]]
> > +=== Parallelization
> > +
> > +Users with many hooks might want to run them simultaneously, if the hooks don't
> > +modify state; if one hook depends on another's output, then users will want to
> > +specify those dependencies. If we decide to solve this problem, we may want to
> > +look to modern build systems for inspiration on how to manage dependencies and
> > +parallel tasks.
> 
> If you're taking requests it would make me very happy if we had
> parallelism in this from day one. It's the kind of thing that's hard to
> do by default once a feature is shipped since people will implicitly
> depend on it not being there, i.e. we won't know what we're breaking.

Hm. This might be tricky.

Some hooks are inherently not able to be parallelized - for example,
hooks which modify a given file, like the commit message draft. In
general, based on the handful of hooks I've converted locally, it's hard
to check whether a callsite assumes a hook could have modified state.
Usually this seems to be done with a call to find_hook() ("was there a
hook that might have run?") and then reopening the file. Sometimes a
file is reopened unconditionally. Sometimes the find_hook() call is
very far away from the run_hook_le() call.

The rest, then, which only read a file and say yes or no, probably don't
need to have a strict ordering - at least as far as Git is concerned.
And I think that's what you're worried about:

[hook.theoretical-parallelizable-event]
  command = check-and-mark-a-file-foo
  command = check-file-foo-and-do-something-else
  command = do-something-totally-unrelated

On day 1 of this feature, as written, this is safe. But if we aren't
careful and we start to parallelize *without* setting up dependency
ordering, e.g. 'git config --global hook.parallelize', and turn that on
by default without warning anyone, then the author of this config will
be unhappy.

But as I read further, you're talking about specifically *not* allowing
dependency ordering...

> 
> I think doing it this way is simple, covers most use cases, and solves a
> lot of the problems you note:
> 
> 1. Don't use config order to execute hooks, use glob'd name order
>    regardless of origin. I.e. a system-level hook is called "001-first"
>    is executed before a local hook called "999-at-the-end" (or the other
>    way around, i.e. hook origin doesn't matter).

Can you say a little more about why different ordering schema would
matter, if we effectively don't care which jobs are in parallel with
which, as you describe? I'm not quite following.

> 
> 2. We execute hooks parallel in that glob order, i.e. a pthread for-loop
>    that starts the 001-first task first, eventually getting to
>    999-at-the-end N at a time. I.e. the same as:
> 
>        parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order>
> 
>    This allows for parallelism but guarantees the very useful case of
>    having a global log hook being guaranteed to execute.

Ah, I think you're suggesting the glob order specifically to make up for
--halt-on-error in this case.

> 
> 3. A hook can define "parallel=no" in its config. We'll then run it
>    while no other hook is running.
> 
> 4. We don't attempt to do dependencies etc, if you need that sort of
>    complexity you can just make one of the hooks be a hook runner as
>    users do now for the common "make it parallel" case.

If we aren't attempting any magical ordering, then I don't really see a
big difference between glob vs. config order - presumably for most users
the effect would be same, e.g. N = $(nproc * hyperthreading), M = (number of scripts I
care to run) probably will often result in M < N, so all jobs would run
simultaneously anyways.

> 
> It's a relatively small change to the code you have already. I.e. the
> for_each() in run_hooks() would be called N times for each continuous
> glob'd parallel/non-parallel segment, and hook_list()'s config parsing
> would learn to spew those out as a list-of-lists.
> 
> This also gives you a rudimentary implementation of the dependency
> schema you proposed for free. I.e. a definition of (pseudocode):
> 
>     hookcmd=000-first
>     parallel=no
> 
>     hookcmd=250-middle-abc
>     hookcmd=250-middle-xyz
> 
>     hookcmd=300-gather
>     parallel=no
> 
>     hookcmd=999-the-end
> 
> Would result in the pseudocode execution of;
> 
>     segments=[[000-first],
>               [250-middle-abc, 250-middle-xyz],

Hum. This seems to say "folks who started their hooks with the same
number agree that their hooks should also run simultaneously" - which
sounds like an even harder problem than "how do I know my ordering
number isn't the same as someone else's in another config file". Or else
I'm misunderstanding your pseudo :)

Ah, I see later you mention it directly as a dependency schema. I think
this offers the same set of problems I saw trying to use this as an
ordering schema, but worse in all the usual ways parallelism provides.
It is still impossible for someone writing a global or system config to
know where in the dependency chain more local hooks reside.

>               [300-gather],
>               [999-the-end]]
>     for each s in segments:
>         ok = run_in_parallel(s)
>         last if !ok # or, depending on "early exit?" config
> 
> I.e.:
> 
>  * The common case of people adding N hooks won't take sum(N) time.
> 
>  * parallel=no hooks aren't run in parallel with other non-parallel
>    hooks
> 
>  * We support a rudimentary dependency schema as a side-effect,
>    i.e. defining 300-gather as non-parallel allows it to act as the sole
>    "reduce" step in a map/reduce in a "map" step started with the 250-*
>    hooks.

As I understand it, the main concerns you have about getting
parallelization to happen on day 1 are like so:

 - keep users from assuming serial execution
 - avoid a messy schema change to deal with dependencies

I see the benefit of the former; I don't like the new schema proposed by
the latter. I do see that not turning it on day 1 would prevent us from
turning it on by default later, in case users did something silly like
assume dependencies.

Hrm.

I think we could turn on parallelization day 1 by providing an
explicitly-parallel API in hook.h (and a similar 'git hook run foo
--parallel' flag), and being more careful when converting hooks to call
run_hooks_parallel() instead of run_hooks(). That way hooks which will
never be parallelizable (e.g. commit-msg) won't get burned later by us
trying to be clever. Everyone else who can be parallelized is, in config
order, with no dependency management whatsoever. That leaves the door
open for us to add dependency management however we want later on, but
users can still roll their own with a launcher script today.

I know I rambled a lot - I was trying to convince myself :) For now, I'd
prefer to add more detail to the "future work" section of the doc and
then not touch this problem with a very long pole... ;) Thoughts
welcome.

> 
> > +[[securing-hookdir-hooks]]
> > +=== Securing hookdir hooks
> > +
> > +With the design as written in this doc, it's still possible for a malicious user
> > +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then
> > +zip their repo and send it to another user. It may be necessary to teach Git to
> > +only allow inlined hooks like this if they were configured outside of the local
> > +scope (in other words, only run hookcmds, and only allow hookcmds to be
> > +configured in global or system scope); or another approach, like a list of safe
> > +projects, might be useful. It may also be sufficient (or at least useful) to
> > +teach a `hook.disableAll` config or similar flag to the Git executable.
> 
> I think this part of the doc should note a bit of the context in
> https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/
> 
> I.e. even if we get a 100% secure hook implementation we've done
> practically nothing for overall security, since we'll still run the
> pager, aliases etc. from that local repo.
> 
> This is a great step in the right direction, but it behooves us to note
> that, so some user reading this documentation without context doesn't
> think inspecting untrusted repositories like that is safe just because
> they set the right hook settings in their config (once what's being
> proposed here is implemented).

Yeah, I agree. I'll try to make that clearer in the doc in the next
reroll.

Very sorry again for having missed this - I think the first weeks of
October I was working from my local todo list instead of from the list
of replies in mutt. Urk.

 - Emily

Ævar Arnfjörð Bjarmason Oct. 23, 2020, 7:10 p.m. UTC | #5

On Thu, Oct 22 2020, Emily Shaffer wrote:

> On Wed, Oct 07, 2020 at 11:23:10AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> 
>> 
>> On Wed, Sep 09 2020, Emily Shaffer wrote:
>> 
>> First, thanks a lot for working on this. As you may have found I've done
>> some small amount of actual work in this area before, but mostly just
>> blathered about it on the ML.
>> 
>> > Begin a design document for config-based hooks, managed via git-hook.
>> > Focus on an overview of the implementation and motivation for design
>> > decisions. Briefly discuss the alternatives considered before this
>> > point. Also, attempt to redefine terms to fit into a multihook world.
>> > [...]
>> > +[[status-quo]]
>> > +=== Status quo
>> > +
>> > +Today users can implement multihooks themselves by using a "trampoline script"
>> > +as their hook, and pointing that script to a directory or list of other scripts
>> > +they wish to run.
>> 
>> ...or by setting core.hooksPath in their local/global/system
>> config. Granted it doesn't cover the malicious hook injection case
>> you're also trying to solve, but does address e.g. having a git server
>> with a lot of centralized hooks.
>
> Aha, setting core.hooksPath in the global/system config had not occurred
> to me.

It's a useful hack.

>> 
>> The "trampoline script" also isn't needed for the common case you
>> mention, you just symlink the .git/hooks directory (as e.g. GitLab
>> does). People usually use a trampoline script for e.g. using GNU
>> parallel or something to execute N hooks.
>
> Hm, I don't think that's quite true. Symlinking out .git/hooks doesn't
> give me more than one $HOOKDIR/pre-commit - it just gives me a different
> one. So if I wanted to run three different hooks, $HOOKDIR/pre-commit
> would need to do the work of all three, regardless of where $HOOKDIR
> points. That's what I meant when I said "multihooks" in this section.
>
> But I think what you're trying to say is this: the "status quo" section
> doesn't fully cover the status quo. There are more tricks than I
> mentioned, e.g. 'git config --global core.hooksPath
> /home/emily/githook/' to get the same set of hooks to run everywhere.
> This approach still has some drawbacks - for example, it doesn't allow
> me to use language-specific linters if I have repos in various
> languages, without exempting an individual repo from the ~/githook/ by
> 'git config --local core.hooksPath
> /home/emily/my-python-thing/.git/hook'.
>
> It looks like, then, the "status quo" section needs some rework for the
> next iteration.

Re-reading your original patch I think I just misread that. I thought
you were saying a stub script was needed in the .git to point to a
multi-hook script, but I was pointing out that you can just symlink to
the multi-hook script (as e.g. GitLab does), but reading it again & this
I don't thin you meant that at all. Nevermind.

>> 
>> 
>> > +[[hook-directories]]
>> > +=== Hook directories
>> > +
>> > +Other contributors have suggested Git learn about the existence of a directory
>> > +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order.
>> 
>> ...which seems like an easy thing to add later by having a "hookdir" in
>> addition to "hookcmd", i.e. just specify a glob there instead of a
>> cmd/path.
>
> Hum, interesting! Something like so:
>
> [hook.pre-commit]
>   command = last-minute-checks
>
> [hookdir.last-minute-checks]
>   dir = /home/emily/last-minute-checks/*
>
> And then the hooks library knows to go and run everything in
> ~/last-minute-checks/. This is easier to keep fresh than:
>
> [hook.pre-commit]
>   command = /home/emily/last-minute-checks/c-linter
>   command = /home/emily/last-minute-checks/check-for-debug-prints
>   command = /home/emily/last-minute-checks/check-for-notes
>   ...
>
> I actually like the idea of this for folks who might have a small number
> of hooks they wrote for themselves. I wonder if it's applicable for
> something like git-secrets, which presumably users would grab with a
> 'git clone' later.
>
> It doesn't seem at odds with the rest of the design - how would you feel
> about me adding it to the "future work" section at the end? Future work,
> rather than "Emily will do this in the next couple of rounds", because:
>  - I think nobody already has their hooks in $HOOKDIR/hook/pre-commit.d
>    without a corresponding trampoline in $HOOKDIR/hook/pre-commit; so
>    they could still call that trampoline, for now
>  - I think it might be prone to some bikeshedding - e.g. should we
>    recurse into ~/last-minute-checks/linters/c/? how far? what if some
>    script requires magic options? etc? But as I'm typing those questions
>    out they sound mostly trivial or ridiculous, so maybe my assessment
>    is wrong here.
>  - It sounds like you might be keen to write it, or at the very least,
>    more keen than me
>  - Practically speaking, I am not sure I have time to do it alongside
>    the rest of the series. Again, my bikeshedding assessment could be
>    wrong, and this extra feature could be totally trivial.
>
>> You already use "hookdir" for something else though, so that's a bit
>> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or
>> perhaps more confusing...
>
> "Hookdir" might be the wrong word to use, too - maybe it's better to
> mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are
> similar enough that I think it would be confusing, and "hookcmd" is
> already getting some side-eye from me for not being a great choice.
>
> Some thoughts for "a path to a directory in which multiple scripts for a
> single hook live":
>  - hookset
>  - hookbatch (ugh, redundant with MS scripting)
>  - hook.pre-commit.all-of = ~/last-minute-checks/
>  -  "   "  .everything-in = "   "
> ...?
>
> I think I named a couple silly ideas for "hookcmd" in another mail.

To both of the above: Yeah I'm not saying you need to do the work, just
that I think it would be a useful case to bikeshed now since it seems
inevitable that we'll get a "find hooks in this dir by glob" once we
have this facility. So having a config syntax for that which isn't
overly confusing / extensible to that case would be useful, i.e. as the
current syntax uses "dir" already.

>> 
>> > [...]
>> > +[[execution-ordering]]
>> > +=== Execution ordering
>> > +
>> > +We may find that config order is insufficient for some users; for example,
>> > +config order makes it difficult to add a new hook to the system or global config
>> > +which runs at the end of the hook list. A new ordering schema should be:
>> > +
>> > +1) Specified by a `hook.order` config, so that users will not unexpectedly see
>> > +their order change;
>> > +
>> > +2) Either dependency or numerically based.
>> > +
>> > +Dependency-based ordering is prone to classic linked-list problems, like a
>> > +cycles and handling of missing dependencies. But, it paves the way for enabling
>> > +parallelization if some tasks truly depend on others.
>> >
>> > +Numerical ordering makes it tricky for Git to generate suggested ordering
>> > +numbers for each command, but is easy to determine a definitive order.
>> > +
>> > +[[parallelization]]
>> > +=== Parallelization
>> > +
>> > +Users with many hooks might want to run them simultaneously, if the hooks don't
>> > +modify state; if one hook depends on another's output, then users will want to
>> > +specify those dependencies. If we decide to solve this problem, we may want to
>> > +look to modern build systems for inspiration on how to manage dependencies and
>> > +parallel tasks.
>> 
>> If you're taking requests it would make me very happy if we had
>> parallelism in this from day one. It's the kind of thing that's hard to
>> do by default once a feature is shipped since people will implicitly
>> depend on it not being there, i.e. we won't know what we're breaking.
>
> Hm. This might be tricky.
>
> Some hooks are inherently not able to be parallelized - for example,
> hooks which modify a given file, like the commit message draft. In
> general, based on the handful of hooks I've converted locally, it's hard
> to check whether a callsite assumes a hook could have modified state.
> Usually this seems to be done with a call to find_hook() ("was there a
> hook that might have run?") and then reopening the file. Sometimes a
> file is reopened unconditionally. Sometimes the find_hook() call is
> very far away from the run_hook_le() call.
>
> The rest, then, which only read a file and say yes or no, probably don't
> need to have a strict ordering - at least as far as Git is concerned.
> And I think that's what you're worried about:
>
> [hook.theoretical-parallelizable-event]
>   command = check-and-mark-a-file-foo
>   command = check-file-foo-and-do-something-else
>   command = do-something-totally-unrelated
>
> On day 1 of this feature, as written, this is safe. But if we aren't
> careful and we start to parallelize *without* setting up dependency
> ordering, e.g. 'git config --global hook.parallelize', and turn that on
> by default without warning anyone, then the author of this config will
> be unhappy.
>
> But as I read further, you're talking about specifically *not* allowing
> dependency ordering...
>
>> 
>> I think doing it this way is simple, covers most use cases, and solves a
>> lot of the problems you note:
>> 
>> 1. Don't use config order to execute hooks, use glob'd name order
>>    regardless of origin. I.e. a system-level hook is called "001-first"
>>    is executed before a local hook called "999-at-the-end" (or the other
>>    way around, i.e. hook origin doesn't matter).
>
> Can you say a little more about why different ordering schema would
> matter, if we effectively don't care which jobs are in parallel with
> which, as you describe? I'm not quite following.
>
>> 
>> 2. We execute hooks parallel in that glob order, i.e. a pthread for-loop
>>    that starts the 001-first task first, eventually getting to
>>    999-at-the-end N at a time. I.e. the same as:
>> 
>>        parallel --jobs N --halt-on-error soon,fail=1" ::: <hooks-in-glob-order>
>> 
>>    This allows for parallelism but guarantees the very useful case of
>>    having a global log hook being guaranteed to execute.
>
> Ah, I think you're suggesting the glob order specifically to make up for
> --halt-on-error in this case.
>
>> 
>> 3. A hook can define "parallel=no" in its config. We'll then run it
>>    while no other hook is running.
>> 
>> 4. We don't attempt to do dependencies etc, if you need that sort of
>>    complexity you can just make one of the hooks be a hook runner as
>>    users do now for the common "make it parallel" case.
>
> If we aren't attempting any magical ordering, then I don't really see a
> big difference between glob vs. config order - presumably for most users
> the effect would be same, e.g. N = $(nproc * hyperthreading), M = (number of scripts I
> care to run) probably will often result in M < N, so all jobs would run
> simultaneously anyways.
>
>> 
>> It's a relatively small change to the code you have already. I.e. the
>> for_each() in run_hooks() would be called N times for each continuous
>> glob'd parallel/non-parallel segment, and hook_list()'s config parsing
>> would learn to spew those out as a list-of-lists.
>> 
>> This also gives you a rudimentary implementation of the dependency
>> schema you proposed for free. I.e. a definition of (pseudocode):
>> 
>>     hookcmd=000-first
>>     parallel=no
>> 
>>     hookcmd=250-middle-abc
>>     hookcmd=250-middle-xyz
>> 
>>     hookcmd=300-gather
>>     parallel=no
>> 
>>     hookcmd=999-the-end
>> 
>> Would result in the pseudocode execution of;
>> 
>>     segments=[[000-first],
>>               [250-middle-abc, 250-middle-xyz],
>
> Hum. This seems to say "folks who started their hooks with the same
> number agree that their hooks should also run simultaneously" - which
> sounds like an even harder problem than "how do I know my ordering
> number isn't the same as someone else's in another config file". Or else
> I'm misunderstanding your pseudo :)

The prefix number isn't meaningful in that way, i.e. if you have 10
threads and 5 hooks starting with 250-* they won't all be invoked at the
same time.

> Ah, I see later you mention it directly as a dependency schema. I think
> this offers the same set of problems I saw trying to use this as an
> ordering schema, but worse in all the usual ways parallelism provides.
> It is still impossible for someone writing a global or system config to
> know where in the dependency chain more local hooks reside.
>
>>               [300-gather],
>>               [999-the-end]]
>>     for each s in segments:
>>         ok = run_in_parallel(s)
>>         last if !ok # or, depending on "early exit?" config
>> 
>> I.e.:
>> 
>>  * The common case of people adding N hooks won't take sum(N) time.
>> 
>>  * parallel=no hooks aren't run in parallel with other non-parallel
>>    hooks
>> 
>>  * We support a rudimentary dependency schema as a side-effect,
>>    i.e. defining 300-gather as non-parallel allows it to act as the sole
>>    "reduce" step in a map/reduce in a "map" step started with the 250-*
>>    hooks.
>
> As I understand it, the main concerns you have about getting
> parallelization to happen on day 1 are like so:
>
>  - keep users from assuming serial execution
>  - avoid a messy schema change to deal with dependencies
>
> I see the benefit of the former; I don't like the new schema proposed by
> the latter. I do see that not turning it on day 1 would prevent us from
> turning it on by default later, in case users did something silly like
> assume dependencies.
>
> Hrm.
>
> I think we could turn on parallelization day 1 by providing an
> explicitly-parallel API in hook.h (and a similar 'git hook run foo
> --parallel' flag), and being more careful when converting hooks to call
> run_hooks_parallel() instead of run_hooks(). That way hooks which will
> never be parallelizable (e.g. commit-msg) won't get burned later by us
> trying to be clever. Everyone else who can be parallelized is, in config
> order, with no dependency management whatsoever. That leaves the door
> open for us to add dependency management however we want later on, but
> users can still roll their own with a launcher script today.
>
> I know I rambled a lot - I was trying to convince myself :) For now, I'd
> prefer to add more detail to the "future work" section of the doc and
> then not touch this problem with a very long pole... ;) Thoughts
> welcome.

I'm replying to much of the above in general here, particularly since
much of it was in the form of a question you answered yourself later :)

Yes as you point out the reason I'm raising the parallel thing now is
"keep users from assuming serial execution", i.e. any implementation
that isn't like that from day 1 will need more verbose syntax to opt-in
to that.

I think parallel is the sane default, although there's a really strong
case as you point out with the "commit-msg" hook for treating that on a
hook-type basis. E.g. commit-msg (in-place editing of as single file)
being non-parallel by default, but e.g. post-commit, pre-applypatch,
pre-receive and other "should we proceed?" hooks being parallel.

But I'm also raising a general concern with the design of the API /
command around this.

I don't see the need for having a git hook list/edit/add command at
all. We should just keep this simpler and be able to point to "git
config --add/--get-regexp" etc.

It seems the reason to introduce this command API around it is because
you're imagining that git needs to manage hooks whose relative execution
order is important, and to later on once this lands aim to implement a
much more complex dependency management schema.

I just can't imagine a case that needs that where say those 10 hooks
need to execute in exact order 1/2/3/4 where the author of that tight
coupling wouldn't also desire to roll that all into one script, or at
least that it's an obscure enough case that we can just say "do that".

Whereas I do think "run a bunch of independent checks, if all pass
proceed" is *the* common case, e.g. adding a bunch of pre-receive
hooks. If we tell the user we'll treat those as independent programs we
can run them in parallel. The vast majority of users will benefit from
the default faster execution.

The "glob order" case I mentioned is extra complexity on top of that,
yes, but I think that concession is sane for the common case of "yes
parallel, but I want to always run the always-exit-0 log
hook". E.g. I've used this to setup a hook to run push
attempts/successes in a hook framework that runs N pre-receive hooks.

All that being said I'm open to being convinced, I just don't see what
the target user is, and the submitted docs don't really make a case for
it. I.e. there's plenty of "what" not "why would someone want this...".

>> 
>> > +[[securing-hookdir-hooks]]
>> > +=== Securing hookdir hooks
>> > +
>> > +With the design as written in this doc, it's still possible for a malicious user
>> > +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then
>> > +zip their repo and send it to another user. It may be necessary to teach Git to
>> > +only allow inlined hooks like this if they were configured outside of the local
>> > +scope (in other words, only run hookcmds, and only allow hookcmds to be
>> > +configured in global or system scope); or another approach, like a list of safe
>> > +projects, might be useful. It may also be sufficient (or at least useful) to
>> > +teach a `hook.disableAll` config or similar flag to the Git executable.
>> 
>> I think this part of the doc should note a bit of the context in
>> https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/
>> 
>> I.e. even if we get a 100% secure hook implementation we've done
>> practically nothing for overall security, since we'll still run the
>> pager, aliases etc. from that local repo.
>> 
>> This is a great step in the right direction, but it behooves us to note
>> that, so some user reading this documentation without context doesn't
>> think inspecting untrusted repositories like that is safe just because
>> they set the right hook settings in their config (once what's being
>> proposed here is implemented).
>
> Yeah, I agree. I'll try to make that clearer in the doc in the next
> reroll.
>
> Very sorry again for having missed this - I think the first weeks of
> October I was working from my local todo list instead of from the list
> of replies in mutt. Urk.

*nod*

Emily Shaffer Oct. 29, 2020, 3:38 p.m. UTC | #6

On Fri, Oct 23, 2020 at 09:10:24PM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> You already use "hookdir" for something else though, so that's a bit
> >> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or
> >> perhaps more confusing...
> >
> > "Hookdir" might be the wrong word to use, too - maybe it's better to
> > mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are
> > similar enough that I think it would be confusing, and "hookcmd" is
> > already getting some side-eye from me for not being a great choice.
> >
> > Some thoughts for "a path to a directory in which multiple scripts for a
> > single hook live":
> >  - hookset
> >  - hookbatch (ugh, redundant with MS scripting)
> >  - hook.pre-commit.all-of = ~/last-minute-checks/
> >  -  "   "  .everything-in = "   "
> > ...?
> >
> > I think I named a couple silly ideas for "hookcmd" in another mail.
> 
> To both of the above: Yeah I'm not saying you need to do the work, just
> that I think it would be a useful case to bikeshed now since it seems
> inevitable that we'll get a "find hooks in this dir by glob" once we
> have this facility. So having a config syntax for that which isn't
> overly confusing / extensible to that case would be useful, i.e. as the
> current syntax uses "dir" already.

Yeah. I'm not sure that it needs to happen right away. Because
hook.*.command // hookcommand.*.command gets passed right into
run_command()-with-shell, it's possible for a user who's keen to also
set `hook.*.command = find -type f /some/path | xargs` in the meantime.
And also because it's passed right into run_command()-with-shell, it's
hard to do some smart wildcarding on the .command config and try to
figure out the right syntax. I'd just as soon see something explicit
like the configs I mentioned above, which can be added pretty easily
after the fact. I think what you're mostly saying, though, is "Leave
some words for glob execution!" and that I can appreciate.

> > Hum. This seems to say "folks who started their hooks with the same
> > number agree that their hooks should also run simultaneously" - which
> > sounds like an even harder problem than "how do I know my ordering
> > number isn't the same as someone else's in another config file". Or else
> > I'm misunderstanding your pseudo :)
> 
> The prefix number isn't meaningful in that way, i.e. if you have 10
> threads and 5 hooks starting with 250-* they won't all be invoked at the
> same time.

Ok. I misunderstood, then.

> > I know I rambled a lot - I was trying to convince myself :) For now, I'd
> > prefer to add more detail to the "future work" section of the doc and
> > then not touch this problem with a very long pole... ;) Thoughts
> > welcome.
> 
> I'm replying to much of the above in general here, particularly since
> much of it was in the form of a question you answered yourself later :)
> 
> Yes as you point out the reason I'm raising the parallel thing now is
> "keep users from assuming serial execution", i.e. any implementation
> that isn't like that from day 1 will need more verbose syntax to opt-in
> to that.
> 
> I think parallel is the sane default, although there's a really strong
> case as you point out with the "commit-msg" hook for treating that on a
> hook-type basis. E.g. commit-msg (in-place editing of as single file)
> being non-parallel by default, but e.g. post-commit, pre-applypatch,
> pre-receive and other "should we proceed?" hooks being parallel.

Yeah. I think you've sold me. So what I will do is thus: before I send
the next reroll (as I'm pretty much done, locally, and hope to be ready
for nits next time) I'll take a look in 'git help githooks' and see
which ones expect writes to occur. I think there are more than just
"commit-msg". I'll add a bit to run_hooks() and a corresponding flag to
'git hook run', plus relevant documentation. I'll also plan to add
explicit documentation to 'git help githooks' mentioning parallel vs.
serial execution.

But I will plan on writing it stupidly - user configurable job number
but no dependency checking; and let the user turn off parallel execution
for everyone (hook.jobs=1) or for just one hook
(hook.pre-commit.parallel = false (?)). Like you and Jonathan N say, we
can add more sugar like hookcmd.*.depends later on when we need it.

> 
> But I'm also raising a general concern with the design of the API /
> command around this.
> 
> I don't see the need for having a git hook list/edit/add command at
> all. We should just keep this simpler and be able to point to "git
> config --add/--get-regexp" etc.
> 
> It seems the reason to introduce this command API around it is because
> you're imagining that git needs to manage hooks whose relative execution
> order is important, and to later on once this lands aim to implement a
> much more complex dependency management schema.

No, I don't think that's the reason to have list/edit/add. The reason is
more for discoverability (if I 'git help git' or 'git^TAB', do I see
something handy in the command list that I didn't know about before?)
and user friendliness ("I can't remember the right config options to set
this up every dang time"). And 'list', I think, is handy for giving
users a dry run of what they can expect to see happen (and where to fix
them, since it lists the origin). Yes, a user could put it all together
from invocations of 'git config', but I personally think it's more
useful for Git to tell me what Git is going to do/what Git wants than
for my meat brain to try and guess :)

> 
> I just can't imagine a case that needs that where say those 10 hooks
> need to execute in exact order 1/2/3/4 where the author of that tight
> coupling wouldn't also desire to roll that all into one script, or at
> least that it's an obscure enough case that we can just say "do that".
> 
> Whereas I do think "run a bunch of independent checks, if all pass
> proceed" is *the* common case, e.g. adding a bunch of pre-receive
> hooks. If we tell the user we'll treat those as independent programs we
> can run them in parallel. The vast majority of users will benefit from
> the default faster execution.
> 
> The "glob order" case I mentioned is extra complexity on top of that,
> yes, but I think that concession is sane for the common case of "yes
> parallel, but I want to always run the always-exit-0 log
> hook". E.g. I've used this to setup a hook to run push
> attempts/successes in a hook framework that runs N pre-receive hooks.

Reading this, I think I'm still missing something key about what you
think glob ordering provides. I'm not following why having the log hook
set early requires glob ordering over config ordering (since the config
ordering schema allows reordering via replacement), and I'm not
following why it's required to halt on failure.

> 
> All that being said I'm open to being convinced, I just don't see what
> the target user is, and the submitted docs don't really make a case for
> it. I.e. there's plenty of "what" not "why would someone want this...".

ACK. I'll try and go over the doc again before I reroll.

 - Emily

Ævar Arnfjörð Bjarmason Oct. 29, 2020, 8:04 p.m. UTC | #7

On Thu, Oct 29 2020, Emily Shaffer wrote:

> On Fri, Oct 23, 2020 at 09:10:24PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> >> You already use "hookdir" for something else though, so that's a bit
>> >> confusing, perhaps s/hookcmd/definehookcmd/ would be less confusing, or
>> >> perhaps more confusing...
>> >
>> > "Hookdir" might be the wrong word to use, too - maybe it's better to
>> > mirror "hookspath" there. Eitherway, "hookdir" and "hookspath" are
>> > similar enough that I think it would be confusing, and "hookcmd" is
>> > already getting some side-eye from me for not being a great choice.
>> >
>> > Some thoughts for "a path to a directory in which multiple scripts for a
>> > single hook live":
>> >  - hookset
>> >  - hookbatch (ugh, redundant with MS scripting)
>> >  - hook.pre-commit.all-of = ~/last-minute-checks/
>> >  -  "   "  .everything-in = "   "
>> > ...?
>> >
>> > I think I named a couple silly ideas for "hookcmd" in another mail.
>> 
>> To both of the above: Yeah I'm not saying you need to do the work, just
>> that I think it would be a useful case to bikeshed now since it seems
>> inevitable that we'll get a "find hooks in this dir by glob" once we
>> have this facility. So having a config syntax for that which isn't
>> overly confusing / extensible to that case would be useful, i.e. as the
>> current syntax uses "dir" already.
>
> Yeah. I'm not sure that it needs to happen right away. Because
> hook.*.command // hookcommand.*.command gets passed right into
> run_command()-with-shell, it's possible for a user who's keen to also
> set `hook.*.command = find -type f /some/path | xargs` in the meantime.
> And also because it's passed right into run_command()-with-shell, it's
> hard to do some smart wildcarding on the .command config and try to
> figure out the right syntax. I'd just as soon see something explicit
> like the configs I mentioned above, which can be added pretty easily
> after the fact. I think what you're mostly saying, though, is "Leave
> some words for glob execution!" and that I can appreciate.

Yeah, or rather, just now in config key naming think about if the key
naming makes sense if it's expanded to support such glob inclusion,
which seems like a desired addition. But I won't belabor that point.

Just one thing to add: We don't really need to come up with a syntax &
semantics for glob inclusion special to this, we'd use the sort of glob
patterns "Conditional includes" use, as documented in  git-config(1).

>> > Hum. This seems to say "folks who started their hooks with the same
>> > number agree that their hooks should also run simultaneously" - which
>> > sounds like an even harder problem than "how do I know my ordering
>> > number isn't the same as someone else's in another config file". Or else
>> > I'm misunderstanding your pseudo :)
>> 
>> The prefix number isn't meaningful in that way, i.e. if you have 10
>> threads and 5 hooks starting with 250-* they won't all be invoked at the
>> same time.
>
> Ok. I misunderstood, then.
>
>> > I know I rambled a lot - I was trying to convince myself :) For now, I'd
>> > prefer to add more detail to the "future work" section of the doc and
>> > then not touch this problem with a very long pole... ;) Thoughts
>> > welcome.
>> 
>> I'm replying to much of the above in general here, particularly since
>> much of it was in the form of a question you answered yourself later :)
>> 
>> Yes as you point out the reason I'm raising the parallel thing now is
>> "keep users from assuming serial execution", i.e. any implementation
>> that isn't like that from day 1 will need more verbose syntax to opt-in
>> to that.
>> 
>> I think parallel is the sane default, although there's a really strong
>> case as you point out with the "commit-msg" hook for treating that on a
>> hook-type basis. E.g. commit-msg (in-place editing of as single file)
>> being non-parallel by default, but e.g. post-commit, pre-applypatch,
>> pre-receive and other "should we proceed?" hooks being parallel.
>
> Yeah. I think you've sold me. So what I will do is thus: before I send
> the next reroll (as I'm pretty much done, locally, and hope to be ready
> for nits next time) I'll take a look in 'git help githooks' and see
> which ones expect writes to occur. I think there are more than just
> "commit-msg". I'll add a bit to run_hooks() and a corresponding flag to
> 'git hook run', plus relevant documentation. I'll also plan to add
> explicit documentation to 'git help githooks' mentioning parallel vs.
> serial execution.

Sounds good.

> But I will plan on writing it stupidly - user configurable job number
> but no dependency checking; and let the user turn off parallel execution
> for everyone (hook.jobs=1) or for just one hook
> (hook.pre-commit.parallel = false (?)). Like you and Jonathan N say, we
> can add more sugar like hookcmd.*.depends later on when we need it.

Yeah, that sounds great. As long as there's parallelism that stuff can
always be tweaked later.

>> 
>> But I'm also raising a general concern with the design of the API /
>> command around this.
>> 
>> I don't see the need for having a git hook list/edit/add command at
>> all. We should just keep this simpler and be able to point to "git
>> config --add/--get-regexp" etc.
>> 
>> It seems the reason to introduce this command API around it is because
>> you're imagining that git needs to manage hooks whose relative execution
>> order is important, and to later on once this lands aim to implement a
>> much more complex dependency management schema.
>
> No, I don't think that's the reason to have list/edit/add. The reason is
> more for discoverability (if I 'git help git' or 'git^TAB', do I see
> something handy in the command list that I didn't know about before?)
> and user friendliness ("I can't remember the right config options to set
> this up every dang time"). And 'list', I think, is handy for giving
> users a dry run of what they can expect to see happen (and where to fix
> them, since it lists the origin). Yes, a user could put it all together
> from invocations of 'git config', but I personally think it's more
> useful for Git to tell me what Git is going to do/what Git wants than
> for my meat brain to try and guess :)

Okey, that makes sense & I've got nothing against that, just clarifying
since it *looked* like it was the first step in some future addition of
complexity around this.

It would be nice if the docs for the new command were modified to state
that clearly, even to the point of saying "this is really just sugar for
this similar git-config invocation".

>> 
>> I just can't imagine a case that needs that where say those 10 hooks
>> need to execute in exact order 1/2/3/4 where the author of that tight
>> coupling wouldn't also desire to roll that all into one script, or at
>> least that it's an obscure enough case that we can just say "do that".
>> 
>> Whereas I do think "run a bunch of independent checks, if all pass
>> proceed" is *the* common case, e.g. adding a bunch of pre-receive
>> hooks. If we tell the user we'll treat those as independent programs we
>> can run them in parallel. The vast majority of users will benefit from
>> the default faster execution.
>> 
>> The "glob order" case I mentioned is extra complexity on top of that,
>> yes, but I think that concession is sane for the common case of "yes
>> parallel, but I want to always run the always-exit-0 log
>> hook". E.g. I've used this to setup a hook to run push
>> attempts/successes in a hook framework that runs N pre-receive hooks.
>
> Reading this, I think I'm still missing something key about what you
> think glob ordering provides. 

For context, I feel strongly that we should do parallel by default for
implementing something like this, it's great that per the above
discussion you're open to that.

This "glob ordering" is an entirely separate idea I'm not strongly
advocating, there's pros & cons of doing that v.s. config ordering.

 * Con: less obvious than config order, you write hooks "a c b" in the
   config and we execute in "a b c" order.

 * Pro: Sidesteps the issues you noted in "Execution ordering" in the
   docs you're adding, i.e. now it'll be impossible to execute a
   repo-local hook before a system-wide one, you can override that with
   having a local one called "000-something".

   I.e. now we'd read the config in the normal config order, and thus if
   there's a system hook there's no way to define a local hook to run
   first, until we get some sort of override for that.

> I'm not following why having the log hook set early requires glob
> ordering over config ordering (since the config ordering schema allows
> reordering via replacement)
> [...]
>  and I'm not following why it's required to halt on failure.

I realize I didn't elaborate on this, there's some past discussion[1][2]
about this. 

I.e. when running N hooks sometimes you'd want to run them all (e.g. to
send notifications), but for others such as pre-receive.d guard checks
you don't have to run all N, if one check (say one checks commit format
validity, another code syntax) fails you'd like to abort early.

So halting on failure is just saving CPU, you might have 10 hooks that
each take 1 second, no point in making the user wait on all 10 checks
for 10 seconds if a failure of any fails the push.

But OTOH you have other use-cases where users want to run them all
(talked about in the [1][2] discussion above), so it's been anticipated
as something we'd grow config for with multi-hook support.

The glob ordering allows common cases for things that aren't possible
with config-order with such early abort.

E.g. consider a server with some common system-wide pre-receive.d hook
(e.g. author e-mail envelope check), and a SOX/PCI controlled repository
where some compliance thing says all push attempts must be logged.

You could then do:

    /etc/git/hooks/pre-receive.d/email-check
    /path/to/repo/hooks/pre-receive.d/000-log-push-attempt-to-db
    /path/to/repo/hooks/pre-receive.d/some-other-check

And we'd always run the 000-* hook first, whereas in the current schema
you can't do that without editing the system-wide config.

>> 
>> All that being said I'm open to being convinced, I just don't see what
>> the target user is, and the submitted docs don't really make a case for
>> it. I.e. there's plenty of "what" not "why would someone want this...".
>
> ACK. I'll try and go over the doc again before I reroll.
>
>  - Emily

1. https://lore.kernel.org/git/87wojjsv9p.fsf@evledraar.gmail.com/
2. https://public-inbox.org/git/CACBZZX6j6q2DUN_Z-Pnent1u714dVNPFBrL_PiEQyLmCzLUVxg@mail.gmail.com/

[v4,1/9] doc: propose hooks managed by the config

Commit Message

Comments

Patch