diff mbox series

doc: propose hooks managed by the config

Message ID 20200420235310.94493-1-emilyshaffer@google.com (mailing list archive)
State New, archived
Headers show
Series doc: propose hooks managed by the config | expand

Commit Message

Emily Shaffer April 20, 2020, 11:53 p.m. UTC
Begin a design document for config-based hooks, managed via git-hook.
Focus on an overview of the implementation and motivation for design
decisions. Briefly discuss the alternatives considered before this
point. Also, attempt to redefine terms to fit into a multihook world.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
Hi all,

I wasn't sure whether it made more sense to leave the design doc in the
conversation or not, but I figured it fit well into the context. I tried
to also add relevant IDs to the "References" headers to this mail.

Hopefully this is complete enough that we can discuss it directly until
we feel comfortable getting ready for implementation. I'm planning to
send a reply today with some comments, too.

 - Emily

 Documentation/Makefile                        |   1 +
 .../technical/config-based-hooks.txt          | 317 ++++++++++++++++++
 2 files changed, 318 insertions(+)
 create mode 100644 Documentation/technical/config-based-hooks.txt

Comments

Emily Shaffer April 21, 2020, 12:22 a.m. UTC | #1
On Mon, Apr 20, 2020 at 04:53:10PM -0700, Emily Shaffer wrote:
> 
> Begin a design document for config-based hooks, managed via git-hook.
> Focus on an overview of the implementation and motivation for design
> decisions. Briefly discuss the alternatives considered before this
> point. Also, attempt to redefine terms to fit into a multihook world.
> 
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
> Hi all,
> 
> I wasn't sure whether it made more sense to leave the design doc in the
> conversation or not, but I figured it fit well into the context. I tried
> to also add relevant IDs to the "References" headers to this mail.
> 
> Hopefully this is complete enough that we can discuss it directly until
> we feel comfortable getting ready for implementation. I'm planning to
> send a reply today with some comments, too.
> 
>  - Emily
> 
>  Documentation/Makefile                        |   1 +
>  .../technical/config-based-hooks.txt          | 317 ++++++++++++++++++
>  2 files changed, 318 insertions(+)
>  create mode 100644 Documentation/technical/config-based-hooks.txt
> 
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 8fe829cc1b..301111f236 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -79,6 +79,7 @@ SP_ARTICLES += $(API_DOCS)
>  TECH_DOCS += MyFirstContribution
>  TECH_DOCS += MyFirstObjectWalk
>  TECH_DOCS += SubmittingPatches
> +TECH_DOCS += technical/config-based-hooks
>  TECH_DOCS += technical/hash-function-transition
>  TECH_DOCS += technical/http-protocol
>  TECH_DOCS += technical/index-format
> diff --git a/Documentation/technical/config-based-hooks.txt b/Documentation/technical/config-based-hooks.txt
> new file mode 100644
> index 0000000000..38893423be
> --- /dev/null
> +++ b/Documentation/technical/config-based-hooks.txt
> @@ -0,0 +1,317 @@
> +Configuration-based hook management
> +===================================
> +
> +== Motivation
> +
> +Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as
> +the only source of hooks to execute, in a way which is friendly to users with
> +multiple repos which have similar needs.
> +
> +Redefine "hook" as an event rather than a single script, allowing users to
> +perform unrelated actions on a single event.
> +
> +Take a step closer to safety when copying zipped Git repositories from untrusted
> +users.
> +
> +Make it easier for users to discover Git's hook feature and automate their
> +workflows.
> +
> +== User interfaces
> +
> +=== Config schema
> +
> +Hooks can be introduced by editing the configuration manually. There are two new
> +sections added, `hook` and `hookcmd`.
> +
> +==== `hook`
> +
> +Primarily contains subsections for each hook event. These subsections define
> +hook command execution order; hook commands can be specified by passing the
> +command directly if no additional configuration is needed, or by passing the
> +name of a `hookcmd`. If Git does not find a `hookcmd` whose subsection matches
> +the value of the given command string, Git will try to execute the string
> +directly. Hook event subsections can also contain per-hook-event settings.
> +
> +Also contains top-level hook execution settings, for example,
> +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`.
> +
> +----
> +[hook "pre-commit"]
> +  command = perl-linter
> +  command = /usr/bin/git-secrets --pre-commit
> +
> +[hook "pre-applypatch"]
> +  command = perl-linter
> +  error = ignore
> +
> +[hook]
> +  warnHookDir = true
> +  runHookDir = prompt

whoops, just realized this doesn't match the proposal below. Wrote these
on different days :)

> +----
> +
> +==== `hookcmd`
> +
> +Defines a hook command and its attributes, which will be used when a hook event
> +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> +events, but event-specific attributes can also be supplied. The example runs
> +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> +include this config, the hook command will be skipped for all events to which
> +it's normally subscribed _except_ `pre-commit`.
> +
> +----
> +[hookcmd "perl-linter"]
> +  command = /usr/bin/lint-it --language=perl
> +  skip = true
> +  pre-commit-skip = false
> +----
> +
> +=== Command-line API
> +
> +Users should be able to view, reorder, and create hook commands via the command
> +line. External tools should be able to view a list of hooks in the correct order
> +to run.
> +
> +*`git hook list <hook-event>`*
> +
> +*`git hook list (--system|--global|--local|--worktree)`*
> +
> +*`git hook edit <hook-event>`*
> +
> +*`git hook add <hook-command> <hook-event> <options...>`*
> +
> +=== Hook editor
> +
> +The tool which is presented by `git hook edit <hook-command>`. Ideally, this
> +tool should be easier to use than manually editing the config, and then produce
> +a concise config afterwards. It may take a form similar to `git rebase
> +--interactive`.

This section is a little thin because I'm hoping to meet with some UX
folks on our end and get a better suggestion. Suggestions welcome from
upstream too - I'm having trouble coming up with anything that's better
than modifying the config directly.

> +
> +== Implementation
> +
> +=== Library
> +
> +`hook.c` and `hook.h` are responsible for interacting with the config files. In
> +the case when the code generating a hook event doesn't have special concerns
> +about how to run the hooks, the hook library will provide a basic API to call
> +all hooks in config order with an `argv_array` provided by the code which
> +generates the hook event:
> +
> +*`int run_hooks(const char *hookname, struct argv_array *args)`*
> +
> +This call includes the hook command provided by `run-command.h:find_hook()`;
> +eventually, this legacy hook will be gated by a config `hook.runHookDir`. The
> +config is checked against a number of cases:
> +
> +- "no": the legacy hook will not be run
> +- "interactive": Git will prompt the user before running the legacy hook
> +- "warn": Git will print a warning to stderr before running the legacy hook
> +- "yes" (default): Git will silently run the legacy hook
> +
> +If `hook.runHookDir` is provided more than once, Git will use the most
> +restrictive setting provided, for security reasons.
> +
> +If the caller wants to do something more complicated, the hook library can also
> +provide a callback API:
> +
> +*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`*

Another alternative is to do this by providing a linked-list of
structs or even just an ordered string_list; that means the caller
becomes responsible for config syntax and parallelization, which I
didn't want. I'm open to hearing more argument. (on the rest of the doc
too... but also here. :) )

> +
> +Finally, to facilitate the builtin, the library will also provide the following
> +APIs to interact with the config:
> +
> +----
> +int set_hook_commands(const char *hookname, struct string_list *commands,
> +	enum config_scope scope);
> +int set_hookcmd(const char *hookcmd, struct hookcmd options);
> +
> +int list_hook_commands(const char *hookname, struct string_list *commands);
> +int list_hooks_in_scope(enum config_scope scope, struct string_list *commands);
> +----
> +
> +`struct hookcmd` is expected to grow in size over time as more functionality is
> +added to hooks; so that other parts of the code don't need to understand the
> +config schema, `struct hookcmd` should contain logical values instead of string
> +pairs.
> +
> +----
> +struct hookcmd {
> +  const char *name;
> +  const char *command;
> +
> +  /* for illustration only; not planned at present */
> +  int parallelizable;
> +  const char *hookcmd_before;
> +  const char *hookcmd_after;
> +  enum recovery_action on_fail;
> +}
> +----
> +
> +=== Builtin
> +
> +`builtin/hook.c` is responsible for providing the frontend. It's responsible for
> +formatting user-provided data and then calling the library API to set the
> +configs as appropriate. The builtin frontend is not responsible for calling the
> +config directly, so that other areas of Git can rely on the hook library to
> +understand the most recent config schema for hooks.
> +
> +=== Migration path
> +
> +==== Stage 0
> +
> +Hooks are called by running `run-command.h:find_hook()` with the hookname and
> +executing the result. The hook library and builtin do not exist. Hooks only
> +exist as specially named scripts within `.git/hooks/`.
> +
> +==== Stage 1
> +
> +`git hook list --porcelain <hook-event>` is implemented. Users can replace their
> +`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s
> +output. Modifier commands like `git hook add` and `git hook edit` can be
> +implemented around this time as well.
> +
> +==== Stage 2
> +
> +`hook.h:run_hooks()` is taught to include `run-command.h:find_hook()` at the
> +end; calls to `find_hook()` are replaced with calls to `run_hooks()`. Users can
> +opt-in to config-based hooks simply by creating some in their config; otherwise
> +users should remain unaffected by the change.
> +
> +==== Stage 3
> +
> +The call to `find_hook()` inside of `run_hooks()` learns to check for a config,
> +`hook.runHookDir`. Users can opt into managing their hooks completely via the
> +config this way.
> +
> +==== Stage 4
> +
> +`.git/hooks` is removed from the template and the hook directory is considered
> +deprecated. To avoid breaking older repos, the default of `hook.runHookDir` is
> +not changed, and `find_hook()` is not removed.
> +
> +== Caveats
> +
> +=== Security and repo config
> +
> +Part of the motivation behind this refactor is to mitigate hooks as an attack
> +vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/]
> +however, as the design stands, users can still provide hooks in the repo-level
> +config, which is included when a repo is zipped and sent elsewhere.  The
> +security of the repo-level config is still under discussion; this design
> +generally assumes the repo-level config is secure, which is not true yet. The
> +goal is to avoid an overcomplicated design to work around a problem which has
> +ceased to exist.
> +
> +=== Ease of use
> +
> +The config schema is nontrivial; that's why it's important for the `git hook`
> +modifier commands to be usable. Contributors with UX expertise are encouraged to
> +share their suggestions.
> +
> +== Alternative approaches
> +
> +A previous summary of alternatives exists in the
> +archives.footnote:[https://lore.kernel.org/git/20191116011125.GG22855@google.com]
> +
> +=== Status quo
> +
> +Today users can implement multihooks themselves by using a "trampoline script"
> +as their hook, and pointing that script to a directory or list of other scripts
> +they wish to run.
> +
> +=== Hook directories
> +
> +Other contributors have suggested Git learn about the existence of a directory
> +such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order.
> +
> +=== Comparison table
> +
> +.Comparison of alternatives
> +|===
> +|Feature |Config-based hooks |Hook directories |Status quo
> +
> +|Supports multiple hooks
> +|Natively
> +|Natively
> +|With user effort
> +
> +|Safer for zipped repos
> +|A little
> +|No
> +|No
> +
> +|Previous hooks just work
> +|If configured
> +|Yes
> +|Yes
> +
> +|Can install one hook to many repos
> +|Yes
> +|No
> +|No
> +
> +|Discoverability
> +|Better (in `git help git`)
> +|Same as before
> +|Same as before
> +
> +|Hard to run unexpected hook
> +|If configured
> +|No
> +|No
> +|===

Please share more features that come to your mind; I took most of this
list from the RFC I sent last fall:
https://lore.kernel.org/git/20191116011125.GG22855@google.com

> +
> +== Future work
> +
> +=== Execution ordering
> +
> +We may find that config order is insufficient for some users; for example,
> +config order makes it difficult to add a new hook to the system or global config
> +which runs at the end of the hook list. A new ordering schema should be:
> +
> +1) Specified by a `hook.order` config, so that users will not unexpectedly see
> +their order change;
> +
> +2) Either dependency or numerically based.
> +
> +Dependency-based ordering is prone to classic linked-list problems, like a
> +cycles and handling of missing dependencies. But, it paves the way for enabling
> +parallelization if some tasks truly depend on others.
> +
> +Numerical ordering makes it tricky for Git to generate suggested ordering
> +numbers for each command, but is easy to determine a definitive order.
> +
> +=== Parallelization
> +
> +Users with many hooks might want to run them simultaneously, if the hooks don't
> +modify state; if one hook depends on another's output, then users will want to
> +specify those dependencies. If we decide to solve this problem, we may want to
> +look to modern build systems for inspiration on how to manage dependencies and
> +parallel tasks.
> +
> +=== Securing hookdir hooks
> +
> +With the design as written in this doc, it's still possible for a malicious user
> +to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then
> +zip their repo and send it to another user. It may be necessary to teach Git to
> +only allow one-line hooks like this if they were configured outside of the local
> +scope; or another approach, like a list of safe projects, might be useful. It
> +may also be sufficient (or at least useful) to teach a `hook.disableAll` config
> +or similar flag to the Git executable.
> +
> +=== Submodule inheritance
> +
> +It's possible some submodules may want to run the identical set of hooks that
> +their superrepo runs. While a globally-configured hook set is helpful, it's not
> +a great solution for users who have multiple repos-with-submodules under the
> +same user. It would be useful for submodules to learn how to run hooks from
> +their superrepo's config, or inherit that hook setting.
> +
> +== Glossary
> +
> +*hook event*
> +
> +A point during Git's execution where user scripts may be run, for example,
> +_prepare-commit-msg_ or _pre-push_.
> +
> +*hook command*
> +
> +A user script or executable which will be run on one or more hook events.

If other terms in the design doc are surprising to you, let me know and
I'll define them here too.

> -- 
> 2.26.1.301.g55bc3eb7cb9-goog
>
Junio C Hamano April 21, 2020, 1:20 a.m. UTC | #2
Emily Shaffer <emilyshaffer@google.com> writes:

> Whoops, just realized this doesn't match the proposal below. Wrote these
> on different days :)

It often is a good idea to attempt writing anything in one sitting
for coherency, and proofread the result on a separate day before
sending it out ;-)
Emily Shaffer April 24, 2020, 11:14 p.m. UTC | #3
On Mon, Apr 20, 2020 at 06:20:00PM -0700, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@google.com> writes:
> 
> > Whoops, just realized this doesn't match the proposal below. Wrote these
> > on different days :)
> 
> It often is a good idea to attempt writing anything in one sitting
> for coherency, and proofread the result on a separate day before
> sending it out ;-)

Agreed for next time :)

I didn't make it very clear in my initial comment that the only problem
here is the code snippets and the difference is very minor - I don't
think it's worth a reroll on its own without hearing feedback about the
rest. Or, to put it another way, if any interested reader said "I'll
wait to review" - don't ;)

 - Emily
brian m. carlson April 25, 2020, 8:57 p.m. UTC | #4
On 2020-04-20 at 23:53:10, Emily Shaffer wrote:
> +=== Config schema
> +
> +Hooks can be introduced by editing the configuration manually. There are two new
> +sections added, `hook` and `hookcmd`.
> +
> +==== `hook`
> +
> +Primarily contains subsections for each hook event. These subsections define
> +hook command execution order; hook commands can be specified by passing the
> +command directly if no additional configuration is needed, or by passing the
> +name of a `hookcmd`. If Git does not find a `hookcmd` whose subsection matches
> +the value of the given command string, Git will try to execute the string
> +directly. Hook event subsections can also contain per-hook-event settings.

Can we say explicitly that the commands are invoked by the shell?  Or is
the plan to try to parse them without passing to the shell?

> +Also contains top-level hook execution settings, for example,
> +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`.
> +
> +----
> +[hook "pre-commit"]
> +  command = perl-linter
> +  command = /usr/bin/git-secrets --pre-commit
> +
> +[hook "pre-applypatch"]
> +  command = perl-linter
> +  error = ignore
> +
> +[hook]
> +  warnHookDir = true
> +  runHookDir = prompt
> +----
> +
> +==== `hookcmd`
> +
> +Defines a hook command and its attributes, which will be used when a hook event
> +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> +events, but event-specific attributes can also be supplied. The example runs
> +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> +include this config, the hook command will be skipped for all events to which
> +it's normally subscribed _except_ `pre-commit`.
> +
> +----
> +[hookcmd "perl-linter"]
> +  command = /usr/bin/lint-it --language=perl
> +  skip = true
> +  pre-commit-skip = false
> +----

This seems fine to me.  I like this design and it seems sane.

> +== Implementation
> +
> +=== Library
> +
> +`hook.c` and `hook.h` are responsible for interacting with the config files. In
> +the case when the code generating a hook event doesn't have special concerns
> +about how to run the hooks, the hook library will provide a basic API to call
> +all hooks in config order with an `argv_array` provided by the code which
> +generates the hook event:
> +
> +*`int run_hooks(const char *hookname, struct argv_array *args)`*
> +
> +This call includes the hook command provided by `run-command.h:find_hook()`;
> +eventually, this legacy hook will be gated by a config `hook.runHookDir`. The
> +config is checked against a number of cases:
> +
> +- "no": the legacy hook will not be run
> +- "interactive": Git will prompt the user before running the legacy hook
> +- "warn": Git will print a warning to stderr before running the legacy hook
> +- "yes" (default): Git will silently run the legacy hook
> +
> +If `hook.runHookDir` is provided more than once, Git will use the most
> +restrictive setting provided, for security reasons.

I don't think this is consistent with the way the rest of our options
work.  What if someone generally wants to disable legacy hooks but then
works with a program in a repository that requires them?

> +== Caveats
> +
> +=== Security and repo config
> +
> +Part of the motivation behind this refactor is to mitigate hooks as an attack
> +vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/]
> +however, as the design stands, users can still provide hooks in the repo-level
> +config, which is included when a repo is zipped and sent elsewhere.  The
> +security of the repo-level config is still under discussion; this design
> +generally assumes the repo-level config is secure, which is not true yet. The
> +goal is to avoid an overcomplicated design to work around a problem which has
> +ceased to exist.

I want to be clear that I'm very much opposed to trying to "secure" the
config as a whole.  I believe that it's going to ultimately lead to a
variety of new and interesting attack vectors and will lead to Git
becoming a CVE factory.  Vim has this problem with modelines, for
example.

I think we should maintain the status quo that the only safe things you
can do with an untrusted repository are clone and fetch because it sets
a clear security boundary.

Having said that, I'm otherwise pretty happy with this design and I'm
looking forward to seeing it implemented.
Emily Shaffer May 6, 2020, 9:33 p.m. UTC | #5
On Sat, Apr 25, 2020 at 08:57:27PM +0000, brian m. carlson wrote:
> 
> On 2020-04-20 at 23:53:10, Emily Shaffer wrote:
> > +=== Config schema
> > +
> > +Hooks can be introduced by editing the configuration manually. There are two new
> > +sections added, `hook` and `hookcmd`.
> > +
> > +==== `hook`
> > +
> > +Primarily contains subsections for each hook event. These subsections define
> > +hook command execution order; hook commands can be specified by passing the
> > +command directly if no additional configuration is needed, or by passing the
> > +name of a `hookcmd`. If Git does not find a `hookcmd` whose subsection matches
> > +the value of the given command string, Git will try to execute the string
> > +directly. Hook event subsections can also contain per-hook-event settings.
> 
> Can we say explicitly that the commands are invoked by the shell?  Or is
> the plan to try to parse them without passing to the shell?

Sure. If I didn't make it clear it was by mistake, not by intent.

> 
> > +Also contains top-level hook execution settings, for example,
> > +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`.
> > +
> > +----
> > +[hook "pre-commit"]
> > +  command = perl-linter
> > +  command = /usr/bin/git-secrets --pre-commit
> > +
> > +[hook "pre-applypatch"]
> > +  command = perl-linter
> > +  error = ignore
> > +
> > +[hook]
> > +  warnHookDir = true
> > +  runHookDir = prompt
> > +----
> > +
> > +==== `hookcmd`
> > +
> > +Defines a hook command and its attributes, which will be used when a hook event
> > +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> > +events, but event-specific attributes can also be supplied. The example runs
> > +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> > +include this config, the hook command will be skipped for all events to which
> > +it's normally subscribed _except_ `pre-commit`.
> > +
> > +----
> > +[hookcmd "perl-linter"]
> > +  command = /usr/bin/lint-it --language=perl
> > +  skip = true
> > +  pre-commit-skip = false
> > +----
> 
> This seems fine to me.  I like this design and it seems sane.
> 
> > +== Implementation
> > +
> > +=== Library
> > +
> > +`hook.c` and `hook.h` are responsible for interacting with the config files. In
> > +the case when the code generating a hook event doesn't have special concerns
> > +about how to run the hooks, the hook library will provide a basic API to call
> > +all hooks in config order with an `argv_array` provided by the code which
> > +generates the hook event:
> > +
> > +*`int run_hooks(const char *hookname, struct argv_array *args)`*
> > +
> > +This call includes the hook command provided by `run-command.h:find_hook()`;
> > +eventually, this legacy hook will be gated by a config `hook.runHookDir`. The
> > +config is checked against a number of cases:
> > +
> > +- "no": the legacy hook will not be run
> > +- "interactive": Git will prompt the user before running the legacy hook
> > +- "warn": Git will print a warning to stderr before running the legacy hook
> > +- "yes" (default): Git will silently run the legacy hook
> > +
> > +If `hook.runHookDir` is provided more than once, Git will use the most
> > +restrictive setting provided, for security reasons.
> 
> I don't think this is consistent with the way the rest of our options
> work.  What if someone generally wants to disable legacy hooks but then
> works with a program in a repository that requires them?

Unfortunately this is something I think my end will want to hold firm
on. In general we disagree with your statement later about not wanting
to make the .git/config secure. I see your use case, and I anticipate
two possible workarounds I'd present:

1) If working in that repo for the short term, run `git -c
hook.runHookDir=yes <command> <arg...>` (and therefore allow the config
from command line scope, which I'm happy with in general). Maybe
someone would want to use an alias, hookgit or hg? Just kidding.. ;P

2) If you're stuck with that repo for the long term, add
`hook.<hookname>.command = /path/.git/hooks/<hookname>` lines to the local
config.

Yes, those are both somewhat user-unfriendly, and I think we can do
better... I'll have to think more and see what I can come up with.
Suggestions welcome.

> 
> > +== Caveats
> > +
> > +=== Security and repo config
> > +
> > +Part of the motivation behind this refactor is to mitigate hooks as an attack
> > +vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/]
> > +however, as the design stands, users can still provide hooks in the repo-level
> > +config, which is included when a repo is zipped and sent elsewhere.  The
> > +security of the repo-level config is still under discussion; this design
> > +generally assumes the repo-level config is secure, which is not true yet. The
> > +goal is to avoid an overcomplicated design to work around a problem which has
> > +ceased to exist.
> 
> I want to be clear that I'm very much opposed to trying to "secure" the
> config as a whole.  I believe that it's going to ultimately lead to a
> variety of new and interesting attack vectors and will lead to Git
> becoming a CVE factory.  Vim has this problem with modelines, for
> example.

I'm really interested to hear more - it seems like security and config
efforts will end up on my plate before the end of the year, so I'd like
to know what is on your mind.

> 
> I think we should maintain the status quo that the only safe things you
> can do with an untrusted repository are clone and fetch because it sets
> a clear security boundary.

I wish there was a way to make that more apparent. The trouble is that
while you and I and the sysadmin know the dangers, the high schooler
making a website might not. Talking about how to warn users is
definitely out-of-scope for this conversation, but it's on my mind.

> 
> Having said that, I'm otherwise pretty happy with this design and I'm
> looking forward to seeing it implemented.

Thanks very much for the feedback and for reading it through! :)

 - Emily
brian m. carlson May 6, 2020, 11:13 p.m. UTC | #6
On 2020-05-06 at 21:33:54, Emily Shaffer wrote:
> On Sat, Apr 25, 2020 at 08:57:27PM +0000, brian m. carlson wrote:
> > 
> > On 2020-04-20 at 23:53:10, Emily Shaffer wrote:
> > > +== Caveats
> > > +
> > > +=== Security and repo config
> > > +
> > > +Part of the motivation behind this refactor is to mitigate hooks as an attack
> > > +vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/]
> > > +however, as the design stands, users can still provide hooks in the repo-level
> > > +config, which is included when a repo is zipped and sent elsewhere.  The
> > > +security of the repo-level config is still under discussion; this design
> > > +generally assumes the repo-level config is secure, which is not true yet. The
> > > +goal is to avoid an overcomplicated design to work around a problem which has
> > > +ceased to exist.
> > 
> > I want to be clear that I'm very much opposed to trying to "secure" the
> > config as a whole.  I believe that it's going to ultimately lead to a
> > variety of new and interesting attack vectors and will lead to Git
> > becoming a CVE factory.  Vim has this problem with modelines, for
> > example.
> 
> I'm really interested to hear more - it seems like security and config
> efforts will end up on my plate before the end of the year, so I'd like
> to know what is on your mind.

In general, having untrusted configuration is enormously difficult and
is typically only possible as a designed-in feature with extremely
limited options.  We have not designed that feature in from the
beginning and our config parsing is far too ad-hoc to support any
reasonable security posture.  We've also written a program entirely in
C, which has all of the fun memory safety problems.

If we try to secure the config and allow people to use untrusted
repositories securely, we've changed the security posture of the project
very significantly.  The number of keys we can safely trust come down to
probably core.repositoryformatversion and extensions.objectformat, and
I'm not even sure that the latter can be trusted because there are all
sorts of fun behaviors one can produce by setting the wrong hash
algorithm.

That's just one example of a potential source of security problems, but
I anticipate people can use other options as well.  Setting the rename
limit can be a DoS.  Changing the colors of diff or log output could be
used to hide malicious code from inspection.  We obviously can't trust
anything containing a URL, since an attacker could try to make "git pull
origin" point to their server instead, which means having remotes is out
of the question.  Most of our recent security issues have involved the
.gitmodules file, which, despite being extremely limited, is indeed an
untrusted config file.

The scope of potential vulnerabilities explodes as you allow users to
have untrusted config.  I don't think there's any reasonable set of
useful configuration we can have on a per-repo basis that doesn't open
us up to a whole set of security vulnerabilities.  It seems to me that
we're setting ourselves up to either have a feature so limited nobody
uses it or a massive, never-ending set of CVEs as everybody finds new
ways to attack things.  I just don't think promising that feature to
users is honest because I don't think we can practically achieve it in
Git.  Most projects don't even try it as an option.

On the other hand, what we promise now, which is to restrict untrusted
repositories to cloning and fetching, while surprising to users,
dramatically reduces the scope because it's basically what we promise
over the network.  The interface is highly restricted, well known, and
reasonably secure.  We've also limited attack surface to a much smaller
number of binaries.

So while I think the intention is good and the idea, if implementable,
would be beneficial to users, I think it's practically going to be
unachievable.
Emily Shaffer May 19, 2020, 8:10 p.m. UTC | #7
On Wed, May 06, 2020 at 02:33:54PM -0700, Emily Shaffer wrote:
> 
> On Sat, Apr 25, 2020 at 08:57:27PM +0000, brian m. carlson wrote:
> > 
> > On 2020-04-20 at 23:53:10, Emily Shaffer wrote:
> > > +=== Config schema
> > > +
> > > +Hooks can be introduced by editing the configuration manually. There are two new
> > > +sections added, `hook` and `hookcmd`.
> > > +
> > > +==== `hook`
> > > +
> > > +Primarily contains subsections for each hook event. These subsections define
> > > +hook command execution order; hook commands can be specified by passing the
> > > +command directly if no additional configuration is needed, or by passing the
> > > +name of a `hookcmd`. If Git does not find a `hookcmd` whose subsection matches
> > > +the value of the given command string, Git will try to execute the string
> > > +directly. Hook event subsections can also contain per-hook-event settings.
> > 
> > Can we say explicitly that the commands are invoked by the shell?  Or is
> > the plan to try to parse them without passing to the shell?
> 
> Sure. If I didn't make it clear it was by mistake, not by intent.
> 
> > 
> > > +Also contains top-level hook execution settings, for example,
> > > +`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`.
> > > +
> > > +----
> > > +[hook "pre-commit"]
> > > +  command = perl-linter
> > > +  command = /usr/bin/git-secrets --pre-commit
> > > +
> > > +[hook "pre-applypatch"]
> > > +  command = perl-linter
> > > +  error = ignore
> > > +
> > > +[hook]
> > > +  warnHookDir = true
> > > +  runHookDir = prompt
> > > +----
> > > +
> > > +==== `hookcmd`
> > > +
> > > +Defines a hook command and its attributes, which will be used when a hook event
> > > +occurs. Unqualified attributes are assumed to apply to this hook during all hook
> > > +events, but event-specific attributes can also be supplied. The example runs
> > > +`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
> > > +include this config, the hook command will be skipped for all events to which
> > > +it's normally subscribed _except_ `pre-commit`.
> > > +
> > > +----
> > > +[hookcmd "perl-linter"]
> > > +  command = /usr/bin/lint-it --language=perl
> > > +  skip = true
> > > +  pre-commit-skip = false
> > > +----
> > 
> > This seems fine to me.  I like this design and it seems sane.
> > 
> > > +== Implementation
> > > +
> > > +=== Library
> > > +
> > > +`hook.c` and `hook.h` are responsible for interacting with the config files. In
> > > +the case when the code generating a hook event doesn't have special concerns
> > > +about how to run the hooks, the hook library will provide a basic API to call
> > > +all hooks in config order with an `argv_array` provided by the code which
> > > +generates the hook event:
> > > +
> > > +*`int run_hooks(const char *hookname, struct argv_array *args)`*
> > > +
> > > +This call includes the hook command provided by `run-command.h:find_hook()`;
> > > +eventually, this legacy hook will be gated by a config `hook.runHookDir`. The
> > > +config is checked against a number of cases:
> > > +
> > > +- "no": the legacy hook will not be run
> > > +- "interactive": Git will prompt the user before running the legacy hook
> > > +- "warn": Git will print a warning to stderr before running the legacy hook
> > > +- "yes" (default): Git will silently run the legacy hook
> > > +
> > > +If `hook.runHookDir` is provided more than once, Git will use the most
> > > +restrictive setting provided, for security reasons.
> > 
> > I don't think this is consistent with the way the rest of our options
> > work.  What if someone generally wants to disable legacy hooks but then
> > works with a program in a repository that requires them?
> 
> Unfortunately this is something I think my end will want to hold firm
> on. In general we disagree with your statement later about not wanting
> to make the .git/config secure. I see your use case, and I anticipate
> two possible workarounds I'd present:
> 
> 1) If working in that repo for the short term, run `git -c
> hook.runHookDir=yes <command> <arg...>` (and therefore allow the config
> from command line scope, which I'm happy with in general). Maybe
> someone would want to use an alias, hookgit or hg? Just kidding.. ;P
> 
> 2) If you're stuck with that repo for the long term, add
> `hook.<hookname>.command = /path/.git/hooks/<hookname>` lines to the local
> config.
> 
> Yes, those are both somewhat user-unfriendly, and I think we can do
> better... I'll have to think more and see what I can come up with.
> Suggestions welcome.

I thought more about this and today I'm revisiting this work (and
starting on patches!) so I figured I'd close the loop, since it'll be
buried in the next round of the design doc.

Refusing to trust the local config is actually contrary to one of the
tenets I was trying to use when designing this - that we should assume
the .git/config is safe, so that we don't end up with bloat later if
.git/config does become safe. The suggestion I made here to disallow
overrides doesn't fit, so I'll drop it. The implementation will allow a
more local config to turn hookdir hooks back on.

Thanks, all. By way of status update, I think I'll be able to start
working on this more actively starting this week.

 - Emily
diff mbox series

Patch

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 8fe829cc1b..301111f236 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -79,6 +79,7 @@  SP_ARTICLES += $(API_DOCS)
 TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
+TECH_DOCS += technical/config-based-hooks
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/config-based-hooks.txt b/Documentation/technical/config-based-hooks.txt
new file mode 100644
index 0000000000..38893423be
--- /dev/null
+++ b/Documentation/technical/config-based-hooks.txt
@@ -0,0 +1,317 @@ 
+Configuration-based hook management
+===================================
+
+== Motivation
+
+Treat hooks as a first-class citizen by replacing the .git/hook/hookname path as
+the only source of hooks to execute, in a way which is friendly to users with
+multiple repos which have similar needs.
+
+Redefine "hook" as an event rather than a single script, allowing users to
+perform unrelated actions on a single event.
+
+Take a step closer to safety when copying zipped Git repositories from untrusted
+users.
+
+Make it easier for users to discover Git's hook feature and automate their
+workflows.
+
+== User interfaces
+
+=== Config schema
+
+Hooks can be introduced by editing the configuration manually. There are two new
+sections added, `hook` and `hookcmd`.
+
+==== `hook`
+
+Primarily contains subsections for each hook event. These subsections define
+hook command execution order; hook commands can be specified by passing the
+command directly if no additional configuration is needed, or by passing the
+name of a `hookcmd`. If Git does not find a `hookcmd` whose subsection matches
+the value of the given command string, Git will try to execute the string
+directly. Hook event subsections can also contain per-hook-event settings.
+
+Also contains top-level hook execution settings, for example,
+`hook.warnHookDir`, `hook.runHookDir`, or `hook.disableAll`.
+
+----
+[hook "pre-commit"]
+  command = perl-linter
+  command = /usr/bin/git-secrets --pre-commit
+
+[hook "pre-applypatch"]
+  command = perl-linter
+  error = ignore
+
+[hook]
+  warnHookDir = true
+  runHookDir = prompt
+----
+
+==== `hookcmd`
+
+Defines a hook command and its attributes, which will be used when a hook event
+occurs. Unqualified attributes are assumed to apply to this hook during all hook
+events, but event-specific attributes can also be supplied. The example runs
+`/usr/bin/lint-it --language=perl <args passed by Git>`, but for repos which
+include this config, the hook command will be skipped for all events to which
+it's normally subscribed _except_ `pre-commit`.
+
+----
+[hookcmd "perl-linter"]
+  command = /usr/bin/lint-it --language=perl
+  skip = true
+  pre-commit-skip = false
+----
+
+=== Command-line API
+
+Users should be able to view, reorder, and create hook commands via the command
+line. External tools should be able to view a list of hooks in the correct order
+to run.
+
+*`git hook list <hook-event>`*
+
+*`git hook list (--system|--global|--local|--worktree)`*
+
+*`git hook edit <hook-event>`*
+
+*`git hook add <hook-command> <hook-event> <options...>`*
+
+=== Hook editor
+
+The tool which is presented by `git hook edit <hook-command>`. Ideally, this
+tool should be easier to use than manually editing the config, and then produce
+a concise config afterwards. It may take a form similar to `git rebase
+--interactive`.
+
+== Implementation
+
+=== Library
+
+`hook.c` and `hook.h` are responsible for interacting with the config files. In
+the case when the code generating a hook event doesn't have special concerns
+about how to run the hooks, the hook library will provide a basic API to call
+all hooks in config order with an `argv_array` provided by the code which
+generates the hook event:
+
+*`int run_hooks(const char *hookname, struct argv_array *args)`*
+
+This call includes the hook command provided by `run-command.h:find_hook()`;
+eventually, this legacy hook will be gated by a config `hook.runHookDir`. The
+config is checked against a number of cases:
+
+- "no": the legacy hook will not be run
+- "interactive": Git will prompt the user before running the legacy hook
+- "warn": Git will print a warning to stderr before running the legacy hook
+- "yes" (default): Git will silently run the legacy hook
+
+If `hook.runHookDir` is provided more than once, Git will use the most
+restrictive setting provided, for security reasons.
+
+If the caller wants to do something more complicated, the hook library can also
+provide a callback API:
+
+*`int for_each_hookcmd(const char *hookname, hookcmd_function *cb)`*
+
+Finally, to facilitate the builtin, the library will also provide the following
+APIs to interact with the config:
+
+----
+int set_hook_commands(const char *hookname, struct string_list *commands,
+	enum config_scope scope);
+int set_hookcmd(const char *hookcmd, struct hookcmd options);
+
+int list_hook_commands(const char *hookname, struct string_list *commands);
+int list_hooks_in_scope(enum config_scope scope, struct string_list *commands);
+----
+
+`struct hookcmd` is expected to grow in size over time as more functionality is
+added to hooks; so that other parts of the code don't need to understand the
+config schema, `struct hookcmd` should contain logical values instead of string
+pairs.
+
+----
+struct hookcmd {
+  const char *name;
+  const char *command;
+
+  /* for illustration only; not planned at present */
+  int parallelizable;
+  const char *hookcmd_before;
+  const char *hookcmd_after;
+  enum recovery_action on_fail;
+}
+----
+
+=== Builtin
+
+`builtin/hook.c` is responsible for providing the frontend. It's responsible for
+formatting user-provided data and then calling the library API to set the
+configs as appropriate. The builtin frontend is not responsible for calling the
+config directly, so that other areas of Git can rely on the hook library to
+understand the most recent config schema for hooks.
+
+=== Migration path
+
+==== Stage 0
+
+Hooks are called by running `run-command.h:find_hook()` with the hookname and
+executing the result. The hook library and builtin do not exist. Hooks only
+exist as specially named scripts within `.git/hooks/`.
+
+==== Stage 1
+
+`git hook list --porcelain <hook-event>` is implemented. Users can replace their
+`.git/hooks/<hook-event>` scripts with a trampoline based on `git hook list`'s
+output. Modifier commands like `git hook add` and `git hook edit` can be
+implemented around this time as well.
+
+==== Stage 2
+
+`hook.h:run_hooks()` is taught to include `run-command.h:find_hook()` at the
+end; calls to `find_hook()` are replaced with calls to `run_hooks()`. Users can
+opt-in to config-based hooks simply by creating some in their config; otherwise
+users should remain unaffected by the change.
+
+==== Stage 3
+
+The call to `find_hook()` inside of `run_hooks()` learns to check for a config,
+`hook.runHookDir`. Users can opt into managing their hooks completely via the
+config this way.
+
+==== Stage 4
+
+`.git/hooks` is removed from the template and the hook directory is considered
+deprecated. To avoid breaking older repos, the default of `hook.runHookDir` is
+not changed, and `find_hook()` is not removed.
+
+== Caveats
+
+=== Security and repo config
+
+Part of the motivation behind this refactor is to mitigate hooks as an attack
+vector;footnote:[https://lore.kernel.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/]
+however, as the design stands, users can still provide hooks in the repo-level
+config, which is included when a repo is zipped and sent elsewhere.  The
+security of the repo-level config is still under discussion; this design
+generally assumes the repo-level config is secure, which is not true yet. The
+goal is to avoid an overcomplicated design to work around a problem which has
+ceased to exist.
+
+=== Ease of use
+
+The config schema is nontrivial; that's why it's important for the `git hook`
+modifier commands to be usable. Contributors with UX expertise are encouraged to
+share their suggestions.
+
+== Alternative approaches
+
+A previous summary of alternatives exists in the
+archives.footnote:[https://lore.kernel.org/git/20191116011125.GG22855@google.com]
+
+=== Status quo
+
+Today users can implement multihooks themselves by using a "trampoline script"
+as their hook, and pointing that script to a directory or list of other scripts
+they wish to run.
+
+=== Hook directories
+
+Other contributors have suggested Git learn about the existence of a directory
+such as `.git/hooks/<hookname>.d` and execute those hooks in alphabetical order.
+
+=== Comparison table
+
+.Comparison of alternatives
+|===
+|Feature |Config-based hooks |Hook directories |Status quo
+
+|Supports multiple hooks
+|Natively
+|Natively
+|With user effort
+
+|Safer for zipped repos
+|A little
+|No
+|No
+
+|Previous hooks just work
+|If configured
+|Yes
+|Yes
+
+|Can install one hook to many repos
+|Yes
+|No
+|No
+
+|Discoverability
+|Better (in `git help git`)
+|Same as before
+|Same as before
+
+|Hard to run unexpected hook
+|If configured
+|No
+|No
+|===
+
+== Future work
+
+=== Execution ordering
+
+We may find that config order is insufficient for some users; for example,
+config order makes it difficult to add a new hook to the system or global config
+which runs at the end of the hook list. A new ordering schema should be:
+
+1) Specified by a `hook.order` config, so that users will not unexpectedly see
+their order change;
+
+2) Either dependency or numerically based.
+
+Dependency-based ordering is prone to classic linked-list problems, like a
+cycles and handling of missing dependencies. But, it paves the way for enabling
+parallelization if some tasks truly depend on others.
+
+Numerical ordering makes it tricky for Git to generate suggested ordering
+numbers for each command, but is easy to determine a definitive order.
+
+=== Parallelization
+
+Users with many hooks might want to run them simultaneously, if the hooks don't
+modify state; if one hook depends on another's output, then users will want to
+specify those dependencies. If we decide to solve this problem, we may want to
+look to modern build systems for inspiration on how to manage dependencies and
+parallel tasks.
+
+=== Securing hookdir hooks
+
+With the design as written in this doc, it's still possible for a malicious user
+to modify `.git/config` to include `hook.pre-receive.command = rm -rf /`, then
+zip their repo and send it to another user. It may be necessary to teach Git to
+only allow one-line hooks like this if they were configured outside of the local
+scope; or another approach, like a list of safe projects, might be useful. It
+may also be sufficient (or at least useful) to teach a `hook.disableAll` config
+or similar flag to the Git executable.
+
+=== Submodule inheritance
+
+It's possible some submodules may want to run the identical set of hooks that
+their superrepo runs. While a globally-configured hook set is helpful, it's not
+a great solution for users who have multiple repos-with-submodules under the
+same user. It would be useful for submodules to learn how to run hooks from
+their superrepo's config, or inherit that hook setting.
+
+== Glossary
+
+*hook event*
+
+A point during Git's execution where user scripts may be run, for example,
+_prepare-commit-msg_ or _pre-push_.
+
+*hook command*
+
+A user script or executable which will be run on one or more hook events.