[v2] refs: implement reference transaction hook
diff mbox series

Message ID 04116cc57ab37eeb50bd51a065a7c06503493bf3.1591186875.git.ps@pks.im
State New
Headers show
Series
  • [v2] refs: implement reference transaction hook
Related show

Commit Message

Patrick Steinhardt June 3, 2020, 12:27 p.m. UTC
The low-level reference transactions used to update references are
currently completely opaque to the user. While certainly desirable in
most usecases, there are some which might want to hook into the
transaction to observe all queued reference updates as well as observing
the abortion or commit of a prepared transaction.

One such usecase would be to have a set of replicas of a given Git
repository, where we perform Git operations on all of the repositories
at once and expect the outcome to be the same in all of them. While
there exist hooks already for a certain subset of Git commands that
could be used to implement a voting mechanism for this, many others
currently don't have any mechanism for this.

The above scenario is the motivation for the new "reference-transaction"
hook that reaches directly into Git's reference transaction mechanism.
The hook receives as parameter the current state the transaction was
moved to ("prepared", "committed" or "aborted") and gets via its
standard input all queued reference updates. While the exit code gets
ignored in the "committed" and "aborted" states, a non-zero exit code in
the "prepared" state will cause the transaction to be aborted
prematurely.

Given the usecase described above, a voting mechanism can now be
implemented via this hook: as soon as it gets called, it will take all
of stdin and use it to cast a vote to a central service. When all
replicas of the repository agree, the hook will exit with zero,
otherwise it will abort the transaction by returning non-zero. The most
important upside is that this will catch _all_ commands writing
references at once, allowing to implement strong consistency for
reference updates via a single mechanism.

In order to test the impact on the case where we don't have any
"reference-transaction" hook installed in the repository, this commit
introduces a new performance test for git-update-refs(1). Run against an
empty repository, it produces the following results:

  Test                                   HEAD~             HEAD
  ------------------------------------------------------------------------------
  1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
  1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%

So the overhead is around ~1.5%. Given that git-update-refs(1) is a
near-direct wrapper around reference transactions, there likely is no
other command that is impacted much worse than this.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

The main adjustment in this second version is that I merged the
previously three hooks into a single one that gets the transaction state
as its first parameter, as proposed by Gábor. This is mainly to enable
atomic replacement of all three scripts, even though it could still be
that the hook gets replaced during a session. But I think it makes sense
regardless to merge these hooks into a single one.

I've also made changes based on Junio's feedback and added a benchmark
that exercises git-update-refs(1) as a proxy for this change. I guess
the ~1.5% penalty isn't too bad. It might be improved by caching hook
existence, but I don't think it necessary right now.

Thanks to both of you for your feedback!

Patrick

 Documentation/githooks.txt       |  29 ++++++++
 refs.c                           |  72 +++++++++++++++++++-
 t/perf/p1400-update-ref.sh       |  31 +++++++++
 t/t1416-ref-transaction-hooks.sh | 109 +++++++++++++++++++++++++++++++
 4 files changed, 239 insertions(+), 2 deletions(-)
 create mode 100755 t/perf/p1400-update-ref.sh
 create mode 100755 t/t1416-ref-transaction-hooks.sh

Comments

Taylor Blau June 3, 2020, 4:51 p.m. UTC | #1
Hi Patrick,

On Wed, Jun 03, 2020 at 02:27:50PM +0200, Patrick Steinhardt wrote:
> In order to test the impact on the case where we don't have any
> "reference-transaction" hook installed in the repository, this commit
> introduces a new performance test for git-update-refs(1). Run against an
> empty repository, it produces the following results:
>
>   Test                                   HEAD~             HEAD
>   ------------------------------------------------------------------------------
>   1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
>   1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%
>
> So the overhead is around ~1.5%. Given that git-update-refs(1) is a
> near-direct wrapper around reference transactions, there likely is no
> other command that is impacted much worse than this.

This is a serious performance regression that is worth considering. For
example, a 1.5% slow-down on average in reference transactions would
cause be enough for me to bisect the regression down to see what
changed.

What are ways that this could be avoided? I bet that this would work
quite well with Emily's work on hooks, where we could check in the
configuration first whether a hook is even configured.

Could this be integrated with that? If not, could you cache the result
of whether or not the hook exists, and/or implement some mechanism to
say something like, for eg., "only run reference transaction hooks
core.blah = 1" is true?

I haven't myself looked seriously at your patch (although I do plan on
doing so, but haven't yet had time), but I think that this should be
considered thoroughly before moving forward.

Thanks,
Taylor
Han-Wen Nienhuys June 3, 2020, 5:44 p.m. UTC | #2
On Wed, Jun 3, 2020 at 2:27 PM Patrick Steinhardt <ps@pks.im> wrote:
> The low-level reference transactions used to update references are
> currently completely opaque to the user. While certainly desirable in
> most usecases, there are some which might want to hook into the
> transaction to observe all queued reference updates as well as observing
> the abortion or commit of a prepared transaction.

I have an alternate suggestion, but it would depend on having
reftable.  In reftable, the transaction is

  0) create tables.list.lock
  1) write out new reftable(s),
  2) renaming tables.list.lock to tables.list.

If you insert a hook between 1) and 2), the hook would not need to be
supplied with any data, so it would have minimal performance impact.
If the hook runs, it can compare tables.list and tables.list.lock,
read out the new reftables, and make decisions based on that. Or
alternatively, the transaction could be passed a list of reftables,
that the hook could then process at its leisure.

> +For each reference update that was added to the transaction, the hook
> +receives on standard input a line of the format:
> +
> +  <old-value> SP <new-value> SP <ref-name> LF

Does this work for symrefs as well? Should it?
Junio C Hamano June 3, 2020, 6:03 p.m. UTC | #3
Han-Wen Nienhuys <hanwen@google.com> writes:

>> +For each reference update that was added to the transaction, the hook
>> +receives on standard input a line of the format:
>> +
>> +  <old-value> SP <new-value> SP <ref-name> LF
>
> Does this work for symrefs as well? Should it?

That's a good question.

I am assuming that you are asking about updating the value of the
underlying ref the symbolic ref points at (i.e. <ref-name> would be
"HEAD" and old- and new-value are commits at the tip of the branch
that is currently checked out), and not about repointing the
symbolic ref to a different underlying ref.  I suspect we do not
want to.  In such a case, we'd have a separate "refs/heads/master
moved from old to new" record anyway, so we probably should not even
talk about symbolic refs when we report "a ref got updated to point
a different object" in the above format.

And obviously the above format cannot talk about repointing HEAD to
a different branch (well, it can in a sense that old and new branch
both would have some object name that can be filled in the old- and
new-value fields, but the record misses the most interesting part of
the "branch switching" without the actual old and new refnames), so
there must be another format that allows us to report such an event.

    "Sym" SP <old-refname> SP <new-refname> SP <symbolic-ref-name> LF

or something like that, so that

	$ git checkout next

may yield

    "Sym" SP "refs/heads/master" SP "refs/heads/next" SP "HEAD"

if we originally were on the 'master' branch.  Obviously we need a
way to represent different kinds of transition, like (1) a new
symbolic ref has been created, pointing at some underlying ref
(which may or may not exist), (2) an existing ref that pointed
directly at an object now has become a symbolic ref, (3) a symbolic
ref stopped pointing at a ref and now points directly at an object,
and (4) a symbolic ref has been removed.

Both (1) and (4) would be a trivial variation of the above format, I
guess, using some special token to denote void (say, we'd use an
empty string for that):

    "Sym" "" SP <new-refname> SP <symbolic-ref-name> LF
    "Sym" <old-refname> SP "" SP <symbolic-ref-name> LF

(2) and (3) are interesting.  We could represent them as two
separate operations inside the same transaction, e.g.

    "Sym" "refs/heads/master" SP "" SP "HEAD" LF
    "0{40}" SP "b3d7a52fac39193503a0b6728771d1bf6a161464" SP "HEADD" LF

or we'd need to invent different type of record to represent such
"detaching" and "reattaching" events.
Patrick Steinhardt June 4, 2020, 7:36 a.m. UTC | #4
On Wed, Jun 03, 2020 at 10:51:42AM -0600, Taylor Blau wrote:
> Hi Patrick,
> 
> On Wed, Jun 03, 2020 at 02:27:50PM +0200, Patrick Steinhardt wrote:
> > In order to test the impact on the case where we don't have any
> > "reference-transaction" hook installed in the repository, this commit
> > introduces a new performance test for git-update-refs(1). Run against an
> > empty repository, it produces the following results:
> >
> >   Test                                   HEAD~             HEAD
> >   ------------------------------------------------------------------------------
> >   1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
> >   1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%
> >
> > So the overhead is around ~1.5%. Given that git-update-refs(1) is a
> > near-direct wrapper around reference transactions, there likely is no
> > other command that is impacted much worse than this.
> 
> This is a serious performance regression that is worth considering. For
> example, a 1.5% slow-down on average in reference transactions would
> cause be enough for me to bisect the regression down to see what
> changed.
> 
> What are ways that this could be avoided? I bet that this would work
> quite well with Emily's work on hooks, where we could check in the
> configuration first whether a hook is even configured.
> 
> Could this be integrated with that? If not, could you cache the result
> of whether or not the hook exists, and/or implement some mechanism to
> say something like, for eg., "only run reference transaction hooks
> core.blah = 1" is true?

I wasn't aware of her work until now, so I'll take a look.

Meanwhile, I toyed around with performance and tried out two different
caching mechanisms:

    - map-cache: `find_hook()` uses a map of hook names mapped to their
      resolved hook path (or `NULL` if none was found). This is a
      generic mechanism working across all hooks, but has some overhead
      in maintaining the map.

    - reftx-cache: `run_transaction_hook()` caches the path returned by
      `find_hook()`. It's got less overhead as it only caches the path,
      but obviously only applies to the reference-transaction hook.

In theory, we could go even further and cache the hook's file
descriptor, executing it via fexecve(3P) whenever it's required, but I
didn't go down that route yet. This would also solve the atomicity
issue, though, if the administrator replaces the reference-transactions
hook halfway through the transaction.

Benchmarking results are mixed, mostly in the sense that I can choose
which run of the benchmarks I take in order to have my own work look
better or worse, as timings vary quite a lot between runs. Which
probably hints at the fact that the benchmarks themselves aren't really
reliable. The issue is that a single git-update-ref(1) run finishes so
quick that it's hard to measure with our benchmarks, but spawning
thousands of them to get something different than 0.0s very much depends
on the operating system and thus fluctuates. On the other hand, if we
only spawn a single git-update-refs and have it perform a few thousand
ref updates, overhead of the hook will not show at all as it is only
executed once per prepare/commit of the transaction.

The following timings are for the case where we execute

    git update-ref refs/heads/update-branch PRE POST &&
    git update-ref refs/heads/update-branch POST PRE

respectively

    git update-ref refs/heads/new POST &&
    git update-ref -d refs/heads/new

a thousand times:

Test                                   master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
------------------------------------------------------------------------------------------------------------------------------
1400.2: update existing reference      1.96(1.50+0.53)   2.00(1.54+0.53) +2.0%   2.02(1.54+0.55) +3.1%   1.98(1.52+0.52) +1.0%
1400.3: create and destroy reference   1.74(1.33+0.49)   1.77(1.38+0.47) +1.7%   1.77(1.36+0.48) +1.7%   1.76(1.35+0.49) +1.1%

For such a short-lived program like git-update-refs(1), one can see that
the overhead of using a map negatively impacts performance compared to
the no-cache case. But using the reftx-cache roughly cuts the overhead
in half as expected, as we only need to look up the hook once instead of
twice.

And here's the results if we use a single `git update-ref --stdin` with a
thousand reference updates at once:

Test                             master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
------------------------------------------------------------------------------------------------------------------------
1400.2: git update-ref --stdin   0.21(0.09+0.12)   0.21(0.07+0.14) +0.0%   0.21(0.07+0.13) +0.0%   0.21(0.07+0.13) +0.0%

As expected, there's nothing much to be seen here because there's only a
single transaction for all ref updates and, as a result, at most two
calls to `access(refhook_path, X_OK)`.

Patrick
Taylor Blau June 15, 2020, 4:46 p.m. UTC | #5
Hi Patrick,

Sorry for the slow response. I was out of the office last week and am
only just now getting a chance to catch up on the backlog of emails that
I missed.

On Thu, Jun 04, 2020 at 09:36:32AM +0200, Patrick Steinhardt wrote:
> On Wed, Jun 03, 2020 at 10:51:42AM -0600, Taylor Blau wrote:
> > Hi Patrick,
> >
> > On Wed, Jun 03, 2020 at 02:27:50PM +0200, Patrick Steinhardt wrote:
> > > In order to test the impact on the case where we don't have any
> > > "reference-transaction" hook installed in the repository, this commit
> > > introduces a new performance test for git-update-refs(1). Run against an
> > > empty repository, it produces the following results:
> > >
> > >   Test                                   HEAD~             HEAD
> > >   ------------------------------------------------------------------------------
> > >   1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
> > >   1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%
> > >
> > > So the overhead is around ~1.5%. Given that git-update-refs(1) is a
> > > near-direct wrapper around reference transactions, there likely is no
> > > other command that is impacted much worse than this.
> >
> > This is a serious performance regression that is worth considering. For
> > example, a 1.5% slow-down on average in reference transactions would
> > cause be enough for me to bisect the regression down to see what
> > changed.
> >
> > What are ways that this could be avoided? I bet that this would work
> > quite well with Emily's work on hooks, where we could check in the
> > configuration first whether a hook is even configured.
> >
> > Could this be integrated with that? If not, could you cache the result
> > of whether or not the hook exists, and/or implement some mechanism to
> > say something like, for eg., "only run reference transaction hooks
> > core.blah = 1" is true?
>
> I wasn't aware of her work until now, so I'll take a look.
>
> Meanwhile, I toyed around with performance and tried out two different
> caching mechanisms:
>
>     - map-cache: `find_hook()` uses a map of hook names mapped to their
>       resolved hook path (or `NULL` if none was found). This is a
>       generic mechanism working across all hooks, but has some overhead
>       in maintaining the map.
>
>     - reftx-cache: `run_transaction_hook()` caches the path returned by
>       `find_hook()`. It's got less overhead as it only caches the path,
>       but obviously only applies to the reference-transaction hook.
>
> In theory, we could go even further and cache the hook's file
> descriptor, executing it via fexecve(3P) whenever it's required, but I
> didn't go down that route yet. This would also solve the atomicity
> issue, though, if the administrator replaces the reference-transactions
> hook halfway through the transaction.
>
> Benchmarking results are mixed, mostly in the sense that I can choose
> which run of the benchmarks I take in order to have my own work look
> better or worse, as timings vary quite a lot between runs. Which
> probably hints at the fact that the benchmarks themselves aren't really
> reliable. The issue is that a single git-update-ref(1) run finishes so
> quick that it's hard to measure with our benchmarks, but spawning
> thousands of them to get something different than 0.0s very much depends
> on the operating system and thus fluctuates. On the other hand, if we
> only spawn a single git-update-refs and have it perform a few thousand
> ref updates, overhead of the hook will not show at all as it is only
> executed once per prepare/commit of the transaction.
>
> The following timings are for the case where we execute
>
>     git update-ref refs/heads/update-branch PRE POST &&
>     git update-ref refs/heads/update-branch POST PRE
>
> respectively
>
>     git update-ref refs/heads/new POST &&
>     git update-ref -d refs/heads/new
>
> a thousand times:
>
> Test                                   master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> ------------------------------------------------------------------------------------------------------------------------------
> 1400.2: update existing reference      1.96(1.50+0.53)   2.00(1.54+0.53) +2.0%   2.02(1.54+0.55) +3.1%   1.98(1.52+0.52) +1.0%
> 1400.3: create and destroy reference   1.74(1.33+0.49)   1.77(1.38+0.47) +1.7%   1.77(1.36+0.48) +1.7%   1.76(1.35+0.49) +1.1%

Huh. It is super interesting (to me, at least) that caching the set of
hooks that are on disk and should be run makes this *slower* than
without the cache. What's going on there? In p1400.2, I'd expect to see
'ref-hook-map-cache' attain at most a 2.0% slow-down, plus a little bit
to process the hook when it is there, but not much more than that.

I think that this just seems a little contrived to me. I can understand
why server administrators may want this feature, but the general
user-base of Git doesn't seem to stand to benefit much from this change
(in my own mind, at least).

So... I'm not sure where this leaves us. Maybe a 2.0% slow-down on an
already fast 'git update-ref' invocation for the average user won't be
noticeable. It certainly *will* be noticeable to administrators who
processes a much higher volume of such transactions.

> For such a short-lived program like git-update-refs(1), one can see that
> the overhead of using a map negatively impacts performance compared to
> the no-cache case. But using the reftx-cache roughly cuts the overhead
> in half as expected, as we only need to look up the hook once instead of
> twice.
>
> And here's the results if we use a single `git update-ref --stdin` with a
> thousand reference updates at once:
>
> Test                             master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> ------------------------------------------------------------------------------------------------------------------------
> 1400.2: git update-ref --stdin   0.21(0.09+0.12)   0.21(0.07+0.14) +0.0%   0.21(0.07+0.13) +0.0%   0.21(0.07+0.13) +0.0%
>
> As expected, there's nothing much to be seen here because there's only a
> single transaction for all ref updates and, as a result, at most two
> calls to `access(refhook_path, X_OK)`.

Better, but I have to imagine that real-world usage will look much more
like a thousand tiny transactions than one large one.

> Patrick

Thanks,
Taylor
Patrick Steinhardt June 16, 2020, 5:45 a.m. UTC | #6
On Mon, Jun 15, 2020 at 10:46:39AM -0600, Taylor Blau wrote:
> Hi Patrick,
> 
> Sorry for the slow response. I was out of the office last week and am
> only just now getting a chance to catch up on the backlog of emails that
> I missed.

No worries!

> On Thu, Jun 04, 2020 at 09:36:32AM +0200, Patrick Steinhardt wrote:
> > On Wed, Jun 03, 2020 at 10:51:42AM -0600, Taylor Blau wrote:
> > > Hi Patrick,
> > >
> > > On Wed, Jun 03, 2020 at 02:27:50PM +0200, Patrick Steinhardt wrote:
> > > > In order to test the impact on the case where we don't have any
> > > > "reference-transaction" hook installed in the repository, this commit
> > > > introduces a new performance test for git-update-refs(1). Run against an
> > > > empty repository, it produces the following results:
> > > >
> > > >   Test                                   HEAD~             HEAD
> > > >   ------------------------------------------------------------------------------
> > > >   1400.2: update existing reference      2.05(1.58+0.54)   2.08(1.58+0.57) +1.5%
> > > >   1400.3: create and destroy reference   1.79(1.38+0.49)   1.82(1.39+0.51) +1.7%
> > > >
> > > > So the overhead is around ~1.5%. Given that git-update-refs(1) is a
> > > > near-direct wrapper around reference transactions, there likely is no
> > > > other command that is impacted much worse than this.
> > >
> > > This is a serious performance regression that is worth considering. For
> > > example, a 1.5% slow-down on average in reference transactions would
> > > cause be enough for me to bisect the regression down to see what
> > > changed.
> > >
> > > What are ways that this could be avoided? I bet that this would work
> > > quite well with Emily's work on hooks, where we could check in the
> > > configuration first whether a hook is even configured.
> > >
> > > Could this be integrated with that? If not, could you cache the result
> > > of whether or not the hook exists, and/or implement some mechanism to
> > > say something like, for eg., "only run reference transaction hooks
> > > core.blah = 1" is true?
> >
> > I wasn't aware of her work until now, so I'll take a look.
> >
> > Meanwhile, I toyed around with performance and tried out two different
> > caching mechanisms:
> >
> >     - map-cache: `find_hook()` uses a map of hook names mapped to their
> >       resolved hook path (or `NULL` if none was found). This is a
> >       generic mechanism working across all hooks, but has some overhead
> >       in maintaining the map.
> >
> >     - reftx-cache: `run_transaction_hook()` caches the path returned by
> >       `find_hook()`. It's got less overhead as it only caches the path,
> >       but obviously only applies to the reference-transaction hook.
> >
> > In theory, we could go even further and cache the hook's file
> > descriptor, executing it via fexecve(3P) whenever it's required, but I
> > didn't go down that route yet. This would also solve the atomicity
> > issue, though, if the administrator replaces the reference-transactions
> > hook halfway through the transaction.
> >
> > Benchmarking results are mixed, mostly in the sense that I can choose
> > which run of the benchmarks I take in order to have my own work look
> > better or worse, as timings vary quite a lot between runs. Which
> > probably hints at the fact that the benchmarks themselves aren't really
> > reliable. The issue is that a single git-update-ref(1) run finishes so
> > quick that it's hard to measure with our benchmarks, but spawning
> > thousands of them to get something different than 0.0s very much depends
> > on the operating system and thus fluctuates. On the other hand, if we
> > only spawn a single git-update-refs and have it perform a few thousand
> > ref updates, overhead of the hook will not show at all as it is only
> > executed once per prepare/commit of the transaction.
> >
> > The following timings are for the case where we execute
> >
> >     git update-ref refs/heads/update-branch PRE POST &&
> >     git update-ref refs/heads/update-branch POST PRE
> >
> > respectively
> >
> >     git update-ref refs/heads/new POST &&
> >     git update-ref -d refs/heads/new
> >
> > a thousand times:
> >
> > Test                                   master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> > ------------------------------------------------------------------------------------------------------------------------------
> > 1400.2: update existing reference      1.96(1.50+0.53)   2.00(1.54+0.53) +2.0%   2.02(1.54+0.55) +3.1%   1.98(1.52+0.52) +1.0%
> > 1400.3: create and destroy reference   1.74(1.33+0.49)   1.77(1.38+0.47) +1.7%   1.77(1.36+0.48) +1.7%   1.76(1.35+0.49) +1.1%
> 
> Huh. It is super interesting (to me, at least) that caching the set of
> hooks that are on disk and should be run makes this *slower* than
> without the cache. What's going on there? In p1400.2, I'd expect to see
> 'ref-hook-map-cache' attain at most a 2.0% slow-down, plus a little bit
> to process the hook when it is there, but not much more than that.

I think the issue is that a single git-update-ref(1) invocation does so
little work that allocating the hashmap and inserting the hook already
has noticeable impact on the program's runtime. E.g. in above
benchmark, a single call to git-update-ref in p1400.2 takes roughly
0.002s on my machine. You also see this by the fact that doing a single
stat(3P) call as introduced by my patch adds a 1% performance penalty
already, and with the map cache we still have to do this single stat(3P)
call in addition to the dynamic memory allocations for the map and
insertion of the hook.

> I think that this just seems a little contrived to me. I can understand
> why server administrators may want this feature, but the general
> user-base of Git doesn't seem to stand to benefit much from this change
> (in my own mind, at least).

That's true for several hooks we have, though.

> So... I'm not sure where this leaves us. Maybe a 2.0% slow-down on an
> already fast 'git update-ref' invocation for the average user won't be
> noticeable. It certainly *will* be noticeable to administrators who
> processes a much higher volume of such transactions.

I think we should keep in mind that it's git-update-ref(1) we're talking
about, which is a nearly direct wrapper around reference transactions.
The 1% perfomance hit is thus the worst case that can happen, as there
is no other tool that does as little work around the reftx as this one.
For any other tool, I imagine the performance hit to be at worst the
same (e.g. git-branch(1)) or not noticeable at all because a single
stat(3P) call will be drowned out by other things (e.g. git-clone(1)).

That's not to say that nobody will be impacted by this change, I bet
there are setups that make heavy use of git-update-ref(1).

> > For such a short-lived program like git-update-refs(1), one can see that
> > the overhead of using a map negatively impacts performance compared to
> > the no-cache case. But using the reftx-cache roughly cuts the overhead
> > in half as expected, as we only need to look up the hook once instead of
> > twice.
> >
> > And here's the results if we use a single `git update-ref --stdin` with a
> > thousand reference updates at once:
> >
> > Test                             master            ref-hook-no-cache       ref-hook-map-cache      ref-hook-reftx-cache
> > ------------------------------------------------------------------------------------------------------------------------
> > 1400.2: git update-ref --stdin   0.21(0.09+0.12)   0.21(0.07+0.14) +0.0%   0.21(0.07+0.13) +0.0%   0.21(0.07+0.13) +0.0%
> >
> > As expected, there's nothing much to be seen here because there's only a
> > single transaction for all ref updates and, as a result, at most two
> > calls to `access(refhook_path, X_OK)`.
> 
> Better, but I have to imagine that real-world usage will look much more
> like a thousand tiny transactions than one large one.

Most likely, yes. Doesn't happen too often that one updates multiple
references at once.

Patrick

Patch
diff mbox series

diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 81f2a87e88..642471109f 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -404,6 +404,35 @@  Both standard output and standard error output are forwarded to
 `git send-pack` on the other end, so you can simply `echo` messages
 for the user.
 
+ref-transaction
+~~~~~~~~~~~~~~~
+
+This hook is invoked by any Git command that performs reference
+updates. It executes whenever a reference transaction is prepared,
+committed or aborted and may thus get called multiple times.
+
+The hook takes exactly one argument, which is the current state the
+given reference transaction is in:
+
+    - "prepared": All reference updates have been queued to the
+      transaction and references were locked on disk.
+
+    - "committed": The reference transaction was committed and all
+      references now have their respective new value.
+
+    - "aborted": The reference transaction was aborted, no changes
+      were performed and the locks have been released.
+
+For each reference update that was added to the transaction, the hook
+receives on standard input a line of the format:
+
+  <old-value> SP <new-value> SP <ref-name> LF
+
+The exit status of the hook is ignored for any state except for the
+"prepared" state. In the "prepared" state, a non-zero exit status will
+cause the transaction to be aborted. The hook will not be called with
+"aborted" state in that case.
+
 push-to-checkout
 ~~~~~~~~~~~~~~~~
 
diff --git a/refs.c b/refs.c
index 224ff66c7b..af752e1759 100644
--- a/refs.c
+++ b/refs.c
@@ -9,6 +9,7 @@ 
 #include "iterator.h"
 #include "refs.h"
 #include "refs/refs-internal.h"
+#include "run-command.h"
 #include "object-store.h"
 #include "object.h"
 #include "tag.h"
@@ -16,6 +17,7 @@ 
 #include "worktree.h"
 #include "argv-array.h"
 #include "repository.h"
+#include "sigchain.h"
 
 /*
  * List of all available backends
@@ -1986,10 +1988,61 @@  int ref_update_reject_duplicates(struct string_list *refnames,
 	return 0;
 }
 
+static int run_transaction_hook(struct ref_transaction *transaction,
+				const char *state)
+{
+	struct child_process proc = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	int saved_errno = 0, ret, i;
+	const char *hook;
+
+	hook = find_hook("reference-transaction");
+	if (!hook)
+		return 0;
+
+	argv_array_pushl(&proc.args, hook, state, NULL);
+	proc.in = -1;
+	proc.stdout_to_stderr = 1;
+	proc.trace2_hook_name = hook;
+
+	ret = start_command(&proc);
+	if (ret)
+		return ret;
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	for (i = 0; i < transaction->nr; i++) {
+		struct ref_update *update = transaction->updates[i];
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "%s %s %s\n",
+			    oid_to_hex(&update->old_oid),
+			    oid_to_hex(&update->new_oid),
+			    update->refname);
+
+		if (write_in_full(proc.in, buf.buf, buf.len) < 0) {
+			if (errno != EPIPE)
+				saved_errno = errno;
+			break;
+		}
+	}
+
+	close(proc.in);
+	sigchain_pop(SIGPIPE);
+	strbuf_release(&buf);
+
+	ret = finish_command(&proc);
+	if (ret)
+		return ret;
+
+	return saved_errno;
+}
+
 int ref_transaction_prepare(struct ref_transaction *transaction,
 			    struct strbuf *err)
 {
 	struct ref_store *refs = transaction->ref_store;
+	int ret;
 
 	switch (transaction->state) {
 	case REF_TRANSACTION_OPEN:
@@ -2012,7 +2065,17 @@  int ref_transaction_prepare(struct ref_transaction *transaction,
 		return -1;
 	}
 
-	return refs->be->transaction_prepare(refs, transaction, err);
+	ret = refs->be->transaction_prepare(refs, transaction, err);
+	if (ret)
+		return ret;
+
+	ret = run_transaction_hook(transaction, "prepared");
+	if (ret) {
+		ref_transaction_abort(transaction, err);
+		die(_("ref updates aborted by hook"));
+	}
+
+	return 0;
 }
 
 int ref_transaction_abort(struct ref_transaction *transaction,
@@ -2036,6 +2099,8 @@  int ref_transaction_abort(struct ref_transaction *transaction,
 		break;
 	}
 
+	run_transaction_hook(transaction, "aborted");
+
 	ref_transaction_free(transaction);
 	return ret;
 }
@@ -2064,7 +2129,10 @@  int ref_transaction_commit(struct ref_transaction *transaction,
 		break;
 	}
 
-	return refs->be->transaction_finish(refs, transaction, err);
+	ret = refs->be->transaction_finish(refs, transaction, err);
+	if (!ret)
+		run_transaction_hook(transaction, "committed");
+	return ret;
 }
 
 int refs_verify_refname_available(struct ref_store *refs,
diff --git a/t/perf/p1400-update-ref.sh b/t/perf/p1400-update-ref.sh
new file mode 100755
index 0000000000..4f4519529e
--- /dev/null
+++ b/t/perf/p1400-update-ref.sh
@@ -0,0 +1,31 @@ 
+#!/bin/sh
+
+test_description="Tests performance of update-ref"
+
+. ./perf-lib.sh
+
+test_perf_fresh_repo
+
+test_expect_success "setup" '
+	test_commit PRE &&
+	test_commit POST &&
+	git branch update-branch
+'
+
+test_perf "update existing reference" '
+	for i in $(test_seq 1000)
+	do
+		git update-ref refs/heads/update-branch PRE POST &&
+		git update-ref refs/heads/update-branch POST PRE
+	done
+'
+
+test_perf "create and destroy reference" '
+	for i in $(test_seq 1000)
+	do
+		git update-ref refs/heads/new POST
+		git update-ref -d refs/heads/new
+	done
+'
+
+test_done
diff --git a/t/t1416-ref-transaction-hooks.sh b/t/t1416-ref-transaction-hooks.sh
new file mode 100755
index 0000000000..da58d867a5
--- /dev/null
+++ b/t/t1416-ref-transaction-hooks.sh
@@ -0,0 +1,109 @@ 
+#!/bin/sh
+
+test_description='reference transaction hooks'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	mkdir -p .git/hooks &&
+	test_commit PRE &&
+	test_commit POST &&
+	POST_OID=$(git rev-parse POST)
+'
+
+test_expect_success 'hook allows updating ref if successful' '
+	test_when_finished "rm .git/hooks/reference-transaction" &&
+	git reset --hard PRE &&
+	write_script .git/hooks/reference-transaction <<-\EOF &&
+		echo "$*" >>actual
+	EOF
+	cat >expect <<-EOF &&
+		prepared
+		committed
+	EOF
+	git update-ref HEAD POST &&
+	test_cmp expect actual
+'
+
+test_expect_success 'hook aborts updating ref in prepared state' '
+	test_when_finished "rm .git/hooks/reference-transaction" &&
+	git reset --hard PRE &&
+	write_script .git/hooks/reference-transaction <<-\EOF &&
+		if test "$1" = prepared
+		then
+			exit 1
+		fi
+	EOF
+	test_must_fail git update-ref HEAD POST 2>err &&
+	test_i18ngrep "ref updates aborted by hook" err
+'
+
+test_expect_success 'hook gets all queued updates in prepared state' '
+	test_when_finished "rm .git/hooks/reference-transaction actual" &&
+	git reset --hard PRE &&
+	write_script .git/hooks/reference-transaction <<-\EOF &&
+		if test "$1" = prepared
+		then
+			while read -r line
+			do
+				printf "%s\n" "$line"
+			done >actual
+		fi
+	EOF
+	cat >expect <<-EOF &&
+		$ZERO_OID $POST_OID HEAD
+		$ZERO_OID $POST_OID refs/heads/master
+	EOF
+	git update-ref HEAD POST <<-EOF &&
+		update HEAD $ZERO_OID $POST_OID
+		update refs/heads/master $ZERO_OID $POST_OID
+	EOF
+	test_cmp expect actual
+'
+
+test_expect_success 'hook gets all queued updates in committed state' '
+	test_when_finished "rm .git/hooks/reference-transaction actual" &&
+	git reset --hard PRE &&
+	write_script .git/hooks/reference-transaction <<-\EOF &&
+		if test "$1" = committed
+		then
+			while read -r line
+			do
+				printf "%s\n" "$line"
+			done >actual
+		fi
+	EOF
+	cat >expect <<-EOF &&
+		$ZERO_OID $POST_OID HEAD
+		$ZERO_OID $POST_OID refs/heads/master
+	EOF
+	git update-ref HEAD POST &&
+	test_cmp expect actual
+'
+
+test_expect_success 'hook gets all queued updates in aborted state' '
+	test_when_finished "rm .git/hooks/reference-transaction actual" &&
+	git reset --hard PRE &&
+	write_script .git/hooks/reference-transaction <<-\EOF &&
+		if test "$1" = aborted
+		then
+			while read -r line
+			do
+				printf "%s\n" "$line"
+			done >actual
+		fi
+	EOF
+	cat >expect <<-EOF &&
+		$ZERO_OID $POST_OID HEAD
+		$ZERO_OID $POST_OID refs/heads/master
+	EOF
+	git update-ref --stdin <<-EOF &&
+		start
+		update HEAD POST $ZERO_OID
+		update refs/heads/master POST $ZERO_OID
+		abort
+	EOF
+	test_cmp expect actual
+'
+
+test_done