[0/2] describe and diff: implement --no-optional-locks

Message ID	pull.1872.git.1741240685.gitgitgadget@gmail.com (mailing list archive)
Headers	show Received: from mail-wr1-f44.google.com (mail-wr1-f44.google.com [209.85.221.44]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A81038DFC for <git@vger.kernel.org>; Thu, 6 Mar 2025 05:58:08 +0000 (UTC) Message-Id: <pull.1872.git.1741240685.gitgitgadget@gmail.com> From: "Benjamin Woodruff via GitGitGadget" <gitgitgadget@gmail.com> Date: Thu, 06 Mar 2025 05:58:03 +0000 Subject: [PATCH 0/2] describe and diff: implement --no-optional-locks Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: Benjamin Woodruff <github@benjam.info>
Series	describe and diff: implement --no-optional-locks \| expand [0/2] describe and diff: implement --no-optional-locks [1/2] describe: implement --no-optional-locks [2/2] diff: implement --no-optional-locks

Derrick Stolee via GitGitGadget March 6, 2025, 5:58 a.m. UTC

git describe and git diff may update the index in the background for similar
performance reasons to git-status. These patches implement the
--no-optional-locks option for these commands, which allows scripts to
bypass this behavior.

I'm implementing this as a solution to a stale lockfile issue we've
sporadically encountered due to a build script that runs git describe. We're
mitigating this issue by using libgit2 in our build script, which does not
have the same background update behavior for its git_describe_workdir
implementation, but it would be nice to have this option supported in the
git CLI.

Tests and documentation changes are included!

Benjamin Woodruff (2):
  describe: implement --no-optional-locks
  diff: implement --no-optional-locks

 Documentation/config/diff.adoc     |  4 ++-
 Documentation/git-describe.adoc    | 12 ++++++++
 Documentation/git.adoc             |  4 ++-
 builtin/describe.c                 | 12 ++++----
 builtin/diff.c                     |  4 +++
 t/meson.build                      |  1 +
 t/t4070-diff-auto-refresh-index.sh | 46 ++++++++++++++++++++++++++++++
 t/t6120-describe.sh                |  8 ++++++
 8 files changed, 84 insertions(+), 7 deletions(-)
 create mode 100755 t/t4070-diff-auto-refresh-index.sh


base-commit: e969bc875963a10890d61ba84eab3a460bd9e535
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1872%2Fbgw%2Fbgw%2Fdiff-describe-optional-locks-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1872/bgw/bgw/diff-describe-optional-locks-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1872

Junio C Hamano March 6, 2025, 4:11 p.m. UTC | #1

"Benjamin Woodruff via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> git describe and git diff may update the index in the background for similar
> performance reasons to git-status.

That is a wrong reasoning that is completely opposite, though.

The commands at the Porcelain level, like "status" and "diff",
refresh the index for the CORRECTNESS purposes.

The commands at the plumbing level, which are designed to be used in
your own scripts, like "diff-files" and "diff-index", do not refresh
the index for the performance purposes.  If your own script wants to
produce correct result, your script IS responsible for refreshing
the index after its last modification to working tree files before
it starts to use the plumbing commands to inspect which ones are
modified and which ones are not.  This is so that your script has
more control over when the index is refreshed.  It does not have to
pay cost to run refresh for each Git command it invokes, if it knows
that it does not make any modification between the two invocations;
it can instead refresh just once and then run these two plumbing
commands.

So, it would be absolute no-no to make the Porcelain commands like
"describe" and "diff" not to refresh the index before they work by
default.  As an optional behaviour, it might be acceptable if there
is no other reasonable solutions (like, using plumbing commands if
the callers are not you typing but your scripts calling them).

Jeff King March 9, 2025, 3:39 a.m. UTC | #2

On Thu, Mar 06, 2025 at 08:11:09AM -0800, Junio C Hamano wrote:

> "Benjamin Woodruff via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
> 
> > git describe and git diff may update the index in the background for similar
> > performance reasons to git-status.
> 
> That is a wrong reasoning that is completely opposite, though.
> 
> The commands at the Porcelain level, like "status" and "diff",
> refresh the index for the CORRECTNESS purposes.

Right, but "status" supports --no-optional-locks already.

So I think these patches are wrong, but the goal is reasonable. What
git-status does with --no-optional-locks is to update the index
internally for its _own_ use (giving it the correct results), but not to
lock nor write out the resulting index (to avoid conflicting with other
running programs). So it's pessimal (losing the opportunity to share
what it learned) but prevents lock contention.

And I think other programs could do that. I even wrote back in
27344d6a6c (git: add --no-optional-locks option, 2017-09-27):

  I've punted here on finding more callers to convert, since "status" is
  the obvious one to call as a repeated background job. But "git diff"'s
  opportunistic refresh of the index may be a good candidate.

I must admit that I didn't imagine "describe" is something that somebody
would run a lot in the background, but there's not really any harm in
having it support optional locks if somebody cares about that case.

> The commands at the plumbing level, which are designed to be used in
> your own scripts, like "diff-files" and "diff-index", do not refresh
> the index for the performance purposes.  If your own script wants to
> produce correct result, your script IS responsible for refreshing
> the index after its last modification to working tree files before
> it starts to use the plumbing commands to inspect which ones are
> modified and which ones are not.  This is so that your script has
> more control over when the index is refreshed.  It does not have to
> pay cost to run refresh for each Git command it invokes, if it knows
> that it does not make any modification between the two invocations;
> it can instead refresh just once and then run these two plumbing
> commands.

Assuming --no-optional-locks is being used as intended, the idea is that
the script cannot touch the index at all, because it is running in the
background and does not want lock contention with things the user is
doing in the foreground. So it cannot naively do a single index refresh
followed by plumbing commands like diff-files. Either it must:

  - accept that diff-files might return stat-dirty results (yuck)

  - use its own index that is separate from the regular .git/index file.
    But that may be overly slow, since the index "update" would rewrite
    the whole thing from scratch. Of course all of our index writes are
    from scratch, but you'd pay the price even when there is nothing to
    update.

  - use a command which operates all in a single process, with an
    in-memory index that is updated but not written out (e.g., "git
    --no-optional-locks status").

-Peff

Junio C Hamano March 10, 2025, 12:25 p.m. UTC | #3

Jeff King <peff@peff.net> writes:

> .... What
> git-status does with --no-optional-locks is to update the index
> internally for its _own_ use (giving it the correct results), but not to
> lock nor write out the resulting index (to avoid conflicting with other
> running programs). So it's pessimal (losing the opportunity to share
> what it learned) but prevents lock contention.

Yup, that sounds somewhat sensible.  I also have to wonder, other
than commands that are clearly about changing the repository state
like "add", the inspection commands like diff and status should
always opportunistically write the index back, without even being
asked?

> ... Either it must:
>
>   - accept that diff-files might return stat-dirty results (yuck)
>
>   - use its own index that is separate from the regular .git/index file.
>     But that may be overly slow, since the index "update" would rewrite
>     the whole thing from scratch. Of course all of our index writes are
>     from scratch, but you'd pay the price even when there is nothing to
>     update.
>
>   - use a command which operates all in a single process, with an
>     in-memory index that is updated but not written out (e.g., "git
>     --no-optional-locks status").

Jeff King March 10, 2025, 4:08 p.m. UTC | #4

On Mon, Mar 10, 2025 at 05:25:32AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > .... What
> > git-status does with --no-optional-locks is to update the index
> > internally for its _own_ use (giving it the correct results), but not to
> > lock nor write out the resulting index (to avoid conflicting with other
> > running programs). So it's pessimal (losing the opportunity to share
> > what it learned) but prevents lock contention.
> 
> Yup, that sounds somewhat sensible.  I also have to wonder, other
> than commands that are clearly about changing the repository state
> like "add", the inspection commands like diff and status should
> always opportunistically write the index back, without even being
> asked?

Yeah, certainly it is a potential source of confusion to have a
conceptually read-only operation take locks or modify the repo state.

But I'm not sure we have a sense of how valuable that optimization is in
practice. After touching some files, every git-status, git-diff, etc
would end up re-hashing those files instead of using the stat cache.
But maybe that is lost in the noise of reading the files to actually do
diffs, etc? I dunno. I expect it is more important for status, which
probably does not need to read the whole file contents in most cases
(and which may be run a lot from the user's prompt, etc).

It seems like a big and possibly risky departure from what we've done
for so many years. I'm inclined not to rock the boat too much. ;)

-Peff

Junio C Hamano March 10, 2025, 6:53 p.m. UTC | #5

Jeff King <peff@peff.net> writes:

> But maybe that is lost in the noise of reading the files to actually do
> diffs, etc? I dunno. I expect it is more important for status, which
> probably does not need to read the whole file contents in most cases
> (and which may be run a lot from the user's prompt, etc).

Yeah, and old timers who run "diff --raw" as if it were a quick
analogue for "status" also would notice.

> It seems like a big and possibly risky departure from what we've done
> for so many years. I'm inclined not to rock the boat too much. ;)

Certainly not right now.  But adding a command line option is even
worse as we would have to carry the support for it for practically
forever X-<.

Benjamin Woodruff March 10, 2025, 8:50 p.m. UTC | #6

On Mon, Mar 10, 2025, at 11:53 AM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
>> But maybe that is lost in the noise of reading the files to actually do
>> diffs, etc? I dunno. I expect it is more important for status, which
>> probably does not need to read the whole file contents in most cases
>> (and which may be run a lot from the user's prompt, etc).
>
> Yeah, and old timers who run "diff --raw" as if it were a quick
> analogue for "status" also would notice.

Thanks for your reviews, Junio and Jeff.

>>>> I must admit that I didn't imagine "describe" is something that
>>>> somebody would run a lot in the background

It might help if I start with some more context of why I wrote this
patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call
`git describe --dirty`. You can see that code here:
https://github.com/vercel/next.js/pull/76889/files

We use `vergen-gitcl` to generate version identifiers for an on-disk
cache. This cache stores results of thousands of functions and has no
backwards compatibility. We want to invalidate it when *any* of the code
changes. `git-describe` felt like a good fit for that, as it gives us a
unique identifier that's still reasonably user-friendly.

However, we discovered that we'd frequently end up with stale git
lockfiles. This appeared to be due some combination of IDE tools that
run the build in the background (i.e. the rust-analyzer LSP), behavior
that causes builds to sometimes get killed before completion, and the
fact that `git describe --dirty` takes a lock.

I've worked around this on our end, by re-implementing `describe`'s
`--dirty` flag on top of `status`:
<https://github.com/rustyhorde/vergen/pull/406>

So, from my end, there's no urgency to get this change in, or to add
support for `diff` (we only care about `describe`). Rather, I felt like
this might be a footgun for other users, and wanted to do the right
thing by submitting an upstream change.

>>> git describe and git diff may update the index in the background for
>>> similar performance reasons to git-status.
>>
>> That is a wrong reasoning that is completely opposite, though.
>> 
>> The commands at the Porcelain level, like "status" and "diff",
>> refresh the index for the CORRECTNESS purposes.
>
> Right, but "status" supports --no-optional-locks already.

Does this mean the documentation in `git-status` is incorrect? It
implies that the background refresh is only for performance reasons.
That's where I got this idea from:
<https://git-scm.com/docs/git-status#_background_refresh>

It's also worth noting that libgit2 does not do this background refresh
by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`).
I think that makes sense for libgit2's typical use-cases, but it is a
divergence in behavior.

>> Yeah, certainly it is a potential source of confusion to have a
>> conceptually read-only operation take locks or modify the repo state.
>>
>> [...]
>>
>> It seems like a big and possibly risky departure from what we've done
>> for so many years. I'm inclined not to rock the boat too much. ;)
>
> Certainly not right now.  But adding a command line option is even
> worse as we would have to carry the support for it for practically
> forever X-<.

What about if we reduce this to documentation changes? That alone
would've saved me a lot of pain of trying to figure out what does and
doesn't take a lock:

- Explicitly document that `--no-optional-locks` only changes behavior
  for `git status`.

- Document that the `--dirty` flag on `describe` will lead to
  `status`-like background refresh behavior.

- Change the documentation for status's background refresh to indicate
  that there is a subtle difference in behavior caused by
  `--no-optional-locks`? I lack sufficient context on how best to
  describe or explain this.

- Leave `git-diff` documentation as-is. I think the current
  documentation of `diff.autoRefreshIndex` is sufficient?

Junio C Hamano March 10, 2025, 11:04 p.m. UTC | #7

"Benjamin Woodruff" <github@benjam.info> writes:

>>> The commands at the Porcelain level, like "status" and "diff",
>>> refresh the index for the CORRECTNESS purposes.
>>
>> Right, but "status" supports --no-optional-locks already.
>
> Does this mean the documentation in `git-status` is incorrect? It
> implies that the background refresh is only for performance reasons.

It has to do, and it does, refresh the in-core index for correctness
reasons.  Writing the result out to the on-disk file for _other_
processes that will use the index later is a performance hack, and
may or may not happen (it is up to what repo_update_index_if_able()
does).

If we do not care about redundant refreshing these other processes
need to do before they start, we do not have to write the resulting
in-core index out to the disk.

Writing the in-core index out for other processes has one downside,
which is that for correctness it has to be done under lock.  We
cannot say "we tried to be nice to others by writing the index file
out, but we just found out that there is a competing process trying
to open/write the index so let's skip writing it and let them do
their thing---they are responsible for refreshing the index for
their own use".  There needs some coordination, i.e. when we start
writing the result of our refreshing the index for others, if other
processes need to pay attention to that fact and wait for a while
before reading the index, then they benefit because they do not have
to refresh the index themselves.  Even though our programs, by
calling hold_lock_file_for_update(), know how to wait for a bit when
this happens, not everybody cooperates.

Jeff King March 11, 2025, 2:10 a.m. UTC | #8

On Mon, Mar 10, 2025 at 01:50:36PM -0700, Benjamin Woodruff wrote:

> It might help if I start with some more context of why I wrote this
> patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call
> `git describe --dirty`. You can see that code here:
> https://github.com/vercel/next.js/pull/76889/files
> 
> We use `vergen-gitcl` to generate version identifiers for an on-disk
> cache. This cache stores results of thousands of functions and has no
> backwards compatibility. We want to invalidate it when *any* of the code
> changes. `git-describe` felt like a good fit for that, as it gives us a
> unique identifier that's still reasonably user-friendly.
> 
> However, we discovered that we'd frequently end up with stale git
> lockfiles. This appeared to be due some combination of IDE tools that
> run the build in the background (i.e. the rust-analyzer LSP), behavior
> that causes builds to sometimes get killed before completion, and the
> fact that `git describe --dirty` takes a lock.

Yeah, that is not quite the original use case that --no-optional-locks
was designed for (i.e., simultaneous contention), but I think it is a
reasonable application of the flag.

> >>> git describe and git diff may update the index in the background for
> >>> similar performance reasons to git-status.
> >>
> >> That is a wrong reasoning that is completely opposite, though.
> >> 
> >> The commands at the Porcelain level, like "status" and "diff",
> >> refresh the index for the CORRECTNESS purposes.
> >
> > Right, but "status" supports --no-optional-locks already.
> 
> Does this mean the documentation in `git-status` is incorrect? It
> implies that the background refresh is only for performance reasons.
> That's where I got this idea from:
> <https://git-scm.com/docs/git-status#_background_refresh>

I think Junio gave an explanation here, so I won't repeat that. But I
also think both of us may have been a bit confused about the changes
your patches are making, because there's some subtlety.

The important thing to keep in mind is that there are _two_ steps:
refreshing the in-core index and writing the result out to the on-disk
file. With --no-optional-locks we must continue to do the first step
(for correctness), and skip the second step.

So looking at your patch 1/2 for git-describe, it is doing the right
thing: we still call refresh_index() always, and only skip the calls to
repo_hold_locked_index() and repo_update_index_if_able().

But one thing that puzzles me is that we read and refresh the index
first and only _then_ take a lock. Which seems wrong to me, as we could
racily overwrite an intermediate write from somebody else that we never
even saw (e.g., imagine you call "git add" at just the wrong moment).

That is not a bug in your code, but an existing problem that I think
made it harder to understand your change (and probably one we should
fix regardless).

Your patch 2/2 for git-diff is what I thought was actually wrong, but
after digging further, I'm not so sure.

In your patch we return early from refresh_index_quietly(), without
actually refreshing the in-core index. So I _thought_ that meant we'd
produce a wrong answer for something like this:

  $ touch git.c
  $ ./git --no-optional-locks diff

where we should report "no changes", but would instead find the
stat-dirty git.c (just like a plumbing "git diff-files" would). But
that doesn't happen!

That's because refresh_index_quietly() runs after the diff has completed
anyway. The real magic is in diffcore_skip_stat_unmatch(), which
processes individual stat-dirty entries and suppresses them (when
there's no actual content change).

So the call in refresh_index_quietly() really is just about updating
what we're about to write out, and your patch is correct to bail from
the whole function (if we are not writing it out, there is no purpose in
refreshing at that point).

So as far as I can tell the patches are doing the right thing. But I
think the commit messages probably need to describe those subtleties and
argue that the change is correct. Bonus points if a preparatory patch
fixes the race in git-describe. ;)

> It's also worth noting that libgit2 does not do this background refresh
> by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`).
> I think that makes sense for libgit2's typical use-cases, but it is a
> divergence in behavior.

Yes, I think that is probably a reasonable default for libgit2, where
you'd expect everything to happen in a single process (that can share
the in-core index).

-Peff

[0/2] describe and diff: implement --no-optional-locks

Message

Comments