Message ID | pull.1872.git.1741240685.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | describe and diff: implement --no-optional-locks | expand |
"Benjamin Woodruff via GitGitGadget" <gitgitgadget@gmail.com> writes: > git describe and git diff may update the index in the background for similar > performance reasons to git-status. That is a wrong reasoning that is completely opposite, though. The commands at the Porcelain level, like "status" and "diff", refresh the index for the CORRECTNESS purposes. The commands at the plumbing level, which are designed to be used in your own scripts, like "diff-files" and "diff-index", do not refresh the index for the performance purposes. If your own script wants to produce correct result, your script IS responsible for refreshing the index after its last modification to working tree files before it starts to use the plumbing commands to inspect which ones are modified and which ones are not. This is so that your script has more control over when the index is refreshed. It does not have to pay cost to run refresh for each Git command it invokes, if it knows that it does not make any modification between the two invocations; it can instead refresh just once and then run these two plumbing commands. So, it would be absolute no-no to make the Porcelain commands like "describe" and "diff" not to refresh the index before they work by default. As an optional behaviour, it might be acceptable if there is no other reasonable solutions (like, using plumbing commands if the callers are not you typing but your scripts calling them).
On Thu, Mar 06, 2025 at 08:11:09AM -0800, Junio C Hamano wrote: > "Benjamin Woodruff via GitGitGadget" <gitgitgadget@gmail.com> > writes: > > > git describe and git diff may update the index in the background for similar > > performance reasons to git-status. > > That is a wrong reasoning that is completely opposite, though. > > The commands at the Porcelain level, like "status" and "diff", > refresh the index for the CORRECTNESS purposes. Right, but "status" supports --no-optional-locks already. So I think these patches are wrong, but the goal is reasonable. What git-status does with --no-optional-locks is to update the index internally for its _own_ use (giving it the correct results), but not to lock nor write out the resulting index (to avoid conflicting with other running programs). So it's pessimal (losing the opportunity to share what it learned) but prevents lock contention. And I think other programs could do that. I even wrote back in 27344d6a6c (git: add --no-optional-locks option, 2017-09-27): I've punted here on finding more callers to convert, since "status" is the obvious one to call as a repeated background job. But "git diff"'s opportunistic refresh of the index may be a good candidate. I must admit that I didn't imagine "describe" is something that somebody would run a lot in the background, but there's not really any harm in having it support optional locks if somebody cares about that case. > The commands at the plumbing level, which are designed to be used in > your own scripts, like "diff-files" and "diff-index", do not refresh > the index for the performance purposes. If your own script wants to > produce correct result, your script IS responsible for refreshing > the index after its last modification to working tree files before > it starts to use the plumbing commands to inspect which ones are > modified and which ones are not. This is so that your script has > more control over when the index is refreshed. It does not have to > pay cost to run refresh for each Git command it invokes, if it knows > that it does not make any modification between the two invocations; > it can instead refresh just once and then run these two plumbing > commands. Assuming --no-optional-locks is being used as intended, the idea is that the script cannot touch the index at all, because it is running in the background and does not want lock contention with things the user is doing in the foreground. So it cannot naively do a single index refresh followed by plumbing commands like diff-files. Either it must: - accept that diff-files might return stat-dirty results (yuck) - use its own index that is separate from the regular .git/index file. But that may be overly slow, since the index "update" would rewrite the whole thing from scratch. Of course all of our index writes are from scratch, but you'd pay the price even when there is nothing to update. - use a command which operates all in a single process, with an in-memory index that is updated but not written out (e.g., "git --no-optional-locks status"). -Peff
Jeff King <peff@peff.net> writes: > .... What > git-status does with --no-optional-locks is to update the index > internally for its _own_ use (giving it the correct results), but not to > lock nor write out the resulting index (to avoid conflicting with other > running programs). So it's pessimal (losing the opportunity to share > what it learned) but prevents lock contention. Yup, that sounds somewhat sensible. I also have to wonder, other than commands that are clearly about changing the repository state like "add", the inspection commands like diff and status should always opportunistically write the index back, without even being asked? > ... Either it must: > > - accept that diff-files might return stat-dirty results (yuck) > > - use its own index that is separate from the regular .git/index file. > But that may be overly slow, since the index "update" would rewrite > the whole thing from scratch. Of course all of our index writes are > from scratch, but you'd pay the price even when there is nothing to > update. > > - use a command which operates all in a single process, with an > in-memory index that is updated but not written out (e.g., "git > --no-optional-locks status").
On Mon, Mar 10, 2025 at 05:25:32AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > .... What > > git-status does with --no-optional-locks is to update the index > > internally for its _own_ use (giving it the correct results), but not to > > lock nor write out the resulting index (to avoid conflicting with other > > running programs). So it's pessimal (losing the opportunity to share > > what it learned) but prevents lock contention. > > Yup, that sounds somewhat sensible. I also have to wonder, other > than commands that are clearly about changing the repository state > like "add", the inspection commands like diff and status should > always opportunistically write the index back, without even being > asked? Yeah, certainly it is a potential source of confusion to have a conceptually read-only operation take locks or modify the repo state. But I'm not sure we have a sense of how valuable that optimization is in practice. After touching some files, every git-status, git-diff, etc would end up re-hashing those files instead of using the stat cache. But maybe that is lost in the noise of reading the files to actually do diffs, etc? I dunno. I expect it is more important for status, which probably does not need to read the whole file contents in most cases (and which may be run a lot from the user's prompt, etc). It seems like a big and possibly risky departure from what we've done for so many years. I'm inclined not to rock the boat too much. ;) -Peff
Jeff King <peff@peff.net> writes: > But maybe that is lost in the noise of reading the files to actually do > diffs, etc? I dunno. I expect it is more important for status, which > probably does not need to read the whole file contents in most cases > (and which may be run a lot from the user's prompt, etc). Yeah, and old timers who run "diff --raw" as if it were a quick analogue for "status" also would notice. > It seems like a big and possibly risky departure from what we've done > for so many years. I'm inclined not to rock the boat too much. ;) Certainly not right now. But adding a command line option is even worse as we would have to carry the support for it for practically forever X-<.
On Mon, Mar 10, 2025, at 11:53 AM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> But maybe that is lost in the noise of reading the files to actually do >> diffs, etc? I dunno. I expect it is more important for status, which >> probably does not need to read the whole file contents in most cases >> (and which may be run a lot from the user's prompt, etc). > > Yeah, and old timers who run "diff --raw" as if it were a quick > analogue for "status" also would notice. Thanks for your reviews, Junio and Jeff. >>>> I must admit that I didn't imagine "describe" is something that >>>> somebody would run a lot in the background It might help if I start with some more context of why I wrote this patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call `git describe --dirty`. You can see that code here: https://github.com/vercel/next.js/pull/76889/files We use `vergen-gitcl` to generate version identifiers for an on-disk cache. This cache stores results of thousands of functions and has no backwards compatibility. We want to invalidate it when *any* of the code changes. `git-describe` felt like a good fit for that, as it gives us a unique identifier that's still reasonably user-friendly. However, we discovered that we'd frequently end up with stale git lockfiles. This appeared to be due some combination of IDE tools that run the build in the background (i.e. the rust-analyzer LSP), behavior that causes builds to sometimes get killed before completion, and the fact that `git describe --dirty` takes a lock. I've worked around this on our end, by re-implementing `describe`'s `--dirty` flag on top of `status`: <https://github.com/rustyhorde/vergen/pull/406> So, from my end, there's no urgency to get this change in, or to add support for `diff` (we only care about `describe`). Rather, I felt like this might be a footgun for other users, and wanted to do the right thing by submitting an upstream change. >>> git describe and git diff may update the index in the background for >>> similar performance reasons to git-status. >> >> That is a wrong reasoning that is completely opposite, though. >> >> The commands at the Porcelain level, like "status" and "diff", >> refresh the index for the CORRECTNESS purposes. > > Right, but "status" supports --no-optional-locks already. Does this mean the documentation in `git-status` is incorrect? It implies that the background refresh is only for performance reasons. That's where I got this idea from: <https://git-scm.com/docs/git-status#_background_refresh> It's also worth noting that libgit2 does not do this background refresh by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`). I think that makes sense for libgit2's typical use-cases, but it is a divergence in behavior. >> Yeah, certainly it is a potential source of confusion to have a >> conceptually read-only operation take locks or modify the repo state. >> >> [...] >> >> It seems like a big and possibly risky departure from what we've done >> for so many years. I'm inclined not to rock the boat too much. ;) > > Certainly not right now. But adding a command line option is even > worse as we would have to carry the support for it for practically > forever X-<. What about if we reduce this to documentation changes? That alone would've saved me a lot of pain of trying to figure out what does and doesn't take a lock: - Explicitly document that `--no-optional-locks` only changes behavior for `git status`. - Document that the `--dirty` flag on `describe` will lead to `status`-like background refresh behavior. - Change the documentation for status's background refresh to indicate that there is a subtle difference in behavior caused by `--no-optional-locks`? I lack sufficient context on how best to describe or explain this. - Leave `git-diff` documentation as-is. I think the current documentation of `diff.autoRefreshIndex` is sufficient?
"Benjamin Woodruff" <github@benjam.info> writes: >>> The commands at the Porcelain level, like "status" and "diff", >>> refresh the index for the CORRECTNESS purposes. >> >> Right, but "status" supports --no-optional-locks already. > > Does this mean the documentation in `git-status` is incorrect? It > implies that the background refresh is only for performance reasons. It has to do, and it does, refresh the in-core index for correctness reasons. Writing the result out to the on-disk file for _other_ processes that will use the index later is a performance hack, and may or may not happen (it is up to what repo_update_index_if_able() does). If we do not care about redundant refreshing these other processes need to do before they start, we do not have to write the resulting in-core index out to the disk. Writing the in-core index out for other processes has one downside, which is that for correctness it has to be done under lock. We cannot say "we tried to be nice to others by writing the index file out, but we just found out that there is a competing process trying to open/write the index so let's skip writing it and let them do their thing---they are responsible for refreshing the index for their own use". There needs some coordination, i.e. when we start writing the result of our refreshing the index for others, if other processes need to pay attention to that fact and wait for a while before reading the index, then they benefit because they do not have to refresh the index themselves. Even though our programs, by calling hold_lock_file_for_update(), know how to wait for a bit when this happens, not everybody cooperates.
On Mon, Mar 10, 2025 at 01:50:36PM -0700, Benjamin Woodruff wrote: > It might help if I start with some more context of why I wrote this > patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call > `git describe --dirty`. You can see that code here: > https://github.com/vercel/next.js/pull/76889/files > > We use `vergen-gitcl` to generate version identifiers for an on-disk > cache. This cache stores results of thousands of functions and has no > backwards compatibility. We want to invalidate it when *any* of the code > changes. `git-describe` felt like a good fit for that, as it gives us a > unique identifier that's still reasonably user-friendly. > > However, we discovered that we'd frequently end up with stale git > lockfiles. This appeared to be due some combination of IDE tools that > run the build in the background (i.e. the rust-analyzer LSP), behavior > that causes builds to sometimes get killed before completion, and the > fact that `git describe --dirty` takes a lock. Yeah, that is not quite the original use case that --no-optional-locks was designed for (i.e., simultaneous contention), but I think it is a reasonable application of the flag. > >>> git describe and git diff may update the index in the background for > >>> similar performance reasons to git-status. > >> > >> That is a wrong reasoning that is completely opposite, though. > >> > >> The commands at the Porcelain level, like "status" and "diff", > >> refresh the index for the CORRECTNESS purposes. > > > > Right, but "status" supports --no-optional-locks already. > > Does this mean the documentation in `git-status` is incorrect? It > implies that the background refresh is only for performance reasons. > That's where I got this idea from: > <https://git-scm.com/docs/git-status#_background_refresh> I think Junio gave an explanation here, so I won't repeat that. But I also think both of us may have been a bit confused about the changes your patches are making, because there's some subtlety. The important thing to keep in mind is that there are _two_ steps: refreshing the in-core index and writing the result out to the on-disk file. With --no-optional-locks we must continue to do the first step (for correctness), and skip the second step. So looking at your patch 1/2 for git-describe, it is doing the right thing: we still call refresh_index() always, and only skip the calls to repo_hold_locked_index() and repo_update_index_if_able(). But one thing that puzzles me is that we read and refresh the index first and only _then_ take a lock. Which seems wrong to me, as we could racily overwrite an intermediate write from somebody else that we never even saw (e.g., imagine you call "git add" at just the wrong moment). That is not a bug in your code, but an existing problem that I think made it harder to understand your change (and probably one we should fix regardless). Your patch 2/2 for git-diff is what I thought was actually wrong, but after digging further, I'm not so sure. In your patch we return early from refresh_index_quietly(), without actually refreshing the in-core index. So I _thought_ that meant we'd produce a wrong answer for something like this: $ touch git.c $ ./git --no-optional-locks diff where we should report "no changes", but would instead find the stat-dirty git.c (just like a plumbing "git diff-files" would). But that doesn't happen! That's because refresh_index_quietly() runs after the diff has completed anyway. The real magic is in diffcore_skip_stat_unmatch(), which processes individual stat-dirty entries and suppresses them (when there's no actual content change). So the call in refresh_index_quietly() really is just about updating what we're about to write out, and your patch is correct to bail from the whole function (if we are not writing it out, there is no purpose in refreshing at that point). So as far as I can tell the patches are doing the right thing. But I think the commit messages probably need to describe those subtleties and argue that the change is correct. Bonus points if a preparatory patch fixes the race in git-describe. ;) > It's also worth noting that libgit2 does not do this background refresh > by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`). > I think that makes sense for libgit2's typical use-cases, but it is a > divergence in behavior. Yes, I think that is probably a reasonable default for libgit2, where you'd expect everything to happen in a single process (that can share the in-core index). -Peff