[0/4] gc docs: modernize and fix the documentation

Message ID: 20190318161502.7979-1-avarab@gmail.com
Series: gc docs: modernize and fix the documentation

Message

Ævar Arnfjörð Bjarmason March 18, 2019, 4:14 p.m. UTC
I've been annoyed by the "gc" docs for a while. This fixes most of the
issues I've noticed, and also removes the duplication between the "gc"
variables in git-config(1) and in git-gc(1), which was made possible
by Duy's Documentation/config/* series.

This should make the "gc" docs more awesome, and thanks to the
de-duplication it results in a net deletion of lines. Yay.

The only thing I was on the fence about was removing the 'gc
"refs/remotes/*' config example, but I think the remaining docs
explain it well enough. It can be added back if someone insists...

Now, I was originally going to have 5 patches in this series by also
modernizing the "NOTES" section, but that patch didn't make it in.

This series is unrelated to (and does not conflict with) my in-flight gc
contention series
(https://public-inbox.org/git/20190315155959.12390-1-avarab@gmail.com/),
but the "git-gc" docs should be updated to discuss the
core.filesRefLockTimeout option and how it impacts contention, see 8/8
in that series for context. I.e. "you may have contention, but
core.filesRefLockTimeout can mitigate blah blah".
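
To make that concrete, the kind of thing I'd document is roughly (only a
sketch, and the value here is made up; core.filesRefLockTimeout is in
milliseconds, with -1 meaning "retry indefinitely"):

    # retry ref locks for up to a second instead of the 100ms default
    git config core.filesRefLockTimeout 1000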

I was going to do that, but then thought that we should also mention
that on the server-side we mitigate most/all of the contention via the
quarantine, see "QUARANTINE ENVIRONMENT" in
git-receive-pack(1). I.e. we:

 1. Get the temp pack
 2. OK it (fsck, hooks etc.)
 3. Move *complete* previously temp packs over
 4. Update the refs

I.e. we are immune from the "concurrently with another process" race,
but of course something concurrently updating the "server" repo
without a quarantine environment may be subject to that race.
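
For reference, this is what the quarantine looks like from a server-side
pre-receive hook, going by my reading of the git-receive-pack(1) docs (a
sketch, not something from this series):

    #!/bin/sh
    # .git/hooks/pre-receive: the pushed objects exist only in the
    # quarantine at this point, which git exposes via the environment
    echo "quarantine object dir: $GIT_OBJECT_DIRECTORY" >&2
    echo "real object dir(s):    $GIT_ALTERNATE_OBJECT_DIRECTORIES" >&2
    # exiting non-zero here makes receive-pack drop the whole quarantine
    # instead of leaving the new objects behind
    exit 0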

The only problem is that the last couple of paragraphs may be
wrong. That's just my understanding from a brief reading of
722ff7f876c ("receive-pack: quarantine objects until pre-receive
accepts", 2016-10-03) so I didn't want to include that in this
series. Peff (or others), any comments?

Ævar Arnfjörð Bjarmason (4):
  gc docs: modernize the advice for manually running "gc"
  gc docs: include the "gc.*" section from "config" in "gc"
  gc docs: de-duplicate "OPTIONS" and "CONFIGURATION"
  gc docs: downplay the usefulness of --aggressive

 Documentation/config/gc.txt |  39 ++++++++++--
 Documentation/git-gc.txt    | 120 +++++++++---------------------------
 2 files changed, 65 insertions(+), 94 deletions(-)

Comments

Jeff King March 18, 2019, 9:51 p.m. UTC | #1
On Mon, Mar 18, 2019 at 05:14:58PM +0100, Ævar Arnfjörð Bjarmason wrote:

> This series is unrelated to (and does not conflict with) my in-flight gc
> contention series
> (https://public-inbox.org/git/20190315155959.12390-1-avarab@gmail.com/),
> but the "git-gc" docs should be updated to discuss the
> core.filesRefLockTimeout option and how it impacts contention, see 8/8
> in that series for context. I.e. "you may have contention, but
> core.filesRefLockTimeout can mitigate blah blah".
> 
> I was going to do that, but then thought that we should also mention
> that on the server-side we mitigate most/all of the contention via the
> quarantine, see "QUARANTINE ENVIRONMENT" in
> git-receive-pack(1). I.e. we:
> 
>  1. Get the temp pack
>  2. OK it (fsck, hooks etc.)
>  3. Move *complete* previously temp packs over
>  4. Update the refs
> 
> I.e. we are immune from the "concurrently with another process" race,
> but of course something concurrently updating the "server" repo
> without a quarantine environment may be subject to that race.
> 
> The only problem is that the last couple of paragraphs may be
> wrong. That's just my understanding from a brief reading of
> 722ff7f876c ("receive-pack: quarantine objects until pre-receive
> accepts", 2016-10-03) so I didn't want to include that in this
> series. Peff (or others), any comments?

I don't think the quarantine stuff should impact contention at all. It's
only quarantining the objects, which are the least contentious part of
Git (because object content is idempotent, so we don't do any locking
there, and with two racing processes, one will just "win").
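
As a trivial illustration (made-up content; the point is just that loose
objects are content-addressed, so the second write is a no-op):

    echo 'some content' | git hash-object -w --stdin
    echo 'some content' | git hash-object -w --stdin  # same oid, nothing to lock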

-Peff
Ævar Arnfjörð Bjarmason March 18, 2019, 10:45 p.m. UTC | #2
On Mon, Mar 18 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 05:14:58PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This series is unrelated to (and does not conflict with) my in-flight gc
>> contention series
>> (https://public-inbox.org/git/20190315155959.12390-1-avarab@gmail.com/),
>> but the "git-gc" docs should be updated to discuss the
>> core.filesRefLockTimeout option and how it impacts contention, see 8/8
>> in that series for context. I.e. "you may have contention, but
>> core.filesRefLockTimeout can mitigate blah blah".
>>
>> I was going to do that, but then thought that we should also mention
>> that on the server-side we mitigate most/all of the contention via the
>> quarantine, see "QUARANTINE ENVIRONMENT" in
>> git-receive-pack(1). I.e. we:
>>
>>  1. Get the temp pack
>>  2. OK it (fsck, hooks etc.)
>>  3. Move *complete* previously temp packs over
>>  4. Update the refs
>>
>> I.e. we are immune from the "concurrently with another process" race,
>> but of course something concurrently updating the "server" repo
>> without a quarantine environment may be subject to that race.
>>
>> The only problem is that the last couple of paragraphs may be
>> wrong. That's just my understanding from a brief reading of
>> 722ff7f876c ("receive-pack: quarantine objects until pre-receive
>> accepts", 2016-10-03) so I didn't want to include that in this
>> series. Peff (or others), any comments?
>
> I don't think the quarantine stuff should impact contention at all. It's
> only quarantining the objects, which are the least contentious part of
> Git (because object content is idempotent, so we don't do any locking
> there, and with two racing processes, one will just "win").

Without the quarantine, isn't there the race that the NOTES section
talks about (unless I've misread it).

I.e. we have some loose object "ABCD" not referred to by anything for
the last 2 weeks; as we're gc-ing, a ref update comes in that makes it
referenced again. We then delete "ABCD" (not used!) at the same time the
ref update happens, and get corruption.

Whereas the quarantine might work around this, since the client will have sent
ABCD with no reference pointing to it to the server in the temp pack,
which we then rename in-place and then update the ref, so we don't care
if "ABCD" goes away.

Unless that interacts racily with the receive.unpackLimit, but then I
have no idea what that section is trying to say...

Also, surely the part where "NOTES" says something to the effect of "you
are subject to races unless gc.auto=0" is wrong. To the extent that
there are races it won't matter whether you invoke "git gc" or "git gc
--auto"; it's the concurrency that matters. So if there are still races we
should be saying the repo needs to be locked for writes for the duration
of the "gc".
Jeff King March 19, 2019, 12:18 a.m. UTC | #3
On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > I don't think the quarantine stuff should impact contention at all. It's
> > only quarantining the objects, which are the least contentious part of
> > Git (because object content is idempotent, so we don't do any locking
> > there, and with two racing processes, one will just "win").
> 
> Without the quarantine, isn't there the race that the NOTES section
> talks about (unless I've misread it).

Ah, OK, I wasn't quite sure which documentation you were talking about.
I see the discussion now in the "NOTES" section of git-gc(1).

> I.e. we have some loose object "ABCD" not referred to by anything for
> the last 2 weeks; as we're gc-ing, a ref update comes in that makes it
> referenced again. We then delete "ABCD" (not used!) at the same time the
> ref update happens, and get corruption.
> 
> Whereas the quarantine might work around this, since the client will have sent
> ABCD with no reference pointing to it to the server in the temp pack,
> which we then rename in-place and then update the ref, so we don't care
> if "ABCD" goes away.

tl;dr I don't think quarantine impacts this, but if you really want gory
details, read on.

This is a problem with or without the quarantine. It's fundamentally a
race because there's no atomicity between us saying "is anybody using X?
If not, we can delete it" and some other process saying "I'd like to use X".

Pushes are actually better off than most operations, because we only
advertise what's reachable, and the client is expected to send
everything else. So with just a normal update-ref call, we could race
like this:

  1. ABCD is ancient.

  2. Process 1 (update-ref) wants to reference ABCD. It sees that we
     have it.

  3. Process 2 (gc/prune) sees that nobody references it. It deletes
     ABCD.

  4. Process 1 writes out the reference.

That doesn't happen with a push, because the server never would have
told the client that it has ABCD in the first place (so process 1 here
is the client). That is true with or without quarantine.

But pushes aren't foolproof either. You said "loose object ABCD not
referred to by anything for the last 2 weeks". But that's not exactly
how it works. It's "object with an mtime of more than 2 weeks which is
not currently referenced". So imagine a sequence like:

  1. ABCD is ancient.

  2. refs/heads/foo points to ABCD.

  3. Server receive-pack advertises foo pointing to ABCD.

  4. Simultaneous process on the server deletes refs/heads/foo (or
     perhaps somebody force-pushes over it).

  5. Client prepares and sends pack without ABCD.

  6. Server receive-pack checks that yes, we still have ABCD (i.e., the
     usual connectivity check).

  7. Server gc drops ABCD, which is now unreachable (reflogs can help
     here, if you've enabled them; but we do delete reflogs when the
     branch is deleted).

  8. Server receive-pack writes corrupt repo mentioning ABCD.

That's a lot more steps, though they might not be as implausible as you
think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a
single push; that's actually a delete and an update, which is all you
need to race with a simultaneous gc).
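
That rename case needs nothing exotic on the client side, by the way;
it's just one push doing a create and a delete, something like:

    # create "bar" from "foo" and delete "foo" in a single receive-pack
    # invocation; no new objects need to be sent at all
    git push origin refs/heads/foo:refs/heads/bar :refs/heads/foo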

I have no idea how often this happens in practice. My subjective
recollection is that most of the race corruptions I've seen were from
local operations on the server. E.g., we compute a tentative merge for
somebody's pull request which shares objects with an older tentative
merge. They click the "merge" button and we reference that commit, which
is recent, but unbeknownst to us, while we were creating our new
tentative merge, a "gc" was deleting the old one.

We're sometimes saved by the "transitive freshness" rules in d3038d22f9
(prune: keep objects reachable from recent objects, 2014-10-15).  But
they're far from perfect:

 - some operations (like the push rename example) aren't writing new
   objects, so the ref write _is_ the moment that gc would find out
   something is reachable

 - the "is it reachable?" and "no, then delete it" steps aren't atomic.
   Unless you want a whole-repo stop-the-world lock, somebody can
   reference the object in between. And since it may take many seconds
   to compute reachability, stop-the-world is not great.

I think there are probably ways to make it better. Perhaps some kind of
lockless delete-but-be-able-to-rollback scheme (but keep in mind this
has to be implemented on top of POSIX filesystem semantics). Or even
just a "compute reachability, mark for deletion, and then hold a
stop-the-world lock briefly to double-check that our reachability is
still up to date".

At least those seem plausible to me. I've never worked out the details,
and our solution was to just stop deleting objects during routine
maintenance (using "repack -adk"). We do still occasionally prune
manually (e.g., when somebody writes to support to remove a confidential
mistake).
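
In command/config terms that boils down to something like this (a sketch;
our actual maintenance tooling is more involved):

    # repack everything, keeping otherwise-unreachable objects in the new
    # pack instead of exploding them loose or deleting them
    git repack -a -d -k               # -k is --keep-unreachable
    # and make sure nothing prunes behind our back if "git gc" does run
    git config gc.pruneExpire never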

Anyway, that was more than you probably wanted to know. The short of it
is that I don't think quarantines help (they may even make things worse
by slightly increasing the length of the race window, though in practice
I doubt it).

> Unless that interacts racily with the receive.unpackLimit, but then I
> have no idea what that section is trying to say...

No, I don't think unpackLimit really affects it much either way.

> Also, surely the part where "NOTES" says something to the effect of "you
> are subject to races unless gc.auto=0" is wrong. To the extent that
> there are races it won't matter whether you invoke "git gc" or "git gc
> --auto"; it's the concurrency that matters. So if there are still races we
> should be saying the repo needs to be locked for writes for the duration
> of the "gc".

Correct. It's the very act of pruning that is problematic. I think the
point is that if you are manually running "git gc", you'd presumably do
it at a time when the repository is not otherwise active.
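
I.e. the workflow the NOTES section seems to assume is roughly (my
paraphrase, not something the docs spell out):

    # don't let "gc --auto" kick in behind other commands' backs...
    git config gc.auto 0
    # ...and instead run gc explicitly from cron/maintenance at a quiet time
    git gc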

-Peff
Ævar Arnfjörð Bjarmason May 6, 2019, 9:44 a.m. UTC | #4
On Tue, Mar 19 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > I don't think the quarantine stuff should impact contention at all. It's
>> > only quarantining the objects, which are the least contentious part of
>> > Git (because object content is idempotent, so we don't do any locking
>> > there, and with two racing processes, one will just "win").
>>
>> Without the quarantine, isn't there the race that the NOTES section
>> talks about (unless I've misread it).
>
> Ah, OK, I wasn't quite sure which documentation you were talking about.
> I see the discussion now in the "NOTES" section of git-gc(1).
>
>> I.e. we have some loose object "ABCD" not referred to by anything for
>> the last 2 weeks; as we're gc-ing, a ref update comes in that makes it
>> referenced again. We then delete "ABCD" (not used!) at the same time the
>> ref update happens, and get corruption.
>>
>> Whereas the quarantine might work around this, since the client will have sent
>> ABCD with no reference pointing to it to the server in the temp pack,
>> which we then rename in-place and then update the ref, so we don't care
>> if "ABCD" goes away.
>
> tl;dr I don't think quarantine impacts this, but if you really want gory
> details, read on.
>
> This is a problem with or without the quarantine. It's fundamentally a
> race because there's no atomicity between us saying "is anybody using X?
> If not, we can delete it" and some other process saying "I'd like to use X".
>
> Pushes are actually better off than most operations, because we only
> advertise what's reachable, and the client is expected to send
> everything else. So with just a normal update-ref call, we could race
> like this:
>
>   1. ABCD is ancient.
>
>   2. Process 1 (update-ref) wants to reference ABCD. It sees that we
>      have it.
>
>   3. Process 2 (gc/prune) sees that nobody references it. It deletes
>      ABCD.
>
>   4. Process 1 writes out the reference.
>
> That doesn't happen with a push, because the server never would have
> told the client that it has ABCD in the first place (so process 1 here
> is the client). That is true with or without quarantine.
>
> But pushes aren't foolproof either. You said "loose object ABCD not
> referred to by anything for the last 2 weeks". But that's not exactly
> how it works. It's "object with an mtime of more than 2 weeks which is
> not currently referenced". So imagine a sequence like:
>
>   1. ABCD is ancient.
>
>   2. refs/heads/foo points to ABCD.
>
>   3. Server receive-pack advertises foo pointing to ABCD.
>
>   4. Simultaneous process on the server deletes refs/heads/foo (or
>      perhaps somebody force-pushes over it).
>
>   5. Client prepares and sends pack without ABCD.
>
>   6. Server receive-pack checks that yes, we still have ABCD (i.e., the
>      usual connectivity check).
>
>   7. Server gc drops ABCD, which is now unreachable (reflogs can help
>      here, if you've enabled them; but we do delete reflogs when the
>      branch is deleted).
>
>   8. Server receive-pack writes corrupt repo mentioning ABCD.
>
> That's a lot more steps, though they might not be as implausible as you
> think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a
> single push; that's actually a delete and an update, which is all you
> need to race with a simultaneous gc).
>
> I have no idea how often this happens in practice. My subjective
> recollection is that most of the race corruptions I've seen were from
> local operations on the server. E.g., we compute a tentative merge for
> somebody's pull request which shares objects with an older tentative
> merge. They click the "merge" button and we reference that commit, which
> is recent, but unbeknownst to us, while we were creating our new
> tentative merge, a "gc" was deleting the old one.
>
> We're sometimes saved by the "transitive freshness" rules in d3038d22f9
> (prune: keep objects reachable from recent objects, 2014-10-15).  But
> they're far from perfect:
>
>  - some operations (like the push rename example) aren't writing new
>    objects, so the ref write _is_ the moment that gc would find out
>    something is reachable
>
>  - the "is it reachable?" and "no, then delete it" steps aren't atomic.
>    Unless you want a whole-repo stop-the-world lock, somebody can
>    reference the object in between. And since it may take many seconds
>    to compute reachability, stop-the-world is not great.
>
> I think there are probably ways to make it better. Perhaps some kind of
> lockless delete-but-be-able-to-rollback scheme (but keep in mind this
> has to be implemented on top of POSIX filesystem semantics). Or even
> just a "compute reachability, mark for deletion, and then hold a
> stop-the-world lock briefly to double-check that our reachability is
> still up to date".
>
> At least those seem plausible to me. I've never worked out the details,
> and our solution was to just stop deleting objects during routine
> maintenance (using "repack -adk"). We do still occasionally prune
> manually (e.g., when somebody writes to support to remove a confidential
> mistake).
>
> Anyway, that was more than you probably wanted to know. The short of it
> is that I don't think quarantines help (they may even make things worse
> by slightly increasing the length of the race window, though in practice
> I doubt it).
>
>> Unless that interacts racily with the receive.unpackLimit, but then I
>> have no idea what that section is trying to say...
>
> No, I don't think unpackLimit really affects it much either way.
>
>> Also, surely the part where "NOTES" says something to the effect of "you
>> are subject to races unless gc.auto=0" is wrong. To the extent that
>> there are races it won't matter whether you invoke "git gc" or "git gc
>> --auto"; it's the concurrency that matters. So if there are still races we
>> should be saying the repo needs to be locked for writes for the duration
>> of the "gc".
>
> Correct. It's the very act of pruning that is problematic. I think the
> point is that if you are manually running "git gc", you'd presumably do
> it at a time when the repository is not otherwise active.

Thanks for that E-Mail. I'm hoping to get around to another set of "gc
doc" updates and will hopefully be able to steal liberally from it.

Maybe there's some case I haven't thought of that makes this stupid, but
I wonder if something like a "gc quarantine" might be a fix for both of
the issues you noted above.

I.e. it seems to me that the main issue is that we conflate "mtime 2
weeks old because it's unreferenced for 2 weeks" v.s. "mtime 2 weeks old
because we haven't gotten around to a 'gc'".

So in such a "gc quarantine" mode when we discover an object/pack that's
unreachable/purely made up of unreachable objects we'd move the relevant
loose object/"loose" pack to such a quarantine, which would just be
.git/unreferenced-objects/{??,pack}/ or whatever.
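
On the gc side I'm picturing something like this (entirely hypothetical,
the directory name is made up, and it hand-waves away the freshness
details):

    # instead of unlink()-ing an object we decided is unreachable:
    obj=.git/objects/ab/cdef012345...          # placeholder oid
    mkdir -p .git/unreferenced-objects/ab
    mv "$obj" .git/unreferenced-objects/ab/    # likewise for whole
                                               # "unreachable only" packs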

AFAICT both cases you mentioned above would be mitigated by this because
we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old"
v.s. "hasn't been referenced in 2 weeks".

I started looking at this initially because I was wondering if the
--keep-unreachable mode you modified in e26a8c4721 ("repack: extend
--keep-unreachable to loose objects", 2016-06-13) could be made to write
out such "unreferenced" objects into their *own* pack, so we could
delete them all at once as a batch, and wouldn't create the "ref
explosions" mentioned in [1].

But of course without an accompanying quarantine described above doing
that would just make this race condition worse.

1. https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
Jeff King May 7, 2019, 7:51 a.m. UTC | #5
On Mon, May 06, 2019 at 11:44:06AM +0200, Ævar Arnfjörð Bjarmason wrote:

> Maybe there's some case I haven't thought of that makes this stupid, but
> I wonder if something like a "gc quarantine" might be a fix for both of
> the issues you noted above.
> 
> I.e. it seems to me that the main issue is that we conflate "mtime 2
> weeks old because it's unreferenced for 2 weeks" v.s. "mtime 2 weeks old
> because we haven't gotten around to a 'gc'".
> 
> So in such a "gc quarantine" mode when we discover an object/pack that's
> unreachable/purely made up of unreachable objects we'd move the relevant
> loose object/"loose" pack to such a quarantine, which would just be
> .git/unreferenced-objects/{??,pack}/ or whatever.
> 
> AFAICT both cases you mentioned above would be mitigated by this because
> we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old"
> v.s. "hasn't been referenced in 2 weeks".

Michael Haggerty and I have (off-list) discussed variations on that, but
it opens up a lot of new issues.  Moving something into quarantine isn't
atomic. So you've still corrupted the repo, but now it's recoverable by
reaching into the quarantine. Who notices that the repo is corrupt, and
how? When do we expire objects from quarantine?

I think the heart of the issue is really the lack of atomicity in the
operations. You need some way to mark "I am using this now" in a way
that cannot race with "looks like nobody is using this, so I'll delete
it".

And ideally without traversing large bits of the graph on the writing
side, and without requiring any stop-the-world locks during pruning.

> I started looking at this initially because I was wondering if the
> --keep-unreachable mode you modified in e26a8c4721 ("repack: extend
> --keep-unreachable to loose objects", 2016-06-13) could be made to write
> out such "unreferenced" objects into their *own* pack, so we could
> delete them all at once as a batch, and wouldn't create the "ref
> explosions" mentioned in [1].
> 
> But of course without an accompanying quarantine described above doing
> that would just make this race condition worse.

I'm not sure it really makes it worse. The pack would have the same
mtime as the loose objects would.

-Peff
Ævar Arnfjörð Bjarmason May 9, 2019, 11:20 p.m. UTC | #6
On Tue, May 07 2019, Jeff King wrote:

> On Mon, May 06, 2019 at 11:44:06AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> Maybe there's some case I haven't thought of that makes this stupid, but
>> I wonder if something like a "gc quarantine" might be a fix for both of
>> the issues you noted above.
>>
>> I.e. it seems to me that the main issue is that we conflate "mtime 2
>> weeks old because it's unreferenced for 2 weeks" v.s. "mtime 2 weeks old
>> because we haven't gotten around to a 'gc'".
>>
>> So in such a "gc quarantine" mode when we discover an object/pack that's
>> unreachable/purely made up of unreachable objects we'd move the relevant
>> loose object/"loose" pack to such a quarantine, which would just be
>> .git/unreferenced-objects/{??,pack}/ or whatever.
>>
>> AFAICT both cases you mentioned above would be mitigated by this because
>> we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old"
>> v.s. "hasn't been referenced in 2 weeks".
>
> Michael Haggerty and I have (off-list) discussed variations on that, but
> it opens up a lot of new issues.  Moving something into quarantine isn't
> atomic. So you've still corrupted the repo, but now it's recoverable by
> reaching into the quarantine. Who notices that the repo is corrupt, and
> how? When do we expire objects from quarantine?
>
> I think the heart of the issue is really the lack of atomicity in the
> operations. You need some way to mark "I am using this now" in a way
> that cannot race with "looks like nobody is using this, so I'll delete
> it".
>
> And ideally without traversing large bits of the graph on the writing
> side, and without requiring any stop-the-world locks during pruning.

I was thinking (but realize now that I didn't articulate) that the "gc
quarantine" would be another "alternate" implementing a copy-on-write
"lockless delete-but-be-able-to-rollback scheme" as you put it.

So "gc" would decide (racily) what's unreachable, but instead of
unlink()-ing it would "mv" the loose object/pack into the
"unreferenced-objects" quarantine.

Then in your example #1 "wants to reference ABCD. It sees that we have
it." would race on the "other side". I.e. maybe ABCD was *just* moved to
the quarantine. But in that case we'd move it back, which would bump the
mtime and thus make it ineligible for expiry.

Similarly for example #2, the "ABCD is ancient" would be moved, but then
promptly moved back on the next GC as we notice ABCD has been
re-referenced.

Maybe it's just the same problem all over again, but I don't see how
yet.

Aside from that, I have a hunch that while it's theoretically true that
you can at any time re-reference some loose blob/tree/commit again, the
likelihood of that in practice goes down as it ages, since a user is
likely to e.g. re-push or rename some branch they pushed last week, not
last year.

Hence the mention of creating "unreferenced packs" with some new
--keep-unreachable mode. Since we'd pack those together they wouldn't
create the "ref explosion" problem we have with the loose refs, and thus
you could afford to keep them longer (even though the deltas would be
shittier).

Whereas now you either need --keep-unreachable (keep stuff forever) or a
more aggressive gc.pruneExpire if you'd like to not end up with a
ginormous amount of loose objects.
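
I.e. today the choice is between the two extremes of something like:

    # keep unreachable stuff forever:
    git repack -a -d --keep-unreachable
    # or delete it sooner than the 2-week default, shrinking the pile of
    # loose objects but widening the exposure to the race:
    git config gc.pruneExpire 1.day.ago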

>> I started looking at this initially because I was wondering if the
>> --keep-unreachable mode you modified in e26a8c4721 ("repack: extend
>> --keep-unreachable to loose objects", 2016-06-13) could be made to write
>> out such "unreferenced" objects into their *own* pack, so we could
>> delete them all at once as a batch, and wouldn't create the "ref
>> explosions" mentioned in [1].
>>
>> But of course without an accompanying quarantine described above doing
>> that would just make this race condition worse.
>
> I'm not sure it really makes it worse. The pack would have the same
> mtime as the loose objects would.
Jeff King July 31, 2019, 4:26 a.m. UTC | #7
On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Michael Haggerty and I have (off-list) discussed variations on that, but
> > it opens up a lot of new issues.  Moving something into quarantine isn't
> > atomic. So you've still corrupted the repo, but now it's recoverable by
> > reaching into the quarantine. Who notices that the repo is corrupt, and
> > how? When do we expire objects from quarantine?
> >
> > I think the heart of the issue is really the lack of atomicity in the
> > operations. You need some way to mark "I am using this now" in a way
> > that cannot race with "looks like nobody is using this, so I'll delete
> > it".
> >
> > And ideally without traversing large bits of the graph on the writing
> > side, and without requiring any stop-the-world locks during pruning.
> 
> I was thinking (but realize now that I didn't articulate) that the "gc
> quarantine" would be another "alternate" implementing a copy-on-write
> "lockless delete-but-be-able-to-rollback scheme" as you put it.
> 
> So "gc" would decide (racily) what's unreachable, but instead of
> unlink()-ing it would "mv" the loose object/pack into the
> "unreferenced-objects" quarantine.
> 
> Then in your example #1 "wants to reference ABCD. It sees that we have
> it." would race on the "other side". I.e. maybe ABCD was *just* moved to
> the quarantine. But in that case we'd move it back, which would bump the
> mtime and thus make it ineligible for expiry.

I think this is basically the same as the current freshening scheme,
though. In general, you can replace "move it back" with "update its
mtime". Neither is atomic with respect to other operations.

It does seem like the twist is that "gc" is supposed to do the "move it
back" step (and it's also the thing expiring, if we assume that there's
only one gc running at a time). But again, how do we know somebody isn't
referencing it _right now_ while we're deciding whether to move it back?

I think there are lots of solutions you can come up with if you have
atomicity. But fundamentally it isn't there in the way we handle updates
now. You could imagine something like a shared/unique lock where anybody
updating a ref takes the "shared" side, and multiple entities can hold
it at once. But somebody pruning takes the "unique" side and excludes
everybody else, stopping ref updates during the prune (which you'd
obviously want to do in a way that you hold the lock for as short as
possible; say, optimistically check reachability without the lock, then
take the lock and check to see if anything has changed).

(By shared/unique I basically mean a reader/writer lock, but I didn't
want to use those terms in the paragraph since both holders are
writing).
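
To sketch it with flock(1) (purely illustrative: the lock path, $new_oid
and the "recheck" step are made up, and this isn't a proposal for how git
would actually do it):

    # anybody about to reference an object takes the shared side:
    flock --shared .git/prune.lock git update-ref refs/heads/foo "$new_oid"

    # the pruner computes reachability optimistically with no lock held,
    # then briefly takes the exclusive side to re-verify and delete:
    flock --exclusive .git/prune.lock \
        sh -c 'recheck_reachability && git prune --expire=2.weeks.ago'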

It is tricky to find out when to hold the shared lock, though. It's
_not_ just a ref write, for example. When you accept a push, you'd want
to hold the lock while you are checking that you have all of the
necessary objects to write the ref. For something like "git commit" it's
even harder, because we implicitly rely on state created by commands run
over the course of hours or days (e.g., "git add" to put a blob in the
index and maybe create the tree via cache-tree, then a commit to
reference it, and finally the ref write; each step adds state which the
next step relies on).

> Aside from that, I have a hunch that while it's theoretically true that
> you can at any time re-reference some loose blob/tree/commit again, the
> likelihood of that in practice goes down as it ages, since a user is
> likely to e.g. re-push or rename some branch they pushed last week, not
> last year.
>
> Hence the mention of creating "unreferenced packs" with some new
> --keep-unreachable mode. Since we'd pack those together they wouldn't
> create the "ref explosion" problem we have with the loose refs, and thus
> you could afford to keep them longer (even though the deltas would be
> shittier).

Yeah, that may make it less likely (and we'd like those unreferenced
packs for other reasons anyway, so it's certainly worth a shot). But the
whole race is kind of unlikely in the first place. If you have enough
repositories, you see it eventually. ;)

-Peff
Ævar Arnfjörð Bjarmason July 31, 2019, 10:12 a.m. UTC | #8
On Wed, Jul 31 2019, Jeff King wrote:

> On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Michael Haggerty and I have (off-list) discussed variations on that, but
>> > it opens up a lot of new issues.  Moving something into quarantine isn't
>> > atomic. So you've still corrupted the repo, but now it's recoverable by
>> > reaching into the quarantine. Who notices that the repo is corrupt, and
>> > how? When do we expire objects from quarantine?
>> >
>> > I think the heart of the issue is really the lack of atomicity in the
>> > operations. You need some way to mark "I am using this now" in a way
>> > that cannot race with "looks like nobody is using this, so I'll delete
>> > it".
>> >
>> > And ideally without traversing large bits of the graph on the writing
>> > side, and without requiring any stop-the-world locks during pruning.
>>
>> I was thinking (but realize now that I didn't articulate) that the "gc
>> quarantine" would be another "alternate" implementing a copy-on-write
>> "lockless delete-but-be-able-to-rollback scheme" as you put it.
>>
>> So "gc" would decide (racily) what's unreachable, but instead of
>> unlink()-ing it would "mv" the loose object/pack into the
>> "unreferenced-objects" quarantine.
>>
>> Then in your example #1 "wants to reference ABCD. It sees that we have
>> it." would race on the "other side". I.e. maybe ABCD was *just* moved to
>> the quarantine. But in that case we'd move it back, which would bump the
>> mtime and thus make it ineligible for expiry.
>
> I think this is basically the same as the current freshening scheme,
> though. In general, you can replace "move it back" with "update its
> mtime". Neither is atomic with respect to other operations.
>
> It does seem like the twist is that "gc" is supposed to do the "move it
> back" step (and it's also the thing expiring, if we assume that there's
> only one gc running at a time). But again, how do we know somebody isn't
> referencing it _right now_ while we're deciding whether to move it back?

The twist is to create a "quarantine" area of the object store you can't
read any objects from without copying them to the "main" area (git-gc
itself would be an exception).

Hence step #2 and #6, respectively, in your examples in
https://public-inbox.org/git/20190319001829.GL29661@sigill.intra.peff.net/
would have update-ref/receive-pack fail to find "ABCD" in the "main"
store due to the exact same race we have now with mtimes & gc, then fall
back to the "quarantine" and (this is the important part) immediately
copy it back to the "main" store.

IOW yes, you'd have the exact same race you have now with the initial
move to the quarantine. You'd have ref updates & gc racing and
"unreachable" things would be moved to the quarantine, but really the
just became reachable again.

The difference is that instead of unlinking that unreachable object we
move it to the quarantine, so the next "gc" (which is what would delete
it) would notice it's reachable and move it to the "main" area before
proceeding, *and* anything that "faults" back to reading the
"quarantine" would do the same.

> I think there are lots of solutions you can come up with if you have
> atomicity. But fundamentally it isn't there in the way we handle updates
> now. You could imagine something like a shared/unique lock where anybody
> updating a ref takes the "shared" side, and multiple entities can hold
> it at once. But somebody pruning takes the "unique" side and excludes
> everybody else, stopping ref updates during the prune (which you'd
> obviously want to do in a way that you hold the lock for as short as
> possible; say, optimistically check reachability without the lock, then
> take the lock and check to see if anything has changed).
>
> (By shared/unique I basically mean a reader/writer lock, but I didn't
> want to use those terms in the paragraph since both holders are
> writing).
>
> It is tricky to find out when to hold the shared lock, though. It's
> _not_ just a ref write, for example. When you accept a push, you'd want
> to hold the lock while you are checking that you have all of the
> necessary objects to write the ref. For something like "git commit" it's
> even harder, because we implicitly rely on state created by commands run
> over the course of hours or days (e.g., "git add" to put a blob in the
> index and maybe create the tree via cache-tree, then a commit to
> reference it, and finally the ref write; each step adds state which the
> next step relies on).

I don't think this sort of approach would require any global locks, but
it would be vulnerable to operations that take longer than the
"main->quarantine->unlink()" cycle takes. E.g. a "hash-object" that
takes a month before the subsequent "write-tree" etc.
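
Concretely I mean something like this (real commands, contrived scenario;
using mktree instead of the index just to keep it self-contained):

    blob=$(echo data | git hash-object -w --stdin)  # day 0: loose, unreferenced
    # ...a month passes; under the scheme above the blob may have been
    # quarantined and then expired, since nothing on disk references it...
    printf '100644 blob %s\tfile.txt\n' "$blob" | git mktree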

All of the above written with the previously stated "I may be missing
something" caveat etc. :)