Message ID: 20190318161502.7979-1-avarab@gmail.com (mailing list archive)
Series: gc docs: modernize and fix the documentation
On Mon, Mar 18, 2019 at 05:14:58PM +0100, Ævar Arnfjörð Bjarmason wrote:

> This series is unrelated to (and does not conflict with) my in-flight gc contention series (https://public-inbox.org/git/20190315155959.12390-1-avarab@gmail.com/), but the "git-gc" docs should be updated to discuss the core.filesRefLockTimeout option and how it impacts contention, see 8/8 in that series for context. I.e. "you may have contention, but core.filesRefLockTimeout can mitigate blah blah".
>
> I was going to do that, but then thought that we should also mention that on the server-side we mitigate most/all of the contention via the quarantine, see "QUARANTINE ENVIRONMENT" in git-receive-pack(1). I.e. we:
>
> 1. Get the temp pack
> 2. OK it (fsck, hooks etc.)
> 3. Move *complete* previously temp packs over
> 4. Update the refs
>
> I.e. we are immune from the "concurrently with another process" race, but of course something concurrently updating the "server" repo without a quarantine environment may be subject to that race.
>
> The only problem is that the last couple of paragraphs may be wrong. That's just my understanding from a brief reading of 722ff7f876c ("receive-pack: quarantine objects until pre-receive accepts", 2016-10-03) so I didn't want to include that in this series. Peff (or others), any comments?

I don't think the quarantine stuff should impact contention at all. It's only quarantining the objects, which are the least contentious part of Git (because object content is idempotent, so we don't do any locking there, and with two racing processes, one will just "win").

-Peff
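[The "one will just win" point is easy to see in isolation: writing the same content twice maps to the same object id and the same loose-object path, so two racing writers cannot disagree. A minimal sketch, using a throwaway repo:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
echo content >f
id1=$(git hash-object -w f)   # first "writer" creates the loose object
id2=$(git hash-object -w f)   # a racing second write maps to the same path, same id
test "$id1" = "$id2" && echo "same object either way: $id1"
```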
On Mon, Mar 18 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 05:14:58PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This series is unrelated to (and does not conflict with) my in-flight gc contention series (https://public-inbox.org/git/20190315155959.12390-1-avarab@gmail.com/), but the "git-gc" docs should be updated to discuss the core.filesRefLockTimeout option and how it impacts contention, see 8/8 in that series for context. I.e. "you may have contention, but core.filesRefLockTimeout can mitigate blah blah".
>>
>> I was going to do that, but then thought that we should also mention that on the server-side we mitigate most/all of the contention via the quarantine, see "QUARANTINE ENVIRONMENT" in git-receive-pack(1). I.e. we:
>>
>> 1. Get the temp pack
>> 2. OK it (fsck, hooks etc.)
>> 3. Move *complete* previously temp packs over
>> 4. Update the refs
>>
>> I.e. we are immune from the "concurrently with another process" race, but of course something concurrently updating the "server" repo without a quarantine environment may be subject to that race.
>>
>> The only problem is that the last couple of paragraphs may be wrong. That's just my understanding from a brief reading of 722ff7f876c ("receive-pack: quarantine objects until pre-receive accepts", 2016-10-03) so I didn't want to include that in this series. Peff (or others), any comments?
>
> I don't think the quarantine stuff should impact contention at all. It's only quarantining the objects, which are the least contentious part of Git (because object content is idempotent, so we don't do any locking there, and with two racing processes, one will just "win").

Without the quarantine, isn't there the race that the NOTES section talks about (unless I've misread it)? I.e. we have some loose object "ABCD" not referred to by anything for the last 2 weeks; as we're gc-ing, a ref update comes in that makes it referenced again. We then delete "ABCD" (not used!) at the same time the ref update happens, and get corruption.

Whereas the quarantine might work around that, since the client will have sent ABCD with no reference pointing to it to the server in the temp pack, which we then rename in-place and then update the ref, so we don't care if "ABCD" goes away.

Unless that interacts racily with receive.unpackLimit, but then I have no idea what that section is trying to say...

Also, surely the part where "NOTES" says something to the effect of "you are subject to races unless gc.auto=0" is wrong. To the extent that there are races it won't matter whether you invoke "git gc" or "git gc --auto"; it's the concurrency that matters. So if there are still races we should be saying the repo needs to be locked for writes for the duration of the "gc".
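[The race described above can be forced by hand in a throwaway repo. Backdating the loose object stands in for "unreferenced for 2 weeks", and running the steps sequentially stands in for the unlucky concurrent timing; this is a sketch of the mechanism, not the actual concurrent failure. The backdating assumes GNU touch:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
blob=$(git hash-object -w /dev/null)        # "ABCD": a loose, unreferenced object
find .git/objects -type f -exec touch -d '3 weeks ago' {} +   # make it "ancient"
git cat-file -e "$blob"                     # a ref updater checks: yes, we have it
git prune --expire=2.weeks.ago              # gc's prune runs "concurrently"
git cat-file -e "$blob" 2>/dev/null || echo "ABCD is gone; a ref to it would now dangle"
```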
On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > I don't think the quarantine stuff should impact contention at all. It's only quarantining the objects, which are the least contentious part of Git (because object content is idempotent, so we don't do any locking there, and with two racing processes, one will just "win").
>
> Without the quarantine, isn't there the race that the NOTES section talks about (unless I've misread it)?

Ah, OK, I wasn't quite sure which documentation you were talking about. I see the discussion now in the "NOTES" section of git-gc(1).

> I.e. we have some loose object "ABCD" not referred to by anything for the last 2 weeks; as we're gc-ing, a ref update comes in that makes it referenced again. We then delete "ABCD" (not used!) at the same time the ref update happens, and get corruption.
>
> Whereas the quarantine might work around that, since the client will have sent ABCD with no reference pointing to it to the server in the temp pack, which we then rename in-place and then update the ref, so we don't care if "ABCD" goes away.

tl;dr I don't think quarantine impacts this, but if you really want gory details, read on.

This is a problem with or without the quarantine. It's fundamentally a race: there is no atomicity between one process saying "is anybody using X? If not, we can delete it" and some other process saying "I'd like to use X".

Pushes are actually better off than most operations, because we only advertise what's reachable, and the client is expected to send everything else. So with just a normal update-ref call, we could race like this:

1. ABCD is ancient.

2. Process 1 (update-ref) wants to reference ABCD. It sees that we have it.

3. Process 2 (gc/prune) sees that nobody references it. It deletes ABCD.

4. Process 1 writes out the reference.

That doesn't happen with a push, because the server never would have told the client that it has ABCD in the first place (so process 1 here is the client). That is true with or without quarantine.

But pushes aren't foolproof either. You said "loose object ABCD not referred to by anything for the last 2 weeks". But that's not exactly how it works. It's "object with an mtime of more than 2 weeks which is not currently referenced". So imagine a sequence like:

1. ABCD is ancient.

2. refs/heads/foo points to ABCD.

3. Server receive-pack advertises foo pointing to ABCD.

4. Simultaneous process on the server deletes refs/heads/foo (or perhaps somebody force-pushes over it).

5. Client prepares and sends pack without ABCD.

6. Server receive-pack checks that yes, we still have ABCD (i.e., the usual connectivity check).

7. Server gc drops ABCD, which is now unreachable (reflogs can help here, if you've enabled them; but we do delete reflogs when the branch is deleted).

8. Server receive-pack writes corrupt repo mentioning ABCD.

That's a lot more steps, though they might not be as implausible as you think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a single push; that's actually a delete and an update, which is all you need to race with a simultaneous gc).

I have no idea how often this happens in practice. My subjective recollection is that most of the race corruptions I've seen were from local operations on the server. E.g., we compute a tentative merge for somebody's pull request which shares objects with an older tentative merge. They click the "merge" button and we reference that commit, which is recent, but unbeknownst to us, while we were creating our new tentative merge, a "gc" was deleting the old one.

We're sometimes saved by the "transitive freshness" rules in d3038d22f9 (prune: keep objects reachable from recent objects, 2014-10-15). But they're far from perfect:

- some operations (like the push rename example) aren't writing new objects, so the ref write _is_ the moment that gc would find out something is reachable

- the "is it reachable?" and "no, then delete it" steps aren't atomic. Unless you want a whole-repo stop-the-world lock, somebody can reference the object in between. And since it may take many seconds to compute reachability, stop-the-world is not great.

I think there are probably ways to make it better. Perhaps some kind of lockless delete-but-be-able-to-rollback scheme (but keep in mind this has to be implemented on top of POSIX filesystem semantics). Or even just a "compute reachability, mark for deletion, and then hold a stop-the-world lock briefly to double-check that our reachability is still up to date".

At least those seem plausible to me. I've never worked out the details, and our solution was to just stop deleting objects during routine maintenance (using "repack -adk"). We do still occasionally prune manually (e.g., when somebody writes to support to remove a confidential mistake).

Anyway, that was more than you probably wanted to know. The short of it is that I don't think quarantines help (they may even make things worse by slightly increasing the length of the race window, though in practice I doubt it).

> Unless that interacts racily with receive.unpackLimit, but then I have no idea what that section is trying to say...

No, I don't think unpackLimit really affects it much either way.

> Also, surely the part where "NOTES" says something to the effect of "you are subject to races unless gc.auto=0" is wrong. To the extent that there are races it won't matter whether you invoke "git gc" or "git gc --auto"; it's the concurrency that matters. So if there are still races we should be saying the repo needs to be locked for writes for the duration of the "gc".

Correct. It's the very act of pruning that is problematic. I think the point is that if you are manually running "git gc", you'd presumably do it at a time when the repository is not otherwise active.

-Peff
On Tue, Mar 19 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> [...]
>
> But pushes aren't foolproof either. You said "loose object ABCD not referred to by anything for the last 2 weeks". But that's not exactly how it works. It's "object with an mtime of more than 2 weeks which is not currently referenced".
>
> [...]
>
> Correct. It's the very act of pruning that is problematic. I think the point is that if you are manually running "git gc", you'd presumably do it at a time when the repository is not otherwise active.

Thanks for that E-Mail. I'm hoping to get around to another set of "gc doc" updates and will hopefully be able to steal liberally from it.

Maybe there's some case I haven't thought of that makes this stupid, but I wonder if something like a "gc quarantine" might be a fix for both of the issues you noted above.

I.e. it seems to me that the main issue is that we conflate "mtime 2 weeks old because it's been unreferenced for 2 weeks" vs. "mtime 2 weeks old because we haven't gotten around to a 'gc'".

So in such a "gc quarantine" mode, when we discover an object/pack that's unreachable/purely made up of unreachable objects, we'd move the relevant loose object/"loose" pack to such a quarantine, which would just be .git/unreferenced-objects/{??,pack}/ or whatever.

AFAICT both cases you mentioned above would be mitigated by this, because we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old" vs. "hasn't been referenced in 2 weeks".

I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1].

But of course without an accompanying quarantine as described above, doing that would just make this race condition worse.

1. https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
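[The proposed layout is hypothetical (no `.git/unreferenced-objects/` directory exists in git), but the gc side of the mechanics is easy to sketch: gc would `mv` into the quarantine rather than `unlink()`, and a later gc that finds the object re-referenced would move it back and freshen it:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
oid=$(git hash-object -w /dev/null)            # an unreachable loose object
fan=${oid%"${oid#??}"}; rest=${oid#??}         # split oid into fan-out dir + filename
mkdir -p ".git/unreferenced-objects/$fan"      # invented quarantine path from the proposal
mv ".git/objects/$fan/$rest" ".git/unreferenced-objects/$fan/$rest"   # "gc" quarantines it
git cat-file -e "$oid" 2>/dev/null || echo "hidden from normal lookups"
# a later gc notices the object became referenced again and restores it:
mv ".git/unreferenced-objects/$fan/$rest" ".git/objects/$fan/$rest"
touch ".git/objects/$fan/$rest"                # freshen: restart the expiry clock
git cat-file -e "$oid" && echo "restored"
```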
On Mon, May 06, 2019 at 11:44:06AM +0200, Ævar Arnfjörð Bjarmason wrote:

> Maybe there's some case I haven't thought of that makes this stupid, but I wonder if something like a "gc quarantine" might be a fix for both of the issues you noted above.
>
> I.e. it seems to me that the main issue is that we conflate "mtime 2 weeks old because it's been unreferenced for 2 weeks" vs. "mtime 2 weeks old because we haven't gotten around to a 'gc'".
>
> So in such a "gc quarantine" mode, when we discover an object/pack that's unreachable/purely made up of unreachable objects, we'd move the relevant loose object/"loose" pack to such a quarantine, which would just be .git/unreferenced-objects/{??,pack}/ or whatever.
>
> AFAICT both cases you mentioned above would be mitigated by this, because we'd no longer conflate "haven't gc'd this yet and it's 2 weeks old" vs. "hasn't been referenced in 2 weeks".

Michael Haggerty and I have (off-list) discussed variations on that, but it opens up a lot of new issues. Moving something into quarantine isn't atomic. So you've still corrupted the repo, but now it's recoverable by reaching into the quarantine. Who notices that the repo is corrupt, and how? When do we expire objects from quarantine?

I think the heart of the issue is really the lack of atomicity in the operations. You need some way to mark "I am using this now" in a way that cannot race with "looks like nobody is using this, so I'll delete it".

And ideally without traversing large bits of the graph on the writing side, and without requiring any stop-the-world locks during pruning.

> I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1].
>
> But of course without an accompanying quarantine as described above, doing that would just make this race condition worse.

I'm not sure it really makes it worse. The pack would have the same mtime as the loose objects would.

-Peff
On Tue, May 07 2019, Jeff King wrote:

> On Mon, May 06, 2019 at 11:44:06AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> [...]
>
> Michael Haggerty and I have (off-list) discussed variations on that, but it opens up a lot of new issues. Moving something into quarantine isn't atomic. So you've still corrupted the repo, but now it's recoverable by reaching into the quarantine. Who notices that the repo is corrupt, and how? When do we expire objects from quarantine?
>
> I think the heart of the issue is really the lack of atomicity in the operations. You need some way to mark "I am using this now" in a way that cannot race with "looks like nobody is using this, so I'll delete it".
>
> And ideally without traversing large bits of the graph on the writing side, and without requiring any stop-the-world locks during pruning.

I was thinking (but realize now that I didn't articulate) that the "gc quarantine" would be another "alternate" implementing a copy-on-write "lockless delete-but-be-able-to-rollback scheme", as you put it.

So "gc" would decide (racily) what's unreachable, but instead of unlink()-ing it would "mv" the loose object/pack into the "unreferenced-objects" quarantine.

Then in your example #1, "wants to reference ABCD. It sees that we have it." would race on the "other side". I.e. maybe ABCD was *just* moved to the quarantine. But in that case we'd move it back, which would bump the mtime and thus make it ineligible for expiry.

Similarly for example #2, the "ABCD is ancient" would be moved, but then promptly moved back on the next GC as we notice ABCD has been re-referenced.

Maybe it's just the same problem all over again, but I don't see how yet.

Aside from that, I have a hunch that while it's theoretically true that you can at any time re-reference some loose blob/tree/commit again, the likelihood of that in practice goes down as it ages, since a user is likely to e.g. re-push or rename some branch they pushed last week, not last year.

Hence the mention of creating "unreferenced packs" with some new --keep-unreachable mode. Since we'd pack those together they wouldn't create the "ref explosion" problem we have with loose objects, and thus you could afford to keep them longer (even though the deltas would be shittier). Whereas now you either need --keep-unreachable (keep stuff forever) or a more aggressive gc.pruneExpire if you'd like to not end up with a ginormous amount of loose objects.

>> I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1].
>>
>> But of course without an accompanying quarantine as described above, doing that would just make this race condition worse.
>
> I'm not sure it really makes it worse. The pack would have the same mtime as the loose objects would.
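[The "write the unreferenced objects into their *own* pack" idea can be approximated today by feeding `git fsck --unreachable` output to `git pack-objects`. A sketch only: the pack base name is invented, and nothing here implements the accompanying quarantine or expiry:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git hash-object -w /dev/null >/dev/null   # an unreachable loose object
git fsck --unreachable 2>/dev/null |
    awk '/^unreachable/ { print $3 }' |   # fsck prints "unreachable <type> <oid>"
    git pack-objects -q .git/objects/pack/pack-unreferenced >/dev/null
ls .git/objects/pack/                     # a pack holding only the unreachable objects
```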
On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Michael Haggerty and I have (off-list) discussed variations on that, but it opens up a lot of new issues. [...]
> >
> > I think the heart of the issue is really the lack of atomicity in the operations. You need some way to mark "I am using this now" in a way that cannot race with "looks like nobody is using this, so I'll delete it".
> >
> > And ideally without traversing large bits of the graph on the writing side, and without requiring any stop-the-world locks during pruning.
>
> I was thinking (but realize now that I didn't articulate) that the "gc quarantine" would be another "alternate" implementing a copy-on-write "lockless delete-but-be-able-to-rollback scheme", as you put it.
>
> So "gc" would decide (racily) what's unreachable, but instead of unlink()-ing it would "mv" the loose object/pack into the "unreferenced-objects" quarantine.
>
> Then in your example #1, "wants to reference ABCD. It sees that we have it." would race on the "other side". I.e. maybe ABCD was *just* moved to the quarantine. But in that case we'd move it back, which would bump the mtime and thus make it ineligible for expiry.

I think this is basically the same as the current freshening scheme, though. In general, you can replace "move it back" with "update its mtime". Neither is atomic with respect to other operations.

It does seem like the twist is that "gc" is supposed to do the "move it back" step (and it's also the thing expiring, if we assume that there's only one gc running at a time). But again, how do we know somebody isn't referencing it _right now_ while we're deciding whether to move it back?

I think there are lots of solutions you can come up with if you have atomicity. But fundamentally it isn't there in the way we handle updates now. You could imagine something like a shared/unique lock where anybody updating a ref takes the "shared" side, and multiple entities can hold it at once. But somebody pruning takes the "unique" side and excludes everybody else, stopping ref updates during the prune (which you'd obviously want to do in a way that holds the lock for as short a time as possible; say, optimistically check reachability without the lock, then take the lock and check to see if anything has changed).

(By shared/unique I basically mean a reader/writer lock, but I didn't want to use those terms in the paragraph since both holders are writing.)

It is tricky to find out when to hold the shared lock, though. It's _not_ just a ref write, for example. When you accept a push, you'd want to hold the lock while you are checking that you have all of the necessary objects to write the ref. For something like "git commit" it's even harder, because we implicitly rely on state created by commands run over the course of hours or days (e.g., "git add" to put a blob in the index and maybe create the tree via cache-tree, then a commit to reference it, and finally the ref write; each step adds state which the next step relies on).

> Aside from that, I have a hunch that while it's theoretically true that you can at any time re-reference some loose blob/tree/commit again, the likelihood of that in practice goes down as it ages, since a user is likely to e.g. re-push or rename some branch they pushed last week, not last year.
>
> Hence the mention of creating "unreferenced packs" with some new --keep-unreachable mode. Since we'd pack those together they wouldn't create the "ref explosion" problem we have with loose objects, and thus you could afford to keep them longer (even though the deltas would be shittier).

Yeah, that may make it less likely (and we'd like those unreferenced packs for other reasons anyway, so it's certainly worth a shot). But the whole race is kind of unlikely in the first place. If you have enough repositories, you see it eventually. ;)

-Peff
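[The shared/unique idea maps directly onto flock(1)'s shared vs exclusive modes. A hedged sketch only: the `maintenance.lock` file and the whole scheme are hypothetical, not something git implements:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
: >.git/maintenance.lock                       # invented lock file for the sketch
# a ref writer takes the shared side; many writers may hold it concurrently:
flock --shared .git/maintenance.lock git update-ref refs/heads/topic HEAD
# pruning takes the exclusive side, waiting for and blocking all ref writers:
flock --exclusive .git/maintenance.lock git prune --expire=2.weeks.ago
git rev-parse -q --verify refs/heads/topic >/dev/null && echo "ref landed safely"
```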
On Wed, Jul 31 2019, Jeff King wrote:

> On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> [...]
>
> I think this is basically the same as the current freshening scheme, though. In general, you can replace "move it back" with "update its mtime". Neither is atomic with respect to other operations.
>
> It does seem like the twist is that "gc" is supposed to do the "move it back" step (and it's also the thing expiring, if we assume that there's only one gc running at a time). But again, how do we know somebody isn't referencing it _right now_ while we're deciding whether to move it back?

The twist is to create a "quarantine" area of the object store you can't read any objects from without copying them to the "main" area (git-gc itself would be an exception).

Hence steps #2 and #6, respectively, in your examples in https://public-inbox.org/git/20190319001829.GL29661@sigill.intra.peff.net/ would have update-ref/receive-pack fail to find "ABCD" in the "main" store due to the exact same race we have now with mtimes & gc, then fall back to the "quarantine" and (this is the important part) immediately copy it back to the "main" store.

IOW yes, you'd have the exact same race you have now with the initial move to the quarantine. You'd have ref updates & gc racing, and "unreachable" things would be moved to the quarantine when really they just became reachable again. The difference is that instead of unlinking that unreachable object we move it to the quarantine, so the next "gc" (which is what would delete it) would notice it's reachable and move it to the "main" area before proceeding, *and* anything that "faults" back to reading the "quarantine" would do the same.

> I think there are lots of solutions you can come up with if you have atomicity. But fundamentally it isn't there in the way we handle updates now. You could imagine something like a shared/unique lock where anybody updating a ref takes the "shared" side, and multiple entities can hold it at once. But somebody pruning takes the "unique" side and excludes everybody else, stopping ref updates during the prune (which you'd obviously want to do in a way that holds the lock for as short a time as possible; say, optimistically check reachability without the lock, then take the lock and check to see if anything has changed).
>
> (By shared/unique I basically mean a reader/writer lock, but I didn't want to use those terms in the paragraph since both holders are writing.)
>
> It is tricky to find out when to hold the shared lock, though. It's _not_ just a ref write, for example. When you accept a push, you'd want to hold the lock while you are checking that you have all of the necessary objects to write the ref. For something like "git commit" it's even harder, because we implicitly rely on state created by commands run over the course of hours or days (e.g., "git add" to put a blob in the index and maybe create the tree via cache-tree, then a commit to reference it, and finally the ref write; each step adds state which the next step relies on).

I don't think this sort of approach would require any global locks, but it would be vulnerable to operations that take longer than the "main->quarantine->unlink()" cycle takes. E.g. a "hash-object" that takes a month before the subsequent "write-tree" etc.

All of the above written with the previously stated "I may be missing something" caveat etc. :)
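[The fault-back read described in this exchange could look something like the following. Purely hypothetical: the `.git/unreferenced-objects/` directory and the `lookup_with_faultback` helper are invented names for the proposed scheme, not anything git provides:]

```shell
set -e
cd "$(mktemp -d)"
git init -q .
oid=$(git hash-object -w /dev/null)
fan=${oid%"${oid#??}"}; rest=${oid#??}
mkdir -p ".git/unreferenced-objects/$fan"
mv ".git/objects/$fan/$rest" ".git/unreferenced-objects/$fan/$rest"   # gc quarantined it

# a reader that misses in the main store faults the object back in;
# touching it restarts its expiry clock, making it ineligible for deletion:
lookup_with_faultback() {
    f=${1%"${1#??}"}; r=${1#??}
    if [ -f ".git/objects/$f/$r" ]; then
        return 0                                   # fast path: main store hit
    fi
    if [ -f ".git/unreferenced-objects/$f/$r" ]; then
        mkdir -p ".git/objects/$f"
        mv ".git/unreferenced-objects/$f/$r" ".git/objects/$f/$r"
        touch ".git/objects/$f/$r"                 # freshen on the way back
        return 0
    fi
    return 1
}

lookup_with_faultback "$oid"
git cat-file -e "$oid" && echo "faulted back into the main store"
```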