Message ID: cover.1724080006.git.ps@pks.im
Series: reftable: fix reload with active iterators
Patrick Steinhardt <ps@pks.im> writes:

> Hi,
>
> as reported by Peff in [1], the reftable backend fails in t5616 in a
> racy manner. The cause of this is that git-maintenance(1) processes leak
> into subsequent tests and perform concurrent compaction of the stack.
> The reftable backend is expected to handle this gracefully, but doesn't.

I've not looked into the recent changes made to `git-maintenance(1)`,
but my understanding is that `git-maintenance(1)` can now detach itself
without relying on `git-gc(1)`'s detach.

> The issue can surface whenever reloading the reftable stack while an
> iterator is active. In case some of the tables got compacted, we will
> end up closing them and thus the iterator now tries to use those closed
> tables.

Makes sense till here.

> This patch series fixes that issue by starting to refcount the readers.

Not sure what you mean by 'refcount' here, though.

> Each iterator will bump the refcount, thus avoiding the problem. While
> at it, I also found a second issue where we segfault when reloading a
> table fails while reusing one of the table readers. In this scenario, we
> would end up releasing the reader of the stack itself, even though it
> would still be used by it.
>
> This patch series addresses those issues by making the reftable reader
> refcounted. The first 6 patches do some simplifications of code which is
> close. The remaining 4 ones introduce refcounting and wire it up as
> required.

Perhaps it becomes clearer as I read through the patches.

[snip]
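To make the refcounting idea from the cover letter a bit more concrete, here is a minimal, self-contained C sketch of the general pattern: the stack and every live iterator each hold one reference on a table reader, so a reload or compaction that drops the stack's reference cannot close a table that an iterator is still reading from. All names (`table_reader`, `reader_incref`, and so on) are illustrative stand-ins, not the actual reftable API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct table_reader {
	char *name;      /* stand-in for an open table file */
	size_t refcount; /* owners: the stack plus any live iterators */
};

static struct table_reader *reader_open(const char *name)
{
	struct table_reader *r = calloc(1, sizeof(*r));
	r->name = strdup(name);
	r->refcount = 1; /* the stack holds the first reference */
	return r;
}

static void reader_incref(struct table_reader *r)
{
	r->refcount++;
}

static void reader_decref(struct table_reader *r)
{
	if (--r->refcount)
		return;
	/* last owner is gone; only now is it safe to close the table */
	free(r->name);
	free(r);
}

struct table_iterator {
	struct table_reader *reader;
};

static void iterator_init(struct table_iterator *it, struct table_reader *r)
{
	reader_incref(r); /* the iterator becomes a co-owner */
	it->reader = r;
}

static void iterator_release(struct table_iterator *it)
{
	reader_decref(it->reader);
	it->reader = NULL;
}

int main(void)
{
	struct table_reader *r = reader_open("example-table.ref");
	struct table_iterator it;

	iterator_init(&it, r);

	/*
	 * A concurrent compaction reloads the stack and drops its
	 * reference, but the reader stays alive because the iterator
	 * still holds one.
	 */
	reader_decref(r);

	printf("still iterating over %s\n", it.reader->name);
	iterator_release(&it); /* last reference dropped: table closed here */
	return 0;
}
```

The same ownership rule would also cover the second issue from the cover letter: if a reload fails while reusing a reader, dropping only the failed reload's reference leaves the stack's own reference, and thus the reader itself, intact.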
On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote:

> This patch series fixes that issue by starting to refcount the readers.
> Each iterator will bump the refcount, thus avoiding the problem. While
> at it, I also found a second issue where we segfault when reloading a
> table fails while reusing one of the table readers. In this scenario, we
> would end up releasing the reader of the stack itself, even though it
> would still be used by it.

I gave a fairly cursory look over this, as I'm not all that familiar
with the reftable code. But everything looked pretty sensible to me.

I wondered how we might test this. It looks like you checked the
--stress output (which I confirmed seems fixed for me), along with a
synthetic test directly calling various reftable functions. Both seem
good to me.

I had hoped to be able to have a non-racy external test, where running
actual Git commands showed the segfault in a deterministic way. But I
couldn't come up with one. One trick we've used for "pausing" a reading
process is to run "cat-file --batch" from a fifo. You can ask it to do
some ref lookups, then while it waits for more input, update the
reftable, and then ask it to do more.

But I don't think that's sufficient here, as the race happens while we
are actually iterating. You'd really need some for_each_ref() callback
that blocks in some externally controllable way. Possibly you could do
something clever with partial-clone lazy fetches (where you stall the
fetch and then do ref updates in the middle), but that is getting pretty
convoluted.

So I think the tests you included seem like a good place to stop.

I did have a little trouble applying this for testing. I wanted to do it
on 'next', which has the maintenance changes to cause the race. But
merging it there ended up with a lot of conflicts with other reftable
topics (especially the tests). I was able to resolve them all, but you
might be able to make Junio's life easier by coordinating the topics a
bit.

-Peff
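Just to illustrate the kind of hook being described here, the sketch below shows a hypothetical ref-iteration callback that blocks in an externally controllable way: before visiting each ref it waits for a line on a control stream, so a driving test script could update or compact the reftable stack mid-iteration. This is not an existing Git API; the callback type, the names, and the use of stdin as the control stream are all made up for illustration.

```c
#include <stdio.h>

/* Simplified, hypothetical stand-in for a for_each_ref()-style callback. */
typedef int (*ref_cb)(const char *refname, void *cb_data);

static int paused_ref_cb(const char *refname, void *cb_data)
{
	FILE *control = cb_data;
	char buf[64];

	/* Block until the external driver writes a line, e.g. into a fifo. */
	if (!fgets(buf, sizeof(buf), control))
		return -1; /* the driver went away: stop iterating */

	printf("visited %s\n", refname);
	return 0;
}

int main(void)
{
	/* Stand-in for iterating a ref store; real code would walk refs. */
	const char *refs[] = { "refs/heads/main", "refs/tags/v1.0" };
	ref_cb cb = paused_ref_cb;
	FILE *control = stdin; /* a test would fopen() a control fifo instead */

	for (size_t i = 0; i < sizeof(refs) / sizeof(*refs); i++)
		if (cb(refs[i], control))
			break;
	return 0;
}
```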
On Thu, Aug 22, 2024 at 08:41:00AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote:
>
> > This patch series fixes that issue by starting to refcount the readers.
> > Each iterator will bump the refcount, thus avoiding the problem. While
> > at it, I also found a second issue where we segfault when reloading a
> > table fails while reusing one of the table readers. In this scenario, we
> > would end up releasing the reader of the stack itself, even though it
> > would still be used by it.
>
> I gave a fairly cursory look over this, as I'm not all that familiar
> with the reftable code. But everything looked pretty sensible to me.

Thanks for your review, appreciated!

> I wondered how we might test this. It looks like you checked the
> --stress output (which I confirmed seems fixed for me), along with a
> synthetic test directly calling various reftable functions. Both seem
> good to me.
>
> I had hoped to be able to have a non-racy external test, where running
> actual Git commands showed the segfault in a deterministic way. But I
> couldn't come up with one. One trick we've used for "pausing" a reading
> process is to run "cat-file --batch" from a fifo. You can ask it to do
> some ref lookups, then while it waits for more input, update the
> reftable, and then ask it to do more.
>
> But I don't think that's sufficient here, as the race happens while we
> are actually iterating. You'd really need some for_each_ref() callback
> that blocks in some externally controllable way. Possibly you could do
> something clever with partial-clone lazy fetches (where you stall the
> fetch and then do ref updates in the middle), but that is getting pretty
> convoluted.
>
> So I think the tests you included seem like a good place to stop.

Yeah. I think demonstrating the issues on the API level is okayish. It
would of course have been nice to also do it via our shell scripts, but
as you say, it feels rather fragile.

> I did have a little trouble applying this for testing. I wanted to do it
> on 'next', which has the maintenance changes to cause the race. But
> merging it there ended up with a lot of conflicts with other reftable
> topics (especially the tests). I was able to resolve them all, but you
> might be able to make Junio's life easier by coordinating the topics a
> bit.

I can certainly do that. I know that there are conflicts both with the
patch series dropping the generic tables and with the patch series that
moves the reftable unit tests into our own codebase. I did address the
former by basing it on top of that series, but didn't yet address the
latter.

I'm okay with waiting a bit until most of the conflicting topics land. I
guess most of them should be close to landing anyway. Alternatively, I
can also pull all of them in as dependencies.

Patrick
Jeff King <peff@peff.net> writes:

> But I don't think that's sufficient here, as the race happens while we
> are actually iterating. You'd really need some for_each_ref() callback
> that blocks in some externally controllable way. Possibly you could do
> something clever with partial-clone lazy fetches (where you stall the
> fetch and then do ref updates in the middle), but that is getting pretty
> convoluted.

;-)

> So I think the tests you included seem like a good place to stop.

Yeah, I agree with this assessment.

> I did have a little trouble applying this for testing. I wanted to do it
> on 'next', which has the maintenance changes to cause the race. But
> merging it there ended up with a lot of conflicts with other reftable
> topics (especially the tests). I was able to resolve them all, but you
> might be able to make Junio's life easier by coordinating the topics a
> bit.

I haven't tried merging the topic to 'next', jumping over other
topics in flight in 'seen'. Thanks for a heads up, but I am hoping
that my rerere database is fairly complete for this topic by now.

Thanks.
Patrick Steinhardt <ps@pks.im> writes:

> I can certainly do that. I know that there are conflicts both with the
> patch series dropping the generic tables and with the patch series that
> moves the reftable unit tests into our own codebase. I did address the
> former by basing it on top of that series, but didn't yet address the
> latter.
>
> I'm okay with waiting a bit until most of the conflicting topics land. I
> guess most of them should be close to landing anyway. Alternatively, I
> can also pull all of them in as dependencies.

As long as the resolution you see in 'seen' looks acceptable, I'd
rather you do nothing than do a rebase that _reduces_ the number of
conflicts (but does not completely eliminate them), when I already
have a working resolution recorded in my rerere database.

Of course, updates to polish the substance of the topic are very
much appreciated, and I'll redo the resolution in such a case if
needed. That's one of the things the maintainers do.

Thanks.
On Thu, Aug 22, 2024 at 08:15:55AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > I can certainly do that. I know that there are conflicts both with the
> > patch series dropping the generic tables and with the patch series that
> > moves the reftable unit tests into our own codebase. I did address the
> > former by basing it on top of that series, but didn't yet address the
> > latter.
> >
> > I'm okay with waiting a bit until most of the conflicting topics land. I
> > guess most of them should be close to landing anyway. Alternatively, I
> > can also pull all of them in as dependencies.
>
> As long as the resolution you see in 'seen' looks acceptable, I'd
> rather you do nothing than do a rebase that _reduces_ the number of
> conflicts (but does not completely eliminate them), when I already
> have a working resolution recorded in my rerere database.
>
> Of course, updates to polish the substance of the topic are very
> much appreciated, and I'll redo the resolution in such a case if
> needed. That's one of the things the maintainers do.

Yup, the merge in `seen` looks good to me. Thanks!

Patrick