[00/10] reftable: fix reload with active iterators

Message ID: cover.1724080006.git.ps@pks.im

Message

Patrick Steinhardt Aug. 19, 2024, 3:39 p.m. UTC
Hi,

as reported by Peff in [1], the reftable backend fails in t5616 in a
racy manner. The cause of this is that git-maintenance(1) processes leak
into subsequent tests and perform concurrent compaction of the stack.
The reftable backend is expected to handle this gracefully, but doesn't.

The issue can surface whenever the reftable stack is reloaded while an
iterator is active. If some of the tables have been compacted, we end up
closing them, and the iterator then tries to use those closed tables.

This patch series fixes that issue by starting to refcount the readers.
Each iterator will bump the refcount, thus avoiding the problem. While
at it, I also found a second issue where we segfault when reloading a
table fails while reusing one of the table readers. In this scenario, we
would end up releasing the reader of the stack itself, even though the
stack would still be using it.

This patch series addresses those issues by making the reftable reader
refcounted. The first six patches simplify closely related code. The
remaining four introduce refcounting and wire it up as required.
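
To illustrate the approach: conceptually, each table reader carries a
reference count, the stack holds one reference, and every live iterator
acquires another. A reload that drops a compacted table then only
releases the stack's reference, while the iterator keeps the reader
alive. The sketch below is illustrative only; the names and layout are
placeholders rather than the actual API added by this series:

    #include <stdlib.h>

    /*
     * Illustrative sketch only: names and layout are placeholders, not
     * the actual reftable API introduced by this series.
     */
    struct reader {
        int refcount;   /* owners: the stack plus any live iterators */
        /* ... block source, table indices, ... */
    };

    static struct reader *reader_new(void)
    {
        struct reader *r = calloc(1, sizeof(*r));
        if (!r)
            return NULL;
        r->refcount = 1;    /* initial reference, owned by the stack */
        return r;
    }

    static struct reader *reader_incref(struct reader *r)
    {
        r->refcount++;
        return r;
    }

    static void reader_decref(struct reader *r)
    {
        if (--r->refcount > 0)
            return;
        /* The last owner is gone; only now is it safe to close the table. */
        free(r);
    }

    struct table_iterator {
        struct reader *r;   /* keeps the table alive while iterating */
    };

    static void table_iterator_init(struct table_iterator *it,
                                    struct reader *r)
    {
        it->r = reader_incref(r);
    }

    static void table_iterator_release(struct table_iterator *it)
    {
        reader_decref(it->r);
        it->r = NULL;
    }

With such a scheme, a reload merely drops the stack's own references to
readers that were compacted away, and a failed reload must take care not
to drop references the stack still owns; both issues above reduce to
getting those reference drops right.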

With this, the following command now passes:

    make -C .. -j20 SANITIZE=address && GIT_TEST_DEFAULT_REF_FORMAT=reftable ./t5616-partial-clone.sh --run=1-16 --stress

The patch series builds on top of Junio's ps/reftable-drop-generic at
1a3c3870ee (reftable/generic: drop interface, 2024-08-14). That topic is
only in 'seen' right now, but the series would otherwise cause a bunch
of conflicts.

Thanks!

[1]: <20240817121424.GA2439299@coredump.intra.peff.net>

Patrick Steinhardt (10):
  reftable/blocksource: drop malloc block source
  reftable/stack: inline `stack_compact_range_stats()`
  reftable/reader: rename `reftable_new_reader()`
  reftable/reader: inline `init_reader()`
  reftable/reader: inline `reader_close()`
  reftable/stack: fix broken refnames in `write_n_ref_tables()`
  reftable/reader: introduce refcounting
  reftable/reader: keep readers alive during iteration
  reftable/stack: reorder swapping in the reloaded stack contents
  reftable/stack: fix segfault when reload with reused readers fails

 reftable/block_test.c            |   3 +-
 reftable/blocksource.c           |  20 -----
 reftable/blocksource.h           |   2 -
 reftable/reader.c                | 149 ++++++++++++++++---------------
 reftable/reader.h                |   5 +-
 reftable/readwrite_test.c        |  85 +++++++++---------
 reftable/reftable-reader.h       |  19 ++--
 reftable/stack.c                 |  90 +++++++++++--------
 reftable/stack_test.c            | 115 +++++++++++++++++++++++-
 t/helper/test-reftable.c         |   4 +-
 t/unit-tests/t-reftable-merged.c |  10 +--
 11 files changed, 311 insertions(+), 191 deletions(-)


base-commit: 1a3c3870ee1a65b0579ccbac6b18e22b8c44c5b4

Comments

karthik nayak Aug. 22, 2024, 8:13 a.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> Hi,
>
> as reported by Peff in [1], the reftable backend fails in t5616 in a
> racy manner. The cause of this is that git-maintenance(1) processes leak
> into subsequent tests and perform concurrent compaction of the stack.
> The reftable backend is expected to handle this gracefully, but doesn't.
>

I've not looked into the recent changes made to `git-maintenance(1)`, but
my understanding is that `git-maintenance(1)` can now detach itself
without relying on `git-gc(1)`'s detach.

> The issue can surface whenever the reftable stack is reloaded while an
> iterator is active. If some of the tables have been compacted, we end up
> closing them, and the iterator then tries to use those closed tables.
>

Makes sense till here.

> This patch series fixes that issue by starting to refcount the readers.

Not sure what you mean by 'refcount' here though.

> Each iterator will bump the refcount, thus avoiding the problem. While
> at it, I also found a second issue where we segfault when reloading a
> table fails while reusing one of the table readers. In this scenario, we
> would end up releasing the reader of the stack itself, even though the
> stack would still be using it.
>
> This patch series addresses those issues by making the reftable reader
> refcounted. The first six patches simplify closely related code. The
> remaining four introduce refcounting and wire it up as required.
>

Perhaps it will become clearer as I read through the patches.

[snip]
Jeff King Aug. 22, 2024, 12:41 p.m. UTC | #2
On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote:

> This patch series fixes that issue by starting to refcount the readers.
> Each iterator will bump the refcount, thus avoiding the problem. While
> at it, I also found a second issue where we segfault when reloading a
> table fails while reusing one of the table readers. In this scenario, we
> would end up releasing the reader of the stack itself, even though the
> stack would still be using it.

I gave a fairly cursory look over this, as I'm not all that familiar
with the reftable code. But everything looked pretty sensible to me.

I wondered how we might test this. It looks like you checked the
--stress output (which I confirmed seems fixed for me), along with a
synthetic test directly calling various reftable functions. Both seem
good to me.

I had hoped to be able to have a non-racy external test, where running
actual Git commands showed the segfault in a deterministic way. But I
couldn't come up with one. One trick we've used for "pausing" a reading
process is to run "cat-file --batch" from a fifo. You can ask it to do
some ref lookups, then while it waits for more input, update the
reftable, and then ask it to do more.

But I don't think that's sufficient here, as the race happens while we
are actually iterating. You'd really need some for_each_ref() callback
that blocks in some externally controllable way. Possibly you could do
something clever with partial-clone lazy fetches (where you stall the
fetch and then do ref updates in the middle), but that is getting pretty
convoluted.
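
To make that idea concrete, such a blocking callback for a hypothetical
test helper might look roughly like the following. The helper, the
callback signature, and the "go file" mechanism are all invented for
illustration; nothing like this exists in the tree:

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Hypothetical sketch: a per-ref callback that an imagined test
     * helper would invoke while iterating over refs.  It pauses until
     * an external "go" file appears, giving the test script a window
     * to compact and reload the reftable stack mid-iteration.
     */
    static int pausing_ref_callback(const char *refname, void *cb_data)
    {
        const char *go_file = cb_data;

        printf("visiting %s\n", refname);

        /* Block until the controlling test script says to continue. */
        while (access(go_file, F_OK) < 0)
            sleep(1);

        return 0;   /* keep iterating */
    }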

So I think the tests you included seem like a good place to stop.

I did have a little trouble applying this for testing. I wanted to do it
on 'next', which has the maintenance changes to cause the race. But
merging it there ended up with a lot of conflicts with other reftable
topics (especially the tests). I was able to resolve them all, but you
might be able to make Junio's life easier by coordinating the topics a
bit.

-Peff
Patrick Steinhardt Aug. 22, 2024, 12:53 p.m. UTC | #3
On Thu, Aug 22, 2024 at 08:41:00AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 05:39:38PM +0200, Patrick Steinhardt wrote:
> 
> > This patch series fixes that issue by starting to refcount the readers.
> > Each iterator will bump the refcount, thus avoiding the problem. While
> > at it, I also found a second issue where we segfault when reloading a
> > table fails while reusing one of the table readers. In this scenario, we
> > would end up releasing the reader of the stack itself, even though the
> > stack would still be using it.
> 
> I gave a fairly cursory look over this, as I'm not all that familiar
> with the reftable code. But everything looked pretty sensible to me.

Thanks for your review, appreciated!

> I wondered how we might test this. It looks like you checked the
> --stress output (which I confirmed seems fixed for me), along with a
> synthetic test directly calling various reftable functions. Both seem
> good to me.
> 
> I had hoped to be able to have a non-racy external test, where running
> actual Git commands showed the segfault in a deterministic way. But I
> couldn't come up with one. One trick we've used for "pausing" a reading
> process is to run "cat-file --batch" from a fifo. You can ask it to do
> some ref lookups, then while it waits for more input, update the
> reftable, and then ask it to do more.
> 
> But I don't think that's sufficient here, as the race happens while we
> are actually iterating. You'd really need some for_each_ref() callback
> that blocks in some externally controllable way. Possibly you could do
> something clever with partial-clone lazy fetches (where you stall the
> fetch and then do ref updates in the middle), but that is getting pretty
> convoluted.
> 
> So I think the tests you included seem like a good place to stop.

Yeah. I think demonstrating the issues on the API level is okayish. It
would of course have been nice to also do it via our shell scripts, but
as you say, it feels rather fragile.

> I did have a little trouble applying this for testing. I wanted to do it
> on 'next', which has the maintenance changes to cause the race. But
> merging it there ended up with a lot of conflicts with other reftable
> topics (especially the tests). I was able to resolve them all, but you
> might be able to make Junio's life easier by coordinating the topics a
> bit.

I can certainly do that. I know that there are conflicts both with the
patch series dropping the generic tables and with the patch series that
move the reftable unit tests into our own codebase. I did address the
former by basing it on top of that series, but didn't yet address the
latter.

I'm okay with waiting a bit until most of the conflicting topics land. I
guess most of them should be close to landing anyway. Alternatively, I
can also pull all of them in as dependencies.

Patrick
Junio C Hamano Aug. 22, 2024, 3:11 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> But I don't think that's sufficient here, as the race happens while we
> are actually iterating. You'd really need some for_each_ref() callback
> that blocks in some externally controllable way. Possibly you could do
> something clever with partial-clone lazy fetches (where you stall the
> fetch and then do ref updates in the middle), but that is getting pretty
> convoluted.

;-)

> So I think the tests you included seem like a good place to stop.

Yeah, I agree with this assessment.

> I did have a little trouble applying this for testing. I wanted to do it
> on 'next', which has the maintenance changes to cause the race. But
> merging it there ended up with a lot of conflicts with other reftable
> topics (especially the tests). I was able to resolve them all, but you
> might be able to make Junio's life easier by coordinating the topics a
> bit.

I haven't tried merging the topic to 'next', jumping over other
topics in flight in 'seen'.  Thanks for a heads up, but I am hoping
that my rerere database is fairly complete for this topic by now.

Thanks.
Junio C Hamano Aug. 22, 2024, 3:15 p.m. UTC | #5
Patrick Steinhardt <ps@pks.im> writes:

> I can certainly do that. I know that there are conflicts both with the
> patch series dropping the generic tables and with the patch series that
> move the reftable unit tests into our own codebase. I did address the
> former by basing it on top of that series, but didn't yet address the
> latter.
>
> I'm okay with waiting a bit until most of the conflicting topics land. I
> guess most of them should be close to landing anyway. Alternatively, I
> can also pull all of them in as dependencies.

As long as the resolution you see in 'seen' looks acceptable, I'd
rather prefer you do nothing, than a rebase that _reduces_ the
number of conflicts (but does not completely eliminate them), when I
already have a working resolution recorded in my rerere database.

Of course, updates to polish the substance of the topic are very
much appreciated, and I'll redo the resolution in such a case if
needed.  That's one of the things the maintainers do.

Thanks.
Patrick Steinhardt Aug. 23, 2024, 5:23 a.m. UTC | #6
On Thu, Aug 22, 2024 at 08:15:55AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > I can certainly do that. I know that there are conflicts both with the
> > patch series dropping the generic tables and with the patch series that
> > move the reftable unit tests into our own codebase. I did address the
> > former by basing it on top of that series, but didn't yet address the
> > latter.
> >
> > I'm okay with waiting a bit until most of the conflicting topics land. I
> > guess most of them should be close to landing anyway. Alternatively, I
> > can also pull all of them in as dependencies.
> 
> As long as the resolution you see in 'seen' looks acceptable, I'd
> rather prefer you do nothing, than a rebase that _reduces_ the
> number of conflicts (but does not completely eliminate them), when I
> already have a working resolution recorded in my rerere database.
> 
> Of course, updates to polish the substance of the topic are very
> much appreciated, and I'll redo the resolution in such a case if
> needed.  That's one of the things the maintainers do.

Yup, the merge in `seen` looks good to me. Thanks!

Patrick