[0/2] midx: prevent against racily disappearing packs

Message ID	cover.1606324509.git.me@ttaylorr.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Date: Wed, 25 Nov 2020 12:17:22 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: dstolee@microsoft.com, peff@peff.net Subject: [PATCH 0/2] midx: prevent against racily disappearing packs Message-ID: <cover.1606324509.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Precedence: bulk
Series	midx: prevent against racily disappearing packs \| expand [0/2] midx: prevent against racily disappearing packs [1/2] packfile.c: protect against disappearing indexes [2/2] midx.c: protect against disappearing packs

Message ID

cover.1606324509.git.me@ttaylorr.com (mailing list archive)

Headers

Date: Wed, 25 Nov 2020 12:17:22 -0500
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: dstolee@microsoft.com, peff@peff.net
Subject: [PATCH 0/2] midx: prevent against racily disappearing packs
Message-ID: <cover.1606324509.git.me@ttaylorr.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Precedence: bulk

Series

midx: prevent against racily disappearing packs | expand

Message

Taylor Blau Nov. 25, 2020, 5:17 p.m. UTC

Here are a couple of patches that Peff and I wrote after noticing a
problem where racily disappearing .idx files can cause a segfault. While
investigating, we fixed a related issue which is described and fixed in
the second patch.

(I'm a little behind in responding to review on the reachability
generation improvements patches, but had these in my backlog and they
seemed relatively easy to send and remove from my TODO list. So, here
they are :-).)

Thanks,
Taylor

Taylor Blau (2):
  packfile.c: protect against disappearing indexes
  midx.c: protect against disappearing packs

 midx.c                      |  2 +-
 packfile.c                  | 19 ++-----------------
 t/t5319-multi-pack-index.sh | 30 ++++++++++++++++++++++++++++--
 3 files changed, 31 insertions(+), 20 deletions(-)

--
2.29.2.368.ge1806d1bdc

Comments

Junio C Hamano Nov. 25, 2020, 9:13 p.m. UTC | #1

Taylor Blau <me@ttaylorr.com> writes:

> Here are a couple of patches that Peff and I wrote after noticing a
> problem where racily disappearing .idx files can cause a segfault. While
> investigating, we fixed a related issue which is described and fixed in
> the second patch.

In the cover letter it won't affect the end result, but when talking
about "race", it always is a good idea to explicitly mention both
sides of the race.  It is clear what one side is in the above
(i.e. somebody who removes .idx file that is still in use), but it
is not so clear who gets hit and segfaults.

I am guessing that the other party is the user of .pack file(s)
bypassing the corresponding .idx file(s) because the necessary data
is mostly in .midx?

Thanks.

Jeff King Nov. 26, 2020, 12:48 a.m. UTC | #2

On Wed, Nov 25, 2020 at 01:13:31PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > Here are a couple of patches that Peff and I wrote after noticing a
> > problem where racily disappearing .idx files can cause a segfault. While
> > investigating, we fixed a related issue which is described and fixed in
> > the second patch.
> 
> In the cover letter it won't affect the end result, but when talking
> about "race", it always is a good idea to explicitly mention both
> sides of the race.  It is clear what one side is in the above
> (i.e. somebody who removes .idx file that is still in use), but it
> is not so clear who gets hit and segfaults.
> 
> I am guessing that the other party is the user of .pack file(s)
> bypassing the corresponding .idx file(s) because the necessary data
> is mostly in .midx?

Yeah, the race reproduction in the second commit message can actually
reproduce the segfault as well (it depends on the exact timing which
error you get). So the segfault is in the reader, who is not checking
the result of find_revindex_entry().

Arguably every call there should be checking for NULL, but in practice
I think it would always be a bug:

  - we were somehow unable to open the index in order to generate the
    revindex (which is what happened here). But I think we are better
    off making sure that we can always do so, which is what this series
    does.

  - the caller asked about an object at a position beyond the number of
    objects in the packfile. This is a bug in the caller.

So we could perhaps BUG() in find_revindex_entry() instead of returning
NULL. A quick segfault accomplishes mostly the same thing, though the
BUG() could distinguish the two cases more clearly.

-Peff

Taylor Blau Nov. 26, 2020, 1:04 a.m. UTC | #3

On Wed, Nov 25, 2020 at 07:48:54PM -0500, Jeff King wrote:
> Yeah, the race reproduction in the second commit message can actually
> reproduce the segfault as well (it depends on the exact timing which
> error you get). So the segfault is in the reader, who is not checking
> the result of find_revindex_entry().
>
> Arguably every call there should be checking for NULL, but in practice
> I think it would always be a bug:
>
>   - we were somehow unable to open the index in order to generate the
>     revindex (which is what happened here). But I think we are better
>     off making sure that we can always do so, which is what this series
>     does.
>
>   - the caller asked about an object at a position beyond the number of
>     objects in the packfile. This is a bug in the caller.
>
> So we could perhaps BUG() in find_revindex_entry() instead of returning
> NULL. A quick segfault accomplishes mostly the same thing, though the
> BUG() could distinguish the two cases more clearly.

Yeah, a find_revindex_entry() that returns NULL means that the caller is
probably dead in the water.

FWIW, this function gets touched by a series that I'm working on here:
[1]. There, I think "returning NULL" is equivalent to "returning -1",
and the problem exists there, too.

We could return a different negative number, call BUG(), or do nothing
other than what's written. I don't have any strong feelings, though.

> -Peff

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git/blob/tb/on-disk-revindex-part-one/pack-revindex.c#L177-L201

Jeff King Nov. 26, 2020, 1:27 a.m. UTC | #4

On Wed, Nov 25, 2020 at 08:04:44PM -0500, Taylor Blau wrote:

> > So we could perhaps BUG() in find_revindex_entry() instead of returning
> > NULL. A quick segfault accomplishes mostly the same thing, though the
> > BUG() could distinguish the two cases more clearly.
> 
> Yeah, a find_revindex_entry() that returns NULL means that the caller is
> probably dead in the water.
> 
> FWIW, this function gets touched by a series that I'm working on here:
> [1]. There, I think "returning NULL" is equivalent to "returning -1",
> and the problem exists there, too.
> 
> We could return a different negative number, call BUG(), or do nothing
> other than what's written. I don't have any strong feelings, though.

Yeah, I was suggesting _never_ returning a failure, and just hitting a
BUG() within the function. So it does not matter then how you represent
the error return type, because there isn't one. :)

Looking over the callers, there are actually a few that check the return
value and handle it sanely. I suspect they can never trigger in
practice, given that the other callers would all segfault, and we have
not seen any reports in the 15 years during which that has been the
case. But perhaps some of them can be triggered by bogus pack data that
nobody has ever run into in the real world.

At any rate, I am content to ignore it until somebody feels like digging
into each caller. A BUG() is only marginally better than an immediate
segfault anyway, and I'd prefer not to disrupt more substantive
improvements in the area.

-Peff