diff mbox series

[v2,4/5] repack: use tempfiles for signal cleanup

Message ID Y1M3oie5dPrRaOni@coredump.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series repack tempfile-cleanup signal deadlock | expand

Commit Message

Jeff King Oct. 22, 2022, 12:21 a.m. UTC
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).

The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.

We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.

Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.

When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand.  This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.

This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.

Two small subtleties on the implementation:

  - in the renaming loop, we can stop re-constructing fname_old; we only
    use it when we have a tempfile to rename, so we can just ask the
    tempfile for its path (which, barring bugs, should be identical)

  - when renaming fails, our error message mentions fname_old. But since
    a failed rename_tempfile() invalidates the tempfile struct, we'll
    lose access to that string. Instead, let's mention the destination
    filename, which is what most other callers do.

Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/repack.c  | 26 ++++++++------------------
 t/t7700-repack.sh |  8 ++++++++
 2 files changed, 16 insertions(+), 18 deletions(-)

Comments

Jeff King Oct. 22, 2022, 8:35 p.m. UTC | #1
On Fri, Oct 21, 2022 at 08:21:54PM -0400, Jeff King wrote:

> +test_expect_success 'clean up .tmp-* packs on error' '
> +	test_must_fail git \
> +		-c repack.cruftwindow=bogus \
> +		repack -ad --cruft &&
> +	find $objdir/pack -name '.tmp-*' >tmpfiles &&
> +	test_must_be_empty tmpfiles
> +'

Ugh, I think this test is racy. I saw a failure via SIGPIPE on OS X in
CI, and running it locally with --stress confirmed. I think the problem
is in our method to trigger pack-objects to fail for the cruft pack. We
pass a bogus command line argument, so pack-objects exits more or less
immediately. But the parent git-repack process is trying to write to its
stdin (to name packs to keep/exclude). So that write may succeed or fail
based on how quickly the child dies.

Some options:

  - find a different way to convince repack to die. The most likely is
    probably corrupting the cruft objects in some way such that
    pack-objects dies, but not until it starts doing real work.

  - mark the test_must_fail with ok=sigpipe. In most case this is a
    band-aid, but here we still test what we want. The .tmp cleanup
    should kick in from a die() or from a signal death, so either is
    sufficient for our purposes.

  - teach git-repack to ignore sigpipe. We don't actually check the
    writes to pack-objects (since they're done by fprintf), but we'd
    notice its failing exit code. And arguably this is improving the
    user experience. Saying "pack-objects died with an error" is more
    useful than a sigpipe.

Thoughts?

-Peff
Junio C Hamano Oct. 23, 2022, 12:14 a.m. UTC | #2
Jeff King <peff@peff.net> writes:

> Ugh, I think this test is racy. I saw a failure via SIGPIPE on OS X in
> CI, and running it locally with --stress confirmed. I think the problem
> is in our method to trigger pack-objects to fail for the cruft pack. We
> pass a bogus command line argument, so pack-objects exits more or less
> immediately. But the parent git-repack process is trying to write to its
> stdin (to name packs to keep/exclude). So that write may succeed or fail
> based on how quickly the child dies.

Yeah, I saw flaky failures myself during my integration tests.

> Some options:
>
>   - find a different way to convince repack to die. The most likely is
>     probably corrupting the cruft objects in some way such that
>     pack-objects dies, but not until it starts doing real work.
>
>   - mark the test_must_fail with ok=sigpipe. In most case this is a
>     band-aid, but here we still test what we want. The .tmp cleanup
>     should kick in from a die() or from a signal death, so either is
>     sufficient for our purposes.
>
>   - teach git-repack to ignore sigpipe. We don't actually check the
>     writes to pack-objects (since they're done by fprintf), but we'd
>     notice its failing exit code. And arguably this is improving the
>     user experience. Saying "pack-objects died with an error" is more
>     useful than a sigpipe.
>
> Thoughts?

I agree that for this particular one the second one is a reasonable
thing to do in the context of the test.  The third one may actually
be a better fix, exactly for the reason you state here.

Thanks.
Jeff King Oct. 23, 2022, 5 p.m. UTC | #3
On Sat, Oct 22, 2022 at 05:14:33PM -0700, Junio C Hamano wrote:

> > Some options:
> >
> >   - find a different way to convince repack to die. The most likely is
> >     probably corrupting the cruft objects in some way such that
> >     pack-objects dies, but not until it starts doing real work.
> >
> >   - mark the test_must_fail with ok=sigpipe. In most case this is a
> >     band-aid, but here we still test what we want. The .tmp cleanup
> >     should kick in from a die() or from a signal death, so either is
> >     sufficient for our purposes.
> >
> >   - teach git-repack to ignore sigpipe. We don't actually check the
> >     writes to pack-objects (since they're done by fprintf), but we'd
> >     notice its failing exit code. And arguably this is improving the
> >     user experience. Saying "pack-objects died with an error" is more
> >     useful than a sigpipe.
> >
> > Thoughts?
> 
> I agree that for this particular one the second one is a reasonable
> thing to do in the context of the test.  The third one may actually
> be a better fix, exactly for the reason you state here.

Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the
test (and should make the immediate racy CI pain go away). It gives some
explanation why the third option isn't as interesting as you'd think. If
somebody later wants to add such a "pack-objects died" error, we can
adjust sigpipe handling there.

-- >8 --
Subject: [PATCH] t7700: annotate cruft-pack failure with ok=sigpipe

One of our tests intentionally causes the cruft-pack generation phase of
repack to fail, in order to stimulate an exit from repack at the desired
moment. It does so by feeding a bogus option argument to pack-objects.
This is a simple and reliable way to get pack-objects to fail, but it
has one downside: pack-objects will die before reading its stdin, which
means the caller repack may racily get SIGPIPE writing to it.

For the purposes of this test, that's OK. We are checking whether repack
cleans up already-created .tmp files, and it will do so whether it exits
or dies by signal (because the tempfile API hooks both).

But we have to tell test_must_fail that either outcome is OK, or it
complains about the signal. Arguably this is a workaround (compared to
fixing repack), as repack dying to SIGPIPE means that it loses the
opportunity to give a more detailed message. But we don't actually write
such a message anyway; we rely on pack-objects to have written something
useful to stderr, and it does. In either case (signal or exit), that is
the main thing the user will see.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t7700-repack.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index edcda849b9..9164acbe02 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -433,7 +433,7 @@ test_expect_success TTY '--quiet disables progress' '
 '
 
 test_expect_success 'clean up .tmp-* packs on error' '
-	test_must_fail git \
+	test_must_fail ok=sigpipe git \
 		-c repack.cruftwindow=bogus \
 		repack -ad --cruft &&
 	find $objdir/pack -name '.tmp-*' >tmpfiles &&
Junio C Hamano Oct. 23, 2022, 6:08 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the
> test (and should make the immediate racy CI pain go away). It gives some
> explanation why the third option isn't as interesting as you'd think. If
> somebody later wants to add such a "pack-objects died" error, we can
> adjust sigpipe handling there.

An extremely simplified alternative would be just to say !  instead
of test_must_fail, which essentially is ok=anycrash ;-)

I did try the same exact patch before going to bed last night but
t7700 somehow failed some other steps in my local tests and I gave
up digging further X-<.  One step at a time...

Will queue.  Thanks.

> -- >8 --
> Subject: [PATCH] t7700: annotate cruft-pack failure with ok=sigpipe
>
> One of our tests intentionally causes the cruft-pack generation phase of
> repack to fail, in order to stimulate an exit from repack at the desired
> moment. It does so by feeding a bogus option argument to pack-objects.
> This is a simple and reliable way to get pack-objects to fail, but it
> has one downside: pack-objects will die before reading its stdin, which
> means the caller repack may racily get SIGPIPE writing to it.
>
> For the purposes of this test, that's OK. We are checking whether repack
> cleans up already-created .tmp files, and it will do so whether it exits
> or dies by signal (because the tempfile API hooks both).
>
> But we have to tell test_must_fail that either outcome is OK, or it
> complains about the signal. Arguably this is a workaround (compared to
> fixing repack), as repack dying to SIGPIPE means that it loses the
> opportunity to give a more detailed message. But we don't actually write
> such a message anyway; we rely on pack-objects to have written something
> useful to stderr, and it does. In either case (signal or exit), that is
> the main thing the user will see.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t7700-repack.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index edcda849b9..9164acbe02 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -433,7 +433,7 @@ test_expect_success TTY '--quiet disables progress' '
>  '
>  
>  test_expect_success 'clean up .tmp-* packs on error' '
> -	test_must_fail git \
> +	test_must_fail ok=sigpipe git \
>  		-c repack.cruftwindow=bogus \
>  		repack -ad --cruft &&
>  	find $objdir/pack -name '.tmp-*' >tmpfiles &&
Jeff King Oct. 23, 2022, 8:55 p.m. UTC | #5
On Sun, Oct 23, 2022 at 11:08:26AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the
> > test (and should make the immediate racy CI pain go away). It gives some
> > explanation why the third option isn't as interesting as you'd think. If
> > somebody later wants to add such a "pack-objects died" error, we can
> > adjust sigpipe handling there.
> 
> An extremely simplified alternative would be just to say !  instead
> of test_must_fail, which essentially is ok=anycrash ;-)

I would have been tempted to do that if we hadn't already added
ok=sigpipe long ago. :)

> I did try the same exact patch before going to bed last night but
> t7700 somehow failed some other steps in my local tests and I gave
> up digging further X-<.  One step at a time...

Hopefully not my bits. :) FWIW, the tips of "master", "next", and "seen"
all pass just fine for me (minus this race on seen).

-Peff
Junio C Hamano Oct. 23, 2022, 9:48 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

>> I did try the same exact patch before going to bed last night but
>> t7700 somehow failed some other steps in my local tests and I gave
>> up digging further X-<.  One step at a time...
>
> Hopefully not my bits. :) FWIW, the tips of "master", "next", and "seen"
> all pass just fine for me (minus this race on seen).

Not your bits.  What I am pushing out on 'seen' now has a few topics
excluded to avoid CI breakages, reproducible or flaky.

Thanks.
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index d1929bb3db..39f03c3a1d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -122,13 +122,6 @@  static void remove_temporary_files(void)
 	strbuf_release(&buf);
 }
 
-static void remove_pack_on_signal(int signo)
-{
-	remove_temporary_files();
-	sigchain_pop(signo);
-	raise(signo);
-}
-
 /*
  * Adds all packs hex strings to either fname_nonkept_list or
  * fname_kept_list based on whether each pack has a corresponding
@@ -248,7 +241,7 @@  static struct {
 };
 
 struct generated_pack_data {
-	char exts[ARRAY_SIZE(exts)];
+	struct tempfile *tempfiles[ARRAY_SIZE(exts)];
 };
 
 static struct generated_pack_data *populate_pack_exts(const char *name)
@@ -265,7 +258,7 @@  static struct generated_pack_data *populate_pack_exts(const char *name)
 		if (stat(path.buf, &statbuf))
 			continue;
 
-		data->exts[i] = 1;
+		data->tempfiles[i] = register_tempfile(path.buf);
 	}
 
 	strbuf_release(&path);
@@ -867,8 +860,6 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		split_pack_geometry(geometry, geometric_factor);
 	}
 
-	sigchain_push_common(remove_pack_on_signal);
-
 	prepare_pack_objects(&cmd, &po_args);
 
 	show_progress = !po_args.quiet && isatty(2);
@@ -1013,30 +1004,29 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		struct generated_pack_data *data = item->util;
 
 		for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
-			char *fname, *fname_old;
+			char *fname;
 
 			fname = mkpathdup("%s/pack-%s%s",
 					packdir, item->string, exts[ext].name);
-			fname_old = mkpathdup("%s-%s%s",
-					packtmp, item->string, exts[ext].name);
 
-			if (data->exts[ext]) {
+			if (data->tempfiles[ext]) {
+				const char *fname_old = get_tempfile_path(data->tempfiles[ext]);
 				struct stat statbuffer;
+
 				if (!stat(fname_old, &statbuffer)) {
 					statbuffer.st_mode &= ~(S_IWUSR | S_IWGRP | S_IWOTH);
 					chmod(fname_old, statbuffer.st_mode);
 				}
 
-				if (rename(fname_old, fname))
-					die_errno(_("renaming '%s' failed"), fname_old);
+				if (rename_tempfile(&data->tempfiles[ext], fname))
+					die_errno(_("renaming pack to '%s' failed"), fname);
 			} else if (!exts[ext].optional)
 				die(_("pack-objects did not write a '%s' file for pack %s-%s"),
 				    exts[ext].name, packtmp, item->string);
 			else if (unlink(fname) < 0 && errno != ENOENT)
 				die_errno(_("could not unlink: %s"), fname);
 
 			free(fname);
-			free(fname_old);
 		}
 	}
 	/* End of pack replacement. */
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index ca45c4cd2c..592016f64a 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -432,6 +432,14 @@  test_expect_success TTY '--quiet disables progress' '
 	test_must_be_empty stderr
 '
 
+test_expect_success 'clean up .tmp-* packs on error' '
+	test_must_fail git \
+		-c repack.cruftwindow=bogus \
+		repack -ad --cruft &&
+	find $objdir/pack -name '.tmp-*' >tmpfiles &&
+	test_must_be_empty tmpfiles
+'
+
 test_expect_success 'setup for update-server-info' '
 	git init update-server-info &&
 	test_commit -C update-server-info message