Message ID | Y1M3oie5dPrRaOni@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | repack tempfile-cleanup signal deadlock | expand |
On Fri, Oct 21, 2022 at 08:21:54PM -0400, Jeff King wrote: > +test_expect_success 'clean up .tmp-* packs on error' ' > + test_must_fail git \ > + -c repack.cruftwindow=bogus \ > + repack -ad --cruft && > + find $objdir/pack -name '.tmp-*' >tmpfiles && > + test_must_be_empty tmpfiles > +' Ugh, I think this test is racy. I saw a failure via SIGPIPE on OS X in CI, and running it locally with --stress confirmed. I think the problem is in our method to trigger pack-objects to fail for the cruft pack. We pass a bogus command line argument, so pack-objects exits more or less immediately. But the parent git-repack process is trying to write to its stdin (to name packs to keep/exclude). So that write may succeed or fail based on how quickly the child dies. Some options: - find a different way to convince repack to die. The most likely is probably corrupting the cruft objects in some way such that pack-objects dies, but not until it starts doing real work. - mark the test_must_fail with ok=sigpipe. In most case this is a band-aid, but here we still test what we want. The .tmp cleanup should kick in from a die() or from a signal death, so either is sufficient for our purposes. - teach git-repack to ignore sigpipe. We don't actually check the writes to pack-objects (since they're done by fprintf), but we'd notice its failing exit code. And arguably this is improving the user experience. Saying "pack-objects died with an error" is more useful than a sigpipe. Thoughts? -Peff
Jeff King <peff@peff.net> writes: > Ugh, I think this test is racy. I saw a failure via SIGPIPE on OS X in > CI, and running it locally with --stress confirmed. I think the problem > is in our method to trigger pack-objects to fail for the cruft pack. We > pass a bogus command line argument, so pack-objects exits more or less > immediately. But the parent git-repack process is trying to write to its > stdin (to name packs to keep/exclude). So that write may succeed or fail > based on how quickly the child dies. Yeah, I saw flaky failures myself during my integration tests. > Some options: > > - find a different way to convince repack to die. The most likely is > probably corrupting the cruft objects in some way such that > pack-objects dies, but not until it starts doing real work. > > - mark the test_must_fail with ok=sigpipe. In most case this is a > band-aid, but here we still test what we want. The .tmp cleanup > should kick in from a die() or from a signal death, so either is > sufficient for our purposes. > > - teach git-repack to ignore sigpipe. We don't actually check the > writes to pack-objects (since they're done by fprintf), but we'd > notice its failing exit code. And arguably this is improving the > user experience. Saying "pack-objects died with an error" is more > useful than a sigpipe. > > Thoughts? I agree that for this particular one the second one is a reasonable thing to do in the context of the test. The third one may actually be a better fix, exactly for the reason you state here. Thanks.
On Sat, Oct 22, 2022 at 05:14:33PM -0700, Junio C Hamano wrote: > > Some options: > > > > - find a different way to convince repack to die. The most likely is > > probably corrupting the cruft objects in some way such that > > pack-objects dies, but not until it starts doing real work. > > > > - mark the test_must_fail with ok=sigpipe. In most case this is a > > band-aid, but here we still test what we want. The .tmp cleanup > > should kick in from a die() or from a signal death, so either is > > sufficient for our purposes. > > > > - teach git-repack to ignore sigpipe. We don't actually check the > > writes to pack-objects (since they're done by fprintf), but we'd > > notice its failing exit code. And arguably this is improving the > > user experience. Saying "pack-objects died with an error" is more > > useful than a sigpipe. > > > > Thoughts? > > I agree that for this particular one the second one is a reasonable > thing to do in the context of the test. The third one may actually > be a better fix, exactly for the reason you state here. Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the test (and should make the immediate racy CI pain go away). It gives some explanation why the third option isn't as interesting as you'd think. If somebody later wants to add such a "pack-objects died" error, we can adjust sigpipe handling there. -- >8 -- Subject: [PATCH] t7700: annotate cruft-pack failure with ok=sigpipe One of our tests intentionally causes the cruft-pack generation phase of repack to fail, in order to stimulate an exit from repack at the desired moment. It does so by feeding a bogus option argument to pack-objects. This is a simple and reliable way to get pack-objects to fail, but it has one downside: pack-objects will die before reading its stdin, which means the caller repack may racily get SIGPIPE writing to it. For the purposes of this test, that's OK. We are checking whether repack cleans up already-created .tmp files, and it will do so whether it exits or dies by signal (because the tempfile API hooks both). But we have to tell test_must_fail that either outcome is OK, or it complains about the signal. Arguably this is a workaround (compared to fixing repack), as repack dying to SIGPIPE means that it loses the opportunity to give a more detailed message. But we don't actually write such a message anyway; we rely on pack-objects to have written something useful to stderr, and it does. In either case (signal or exit), that is the main thing the user will see. Signed-off-by: Jeff King <peff@peff.net> --- t/t7700-repack.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index edcda849b9..9164acbe02 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -433,7 +433,7 @@ test_expect_success TTY '--quiet disables progress' ' ' test_expect_success 'clean up .tmp-* packs on error' ' - test_must_fail git \ + test_must_fail ok=sigpipe git \ -c repack.cruftwindow=bogus \ repack -ad --cruft && find $objdir/pack -name '.tmp-*' >tmpfiles &&
Jeff King <peff@peff.net> writes: > Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the > test (and should make the immediate racy CI pain go away). It gives some > explanation why the third option isn't as interesting as you'd think. If > somebody later wants to add such a "pack-objects died" error, we can > adjust sigpipe handling there. An extremely simplified alternative would be just to say ! instead of test_must_fail, which essentially is ok=anycrash ;-) I did try the same exact patch before going to bed last night but t7700 somehow failed some other steps in my local tests and I gave up digging further X-<. One step at a time... Will queue. Thanks. > -- >8 -- > Subject: [PATCH] t7700: annotate cruft-pack failure with ok=sigpipe > > One of our tests intentionally causes the cruft-pack generation phase of > repack to fail, in order to stimulate an exit from repack at the desired > moment. It does so by feeding a bogus option argument to pack-objects. > This is a simple and reliable way to get pack-objects to fail, but it > has one downside: pack-objects will die before reading its stdin, which > means the caller repack may racily get SIGPIPE writing to it. > > For the purposes of this test, that's OK. We are checking whether repack > cleans up already-created .tmp files, and it will do so whether it exits > or dies by signal (because the tempfile API hooks both). > > But we have to tell test_must_fail that either outcome is OK, or it > complains about the signal. Arguably this is a workaround (compared to > fixing repack), as repack dying to SIGPIPE means that it loses the > opportunity to give a more detailed message. But we don't actually write > such a message anyway; we rely on pack-objects to have written something > useful to stderr, and it does. In either case (signal or exit), that is > the main thing the user will see. > > Signed-off-by: Jeff King <peff@peff.net> > --- > t/t7700-repack.sh | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh > index edcda849b9..9164acbe02 100755 > --- a/t/t7700-repack.sh > +++ b/t/t7700-repack.sh > @@ -433,7 +433,7 @@ test_expect_success TTY '--quiet disables progress' ' > ' > > test_expect_success 'clean up .tmp-* packs on error' ' > - test_must_fail git \ > + test_must_fail ok=sigpipe git \ > -c repack.cruftwindow=bogus \ > repack -ad --cruft && > find $objdir/pack -name '.tmp-*' >tmpfiles &&
On Sun, Oct 23, 2022 at 11:08:26AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Here's a patch on top of of jk/repack-tempfile-cleanup that adjusts the > > test (and should make the immediate racy CI pain go away). It gives some > > explanation why the third option isn't as interesting as you'd think. If > > somebody later wants to add such a "pack-objects died" error, we can > > adjust sigpipe handling there. > > An extremely simplified alternative would be just to say ! instead > of test_must_fail, which essentially is ok=anycrash ;-) I would have been tempted to do that if we hadn't already added ok=sigpipe long ago. :) > I did try the same exact patch before going to bed last night but > t7700 somehow failed some other steps in my local tests and I gave > up digging further X-<. One step at a time... Hopefully not my bits. :) FWIW, the tips of "master", "next", and "seen" all pass just fine for me (minus this race on seen). -Peff
Jeff King <peff@peff.net> writes: >> I did try the same exact patch before going to bed last night but >> t7700 somehow failed some other steps in my local tests and I gave >> up digging further X-<. One step at a time... > > Hopefully not my bits. :) FWIW, the tips of "master", "next", and "seen" > all pass just fine for me (minus this race on seen). Not your bits. What I am pushing out on 'seen' now has a few topics excluded to avoid CI breakages, reproducible or flaky. Thanks.
diff --git a/builtin/repack.c b/builtin/repack.c index d1929bb3db..39f03c3a1d 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -122,13 +122,6 @@ static void remove_temporary_files(void) strbuf_release(&buf); } -static void remove_pack_on_signal(int signo) -{ - remove_temporary_files(); - sigchain_pop(signo); - raise(signo); -} - /* * Adds all packs hex strings to either fname_nonkept_list or * fname_kept_list based on whether each pack has a corresponding @@ -248,7 +241,7 @@ static struct { }; struct generated_pack_data { - char exts[ARRAY_SIZE(exts)]; + struct tempfile *tempfiles[ARRAY_SIZE(exts)]; }; static struct generated_pack_data *populate_pack_exts(const char *name) @@ -265,7 +258,7 @@ static struct generated_pack_data *populate_pack_exts(const char *name) if (stat(path.buf, &statbuf)) continue; - data->exts[i] = 1; + data->tempfiles[i] = register_tempfile(path.buf); } strbuf_release(&path); @@ -867,8 +860,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix) split_pack_geometry(geometry, geometric_factor); } - sigchain_push_common(remove_pack_on_signal); - prepare_pack_objects(&cmd, &po_args); show_progress = !po_args.quiet && isatty(2); @@ -1013,30 +1004,29 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct generated_pack_data *data = item->util; for (ext = 0; ext < ARRAY_SIZE(exts); ext++) { - char *fname, *fname_old; + char *fname; fname = mkpathdup("%s/pack-%s%s", packdir, item->string, exts[ext].name); - fname_old = mkpathdup("%s-%s%s", - packtmp, item->string, exts[ext].name); - if (data->exts[ext]) { + if (data->tempfiles[ext]) { + const char *fname_old = get_tempfile_path(data->tempfiles[ext]); struct stat statbuffer; + if (!stat(fname_old, &statbuffer)) { statbuffer.st_mode &= ~(S_IWUSR | S_IWGRP | S_IWOTH); chmod(fname_old, statbuffer.st_mode); } - if (rename(fname_old, fname)) - die_errno(_("renaming '%s' failed"), fname_old); + if (rename_tempfile(&data->tempfiles[ext], fname)) + die_errno(_("renaming pack to '%s' failed"), fname); } else if (!exts[ext].optional) die(_("pack-objects did not write a '%s' file for pack %s-%s"), exts[ext].name, packtmp, item->string); else if (unlink(fname) < 0 && errno != ENOENT) die_errno(_("could not unlink: %s"), fname); free(fname); - free(fname_old); } } /* End of pack replacement. */ diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index ca45c4cd2c..592016f64a 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -432,6 +432,14 @@ test_expect_success TTY '--quiet disables progress' ' test_must_be_empty stderr ' +test_expect_success 'clean up .tmp-* packs on error' ' + test_must_fail git \ + -c repack.cruftwindow=bogus \ + repack -ad --cruft && + find $objdir/pack -name '.tmp-*' >tmpfiles && + test_must_be_empty tmpfiles +' + test_expect_success 'setup for update-server-info' ' git init update-server-info && test_commit -C update-server-info message
When git-repack exits due to a signal, it tries to clean up by calling its remove_temporary_files() function, which walks through the packs dir looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of the current process). The biggest problem here is that remove_temporary_files() is not safe to call in a signal handler. It uses opendir(), which isn't on the POSIX async-signal-safe list. The details will be platform-specific, but a likely issue is that it needs to allocate memory; if we receive a signal while inside malloc(), etc, we'll conflict on the allocator lock and deadlock with ourselves. We can fix this by just cleaning up the files directly, without walking the directory. We already know the complete list of .tmp-* files that were generated, because we recorded them via populate_pack_exts(). When we find files there, we can use register_tempfile() to record the filenames. If we receive a signal, then the tempfile API will clean them up for us, and it's async-safe and pretty battle-tested. Note that this is slightly racier than the existing scheme. We don't record the filenames until pack-objects tells us the hash over stdout. So during the period between it generating the file and reporting the hash, we'd fail to clean up. However, that period is very small. During most of the pack generation process pack-objects is using its own internal tempfiles. It's only at the very end that it moves them into the names git-repack expects, and then it immediately reports the name to us. Given that cleanup like this is best effort (after all, we may get SIGKILL), this level of race is acceptable. When we register the tempfiles, we'll record them locally and use the result to call rename_tempfile(), rather than renaming by hand. This isn't strictly necessary, as once we've renamed the files they're gone, and the tempfile API's cleanup unlink() would simply become a pointless noop. But managing the lifetimes of the tempfile objects is the cleanest thing to do, and the tempfile pointers naturally fill the same role as the old booleans. This patch also fixes another small problem. We only hook signals, and don't set up an atexit handler. So if we see an error that causes us to die(), we'll leave the .tmp-* files in place. But since the tempfile API handles this for us, this is now fixed for free. The new test covers this by stimulating a failure of pack-objects when generating a cruft pack. Before this patch, the .tmp-* file for the main pack would have been left, but now we correctly clean it up. Two small subtleties on the implementation: - in the renaming loop, we can stop re-constructing fname_old; we only use it when we have a tempfile to rename, so we can just ask the tempfile for its path (which, barring bugs, should be identical) - when renaming fails, our error message mentions fname_old. But since a failed rename_tempfile() invalidates the tempfile struct, we'll lose access to that string. Instead, let's mention the destination filename, which is what most other callers do. Reported-by: Jan Pokorný <poki@fnusa.cz> Signed-off-by: Jeff King <peff@peff.net> --- builtin/repack.c | 26 ++++++++------------------ t/t7700-repack.sh | 8 ++++++++ 2 files changed, 16 insertions(+), 18 deletions(-)