Message ID | pull.1348.git.git.1664236383785.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | tmp-objdir: do not opendir() when handling a signal | expand |
[+cc Peff as the author of tmp-objdir] On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote: > One place we call tmp_objdir_create() is in git-receive-pack, where > we create a temporary quarantine directory "incoming". Incoming > objects will be written to this directory before they get moved to > the object directory. Right, calling opendir() will allocate memory, so we'll get stuck in a deadlock if the signal arrives while libc's allocator lock is held. So we can't safely call opendir() there. It does make me a little uneasy leaving the quarantine directory around via this path. So I wonder if we should be optimistically opening up the DIR handle? Calling unlink() in a signal is perfectly fine, so I'd think as long as we have an open DIR handle we could call readdir_r(), but I don't think we've discussed it before. > @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) > } > > flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; > - dir = opendir(path->buf); > + > + if ((flag & REMOVE_DIR_SIGNAL) == 0) Comparing to the zero value is discouraged. Consider: if (!(flag & REMOVE_DIR_SIGNAL)) instead. > @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl); > /* Remove the_original_cwd too */ > #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08 > > +/* Indicates a signal is being handled */ > +#define REMOVE_DIR_SIGNAL 0x16 > + Perhaps REMOVE_DIR_IN_SIGNAL would be slightly more descriptive. > @@ -49,7 +50,11 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal) > * have pre-grown t->path sufficiently so that this > * doesn't happen in practice. > */ > - err = remove_dir_recursively(&t->path, 0); > + > + if (on_signal) > + flags = flags | REMOVE_DIR_SIGNAL; I'm nitpicking, but you could just write "flags |= REMOVE_DIR_SIGNAL", or even: err = remove_dir_recursively(&t->path, on_signal ? REMOVE_DIR_SIGNAL : 0); Thanks, Taylor
"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes: > tmp-objdir: do not closedir() when handling a signal > > We have recently observed a Git process hanging around for weeks. A > backtrace revealed that a git-receive-pack(1) process was deadlocked > when trying to remove the quarantine directory "incoming." It turns out > that the tmp_objdir API calls opendir(3) and closedir(3) to observe a > directory's contents in order to remove all the contents before removing > the directory itself. These functions are not async signal save as they > allocate and free memory. > > The fix is to avoid calling these functions when handling a signal in > order to avoid a deadlock. The implication of such a fix however, is > that temporary object directories may not get cleaned up properly when a > signal is being handled. The tradeoff this fix is making is to prevent > deadlocks at the cost of temporary object directory cleanup. > > This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when > cleaning up lockfiles in async signals) Hmph, is it really similar? That one, even though the lockfiles won't be cleaned up inside signal handler, they will eventually be cleaned, won't they? As opposed to here, once we punt, we punt and do not revisit when we re-raise and eventually exit, no? Leaving temporary directories behind is MUCH MUCH better than getting stuck in a deadlock, so it is much better than the status quo, of course. > static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) > { > - DIR *dir; > + DIR *dir = NULL; > struct dirent *e; > int ret = 0, original_len = path->len, len, kept_down = 0; > int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY); > @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) > } > > flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; > - dir = opendir(path->buf); > + > + if ((flag & REMOVE_DIR_SIGNAL) == 0) > + dir = opendir(path->buf); > + > if (!dir) { > if (errno == ENOENT) > return keep_toplevel ? -1 : 0; > diff --git a/dir.h b/dir.h > index 674747d93af..ba159f4abeb 100644 > --- a/dir.h > +++ b/dir.h > @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl); > /* Remove the_original_cwd too */ > #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08 > > +/* Indicates a signal is being handled */ > +#define REMOVE_DIR_SIGNAL 0x16 > + > /* > * Remove path and its contents, recursively. flags is a combination > * of the above REMOVE_DIR_* constants. Return 0 on success. > diff --git a/tmp-objdir.c b/tmp-objdir.c The fix looks quite straight-forward. Thanks for spotting and working on this issue.
Hi John On 27/09/2022 00:53, John Cai via GitGitGadget wrote: > From: John Cai <johncai86@gmail.com> > > In the tmp-objdir api, tmp_objdir_create will create a temporary > directory but also register signal handlers responsible for removing > the directory's contents and the directory itself. However, the > function responsible for recursively removing the contents and > directory, remove_dir_recurse() calls opendir(3) and closedir(3). > This can be problematic because these functions allocate and free > memory, which are not async-signal-safe functions. This can lead to > deadlocks. > --- a/dir.h > +++ b/dir.h > @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl); > /* Remove the_original_cwd too */ > #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08 > > +/* Indicates a signal is being handled */ > +#define REMOVE_DIR_SIGNAL 0x16 This is setting the bits for REMOVE_DIR_KEEP_NESTED_GIT and REMOVE_DIR_KEEP_TOPLEVEL is that intentional? (it looks like you've doubled 8 to 16 to get the next free bit but used a hex constant, the earlier constants use decimal) Best Wishes Phillip
On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote: > One place we call tmp_objdir_create() is in git-receive-pack, where > we create a temporary quarantine directory "incoming". Incoming > objects will be written to this directory before they get moved to > the object directory. > > We have observed this code leading to a deadlock: > > Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)): > #0 __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80 > <main_arena>) at ./lowlevellock.c:35 > #1 0x00007f621baa635b in __GI___libc_malloc > (bytes=bytes@entry=32816) at malloc.c:3064 > #2 0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60, > flags=0, close_fd=true, fd=5) > at ../sysdeps/posix/opendir.c:118 > #3 opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69 > #4 __opendir (name=<optimized out>) > at ../sysdeps/posix/opendir.c:92 > #5 0x0000557c19c77de1 in remove_dir_recurse () > #6 0x0000557c19d81a4f in remove_tmp_objdir_on_signal () > #7 <signal handler called> Yuck. Your analysis looks right, and certainly opendir() can't really work without allocating memory for the pointer-to-DIR. > To prevent this, add a flag REMOVE_DIR_SIGNAL that allows > remove_dir_recurse() to know that a signal is being handled and avoid > calling opendir(3). One implication of this change is that when > signal handling, the temporary directory may not get cleaned up > properly. It's not really "may not" here, is it? It will never get cleaned up on a signal now. I don't think remove_dir_recursively() will try to rmdir() in this case. But even if it did, we'll always have a "pack" subdirectory (minus the small race before we've created it). That's unfortunate, but I don't think we really have a portable alternative. We can't keep an exact list of files to be deleted, because some of them will be created by sub-processes. We could perhaps exec a helper that does the deletion, but that seems like a race and portability nightmare. On Linux, we could probably use open() and getdents64() to traverse, but obviously that won't work everywhere. It _might_ be worth having some kind of compat/ wrapper here, to let supported systems do as good a job as they can. But it's not like there aren't already cases where we might leave the tmp-objdir directory around (say, SIGKILL), so this is really just extending the existing problem to more signals. I was going to suggest we should do a better job of cleaning up these directories via git-gc. But it looks like b3cecf49ea (tmp-objdir: new API for creating temporary writable databases, 2021-12-06) changed the default name such that a regular git-gc should do so. So I think we're covered there. The main case we care about is normal exit when index-pack or a hook sees an error, in which case we should still be cleaning up via the atexit() handler. So I think your patch is going in the right direction, but... > static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) > { > - DIR *dir; > + DIR *dir = NULL; > struct dirent *e; > int ret = 0, original_len = path->len, len, kept_down = 0; > int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY); > @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) > } > > flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; > - dir = opendir(path->buf); > + > + if ((flag & REMOVE_DIR_SIGNAL) == 0) > + dir = opendir(path->buf); > + > if (!dir) { > if (errno == ENOENT) > return keep_toplevel ? -1 : 0; We skip calling opendir() entirely, so "dir" will still be NULL. But we immediately start looking at errno, which will have some undefined value (based on some previous syscall). If we set "errno" to "EACCES" in this case, then we'd at least hit the rmdir() below: if (!dir) { if (errno == ENOENT) return keep_toplevel ? -1 : 0; else if (errno == EACCES && !keep_toplevel) /* * An empty dir could be removable even if it * is unreadable: */ return rmdir(path->buf); else return -1; } but we know it won't really do anything for our proposed caller, since it will have files inside the directory that need to be removed before rmdir() can work. Moreover, if you were to combine REMOVE_DIR_SIGNAL with REMOVE_DIR_KEEP_NESTED_GIT, I suspect that the call to resolve_gitlink_ref() would end up with similar deadlocks. Obviously nobody is proposing to do that, but it is a pitfall in the API. So all of that makes me think we should not add a new flag here, but instead just avoid calling the function entirely from tmp_objdir_destroy_1(). But then we can observe that tmp_objdir_destroy_1() is basically doing nothing if on_signal is set. So there is really no point in setting up the signal handler at all. We should just set up the atexit() handler. I.e., something like: diff --git a/tmp-objdir.c b/tmp-objdir.c index a8be92bca1..10549e95db 100644 --- a/tmp-objdir.c +++ b/tmp-objdir.c @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix) the_tmp_objdir = t; if (!installed_handlers) { atexit(remove_tmp_objdir); - sigchain_push_common(remove_tmp_objdir_on_signal); installed_handlers++; } with the commit message explaining that we can't do the cleanup in a portable and signal-safe way, so we just punt on the whole concept. There's also some minor cleanup we could do elsewhere to drop the "on_signal" argument (which can come as part of the same patch, or on top). -Peff
On Mon, Sep 26, 2022 at 08:18:47PM -0400, Taylor Blau wrote: > It does make me a little uneasy leaving the quarantine directory around > via this path. So I wonder if we should be optimistically opening up the > DIR handle? Calling unlink() in a signal is perfectly fine, so I'd think > as long as we have an open DIR handle we could call readdir_r(), but I > don't think we've discussed it before. You'd need to hold multiple such DIRs, since the removal is recursive. It's easy-ish for the "pack" directory, but the sub-process index-pack may have created 00-ff directories, too. You'd have to pre-create and opendir() all of them. I'm also not sure what timing guarantees we have. If I opendir() a directory, then wait a long time while somebody else creates entries, is a readdir() guaranteed to see those new entries? -Peff
Hey Peff, On 27 Sep 2022, at 7:44, Jeff King wrote: > On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote: > >> One place we call tmp_objdir_create() is in git-receive-pack, where >> we create a temporary quarantine directory "incoming". Incoming >> objects will be written to this directory before they get moved to >> the object directory. >> >> We have observed this code leading to a deadlock: >> >> Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)): >> #0 __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80 >> <main_arena>) at ./lowlevellock.c:35 >> #1 0x00007f621baa635b in __GI___libc_malloc >> (bytes=bytes@entry=32816) at malloc.c:3064 >> #2 0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60, >> flags=0, close_fd=true, fd=5) >> at ../sysdeps/posix/opendir.c:118 >> #3 opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69 >> #4 __opendir (name=<optimized out>) >> at ../sysdeps/posix/opendir.c:92 >> #5 0x0000557c19c77de1 in remove_dir_recurse () >> #6 0x0000557c19d81a4f in remove_tmp_objdir_on_signal () >> #7 <signal handler called> > > Yuck. Your analysis looks right, and certainly opendir() can't really > work without allocating memory for the pointer-to-DIR. > >> To prevent this, add a flag REMOVE_DIR_SIGNAL that allows >> remove_dir_recurse() to know that a signal is being handled and avoid >> calling opendir(3). One implication of this change is that when >> signal handling, the temporary directory may not get cleaned up >> properly. > > It's not really "may not" here, is it? It will never get cleaned up on a > signal now. I don't think remove_dir_recursively() will try to rmdir() > in this case. But even if it did, we'll always have a "pack" > subdirectory (minus the small race before we've created it). > > That's unfortunate, but I don't think we really have a portable > alternative. We can't keep an exact list of files to be deleted, because > some of them will be created by sub-processes. We could perhaps exec a > helper that does the deletion, but that seems like a race and > portability nightmare. On Linux, we could probably use open() and > getdents64() to traverse, but obviously that won't work everywhere. It > _might_ be worth having some kind of compat/ wrapper here, to let > supported systems do as good a job as they can. But it's not like there > aren't already cases where we might leave the tmp-objdir directory > around (say, SIGKILL), so this is really just extending the existing > problem to more signals. > > I was going to suggest we should do a better job of cleaning up these > directories via git-gc. But it looks like b3cecf49ea (tmp-objdir: new > API for creating temporary writable databases, 2021-12-06) changed the > default name such that a regular git-gc should do so. So I think we're > covered there. I was wondering about this as well. Thanks for checking on this--that's reassuring. > > The main case we care about is normal exit when index-pack or a hook > sees an error, in which case we should still be cleaning up via the > atexit() handler. > > So I think your patch is going in the right direction, but... > >> static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) >> { >> - DIR *dir; >> + DIR *dir = NULL; >> struct dirent *e; >> int ret = 0, original_len = path->len, len, kept_down = 0; >> int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY); >> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) >> } >> >> flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; >> - dir = opendir(path->buf); >> + >> + if ((flag & REMOVE_DIR_SIGNAL) == 0) >> + dir = opendir(path->buf); >> + >> if (!dir) { >> if (errno == ENOENT) >> return keep_toplevel ? -1 : 0; > > We skip calling opendir() entirely, so "dir" will still be NULL. But we > immediately start looking at errno, which will have some undefined value > (based on some previous syscall). > > If we set "errno" to "EACCES" in this case, then we'd at least hit the > rmdir() below: > > if (!dir) { > if (errno == ENOENT) > return keep_toplevel ? -1 : 0; > else if (errno == EACCES && !keep_toplevel) > /* > * An empty dir could be removable even if it > * is unreadable: > */ > return rmdir(path->buf); > else > return -1; > } > > but we know it won't really do anything for our proposed caller, since > it will have files inside the directory that need to be removed before > rmdir() can work. yeah, I suppose the only case it would help is if the directory is already empty. > > Moreover, if you were to combine REMOVE_DIR_SIGNAL with > REMOVE_DIR_KEEP_NESTED_GIT, I suspect that the call to > resolve_gitlink_ref() would end up with similar deadlocks. Obviously > nobody is proposing to do that, but it is a pitfall in the API. > > So all of that makes me think we should not add a new flag here, but > instead just avoid calling the function entirely from > tmp_objdir_destroy_1(). > > But then we can observe that tmp_objdir_destroy_1() is basically doing > nothing if on_signal is set. So there is really no point in setting up > the signal handler at all. We should just set up the atexit() handler. > I.e., something like: > > diff --git a/tmp-objdir.c b/tmp-objdir.c > index a8be92bca1..10549e95db 100644 > --- a/tmp-objdir.c > +++ b/tmp-objdir.c > @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix) > the_tmp_objdir = t; > if (!installed_handlers) { > atexit(remove_tmp_objdir); > - sigchain_push_common(remove_tmp_objdir_on_signal); > installed_handlers++; > } This makes sense and is a clean solution. I guess the only case where we would benefit in calling into remove_tmp_objdir_on_signal() is if the temp dir exists but is empty. I'm not sure how often this would happen to make it worth it? Probably not... > > > with the commit message explaining that we can't do the cleanup in a > portable and signal-safe way, so we just punt on the whole concept. > > There's also some minor cleanup we could do elsewhere to drop the > "on_signal" argument (which can come as part of the same patch, or on > top). > > -Peff
Jeff King <peff@peff.net> writes: > So all of that makes me think we should not add a new flag here, but > instead just avoid calling the function entirely from > tmp_objdir_destroy_1(). Thanks. I missed that undefined access to errno that breaks the intention of the patch. > But then we can observe that tmp_objdir_destroy_1() is basically doing > nothing if on_signal is set. So there is really no point in setting up > the signal handler at all. We should just set up the atexit() handler. > I.e., something like: > > diff --git a/tmp-objdir.c b/tmp-objdir.c > index a8be92bca1..10549e95db 100644 > --- a/tmp-objdir.c > +++ b/tmp-objdir.c > @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix) > the_tmp_objdir = t; > if (!installed_handlers) { > atexit(remove_tmp_objdir); > - sigchain_push_common(remove_tmp_objdir_on_signal); > installed_handlers++; > } > > > with the commit message explaining that we can't do the cleanup in a > portable and signal-safe way, so we just punt on the whole concept. > > There's also some minor cleanup we could do elsewhere to drop the > "on_signal" argument (which can come as part of the same patch, or on > top). ;-) I like the simplification.
On Tue, Sep 27, 2022 at 09:50:43AM -0400, John Cai wrote: > > diff --git a/tmp-objdir.c b/tmp-objdir.c > > index a8be92bca1..10549e95db 100644 > > --- a/tmp-objdir.c > > +++ b/tmp-objdir.c > > @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix) > > the_tmp_objdir = t; > > if (!installed_handlers) { > > atexit(remove_tmp_objdir); > > - sigchain_push_common(remove_tmp_objdir_on_signal); > > installed_handlers++; > > } > > This makes sense and is a clean solution. I guess the only case where we would > benefit in calling into remove_tmp_objdir_on_signal() is if the temp dir exists > but is empty. I'm not sure how often this would happen to make it worth it? > Probably not... It's more or less never, as the first thing we do after creating it is call tmp_objdir_setup(), which creates the pack subdirectory. We _could_ try removing them both manually, like: diff --git a/tmp-objdir.c b/tmp-objdir.c index adf6033549..509eb58363 100644 --- a/tmp-objdir.c +++ b/tmp-objdir.c @@ -49,7 +49,17 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal) * have pre-grown t->path sufficiently so that this * doesn't happen in practice. */ - err = remove_dir_recursively(&t->path, 0); + if (!on_signal) + err = remove_dir_recursively(&t->path, 0); + else { + size_t orig_len = t->path.len; + + strbuf_addstr(&t->path, "/pack"); + rmdir(t->path.buf); + strbuf_setlen(&t->path, orig_len); + + rmdir(t->path.buf); + } /* * When we are cleaning up due to a signal, we won't bother but even that would rarely do anything useful. As soon as the child index-pack receives even a single byte, we'll have an actual pack tmpfile with an unknown name, which will cause rmdir() to fail (ditto for unpack-objects, though it would probably write as soon as we've received a whole single object). If we can't enumerate the list of objects via readdir() or similar, I think there's really not much we can do. -Peff
John Cai via GitGitGadget píše v Po 26. 09. 2022 v 23:53 +0000: > From: John Cai <johncai86@gmail.com> > > In the tmp-objdir api, tmp_objdir_create will create a temporary > directory but also register signal handlers responsible for removing > the directory's contents and the directory itself. However, the > function responsible for recursively removing the contents and > directory, remove_dir_recurse() calls opendir(3) and closedir(3). > This can be problematic because these functions allocate and free > memory, which are not async-signal-safe functions. This can lead to > deadlocks. > > One place we call tmp_objdir_create() is in git-receive-pack, where > we create a temporary quarantine directory "incoming". Incoming > objects will be written to this directory before they get moved to > the object directory. Just noticed this unattended git crash in the logs and I think it might actually be another occurrence of the same problem in principle, so shamelessly piggy-backing the stack trace here (no coredump preserved at this point, sorry): #0 0x00007f08df0ea06c __pthread_kill_implementation (libc.so.6 + 0x8b06c) #1 0x00007f08df098046 raise (libc.so.6 + 0x39046) #2 0x00007f08df0817fc abort (libc.so.6 + 0x227fc) #3 0x00007f08df082533 __libc_message.cold (libc.so.6 + 0x23533) #4 0x00007f08df090a67 __libc_assert_fail (libc.so.6 + 0x31a67) #5 0x00007f08df0f68f2 sysmalloc (libc.so.6 + 0x978f2) #6 0x00007f08df0f7789 _int_malloc (libc.so.6 + 0x98789) #7 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2) #8 0x00007f08df12fa55 __alloc_dir (libc.so.6 + 0xd0a55) ->#9 0x00007f08df12fac2 opendir_tail (libc.so.6 + 0xd0ac2) #10 0x00005632ea10c823 remove_temporary_files.lto_priv.0 (git + 0xdc823) ->#11 0x00005632ea10c97c remove_pack_on_signal.lto_priv.0 (git + 0xdc97c) #12 0x00007f08df0980f0 __restore_rt (libc.so.6 + 0x390f0) #13 0x00007f08df0f7775 _int_malloc (libc.so.6 + 0x98775) #14 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2) #15 0x00007f08df0d2b44 _IO_file_doallocate (libc.so.6 + 0x73b44) #16 0x00007f08df0e1d20 _IO_doallocbuf (libc.so.6 + 0x82d20) #17 0x00007f08df0e0a8c _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81a8c) #18 0x00007f08df0d4598 __getdelim (libc.so.6 + 0x75598) #19 0x00005632ea29f372 strbuf_getwholeline (git + 0x26f372) #20 0x00005632ea117915 cmd_repack (git + 0xe7915) #21 0x00005632ea0556fa handle_builtin.lto_priv.0 (git + 0x256fa) #22 0x00005632ea050551 main (git + 0x20551) #23 0x00007f08df082a50 __libc_start_call_main (libc.so.6 + 0x23a50) #24 0x00007f08df082b09 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b09) #25 0x00005632ea051555 _start (git + 0x21555) As per the captured info, following Fedora x86_64 packages were involved: git-2.37.3-1.fc38 zlib-1.2.12-5.fc38 pcre2-10.40-1.fc37.1 Full command at hand: /usr/libexec/git-core/git repack -d -l --no-write-bitmap-index It happened twice, actually, about a week's time apart, but that was a month ago while this unattended task runs hourly till today and git hasn't been updated since. Not subscribed to the list and I don't think I can provide more info on this, but feel free to contact me directly. -- poki
On Thu, Oct 20, 2022 at 01:58:58PM +0200, Jan Pokorný wrote: > Just noticed this unattended git crash in the logs and I think it might > actually be another occurrence of the same problem in principle, so > shamelessly piggy-backing the stack trace here (no coredump preserved > at this point, sorry): > > #0 0x00007f08df0ea06c __pthread_kill_implementation (libc.so.6 + 0x8b06c) > #1 0x00007f08df098046 raise (libc.so.6 + 0x39046) > #2 0x00007f08df0817fc abort (libc.so.6 + 0x227fc) > #3 0x00007f08df082533 __libc_message.cold (libc.so.6 + 0x23533) > #4 0x00007f08df090a67 __libc_assert_fail (libc.so.6 + 0x31a67) > #5 0x00007f08df0f68f2 sysmalloc (libc.so.6 + 0x978f2) > #6 0x00007f08df0f7789 _int_malloc (libc.so.6 + 0x98789) > #7 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2) > #8 0x00007f08df12fa55 __alloc_dir (libc.so.6 + 0xd0a55) > ->#9 0x00007f08df12fac2 opendir_tail (libc.so.6 + 0xd0ac2) > #10 0x00005632ea10c823 remove_temporary_files.lto_priv.0 (git + 0xdc823) > ->#11 0x00005632ea10c97c remove_pack_on_signal.lto_priv.0 (git + 0xdc97c) > #12 0x00007f08df0980f0 __restore_rt (libc.so.6 + 0x390f0) > #13 0x00007f08df0f7775 _int_malloc (libc.so.6 + 0x98775) > #14 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2) > #15 0x00007f08df0d2b44 _IO_file_doallocate (libc.so.6 + 0x73b44) > #16 0x00007f08df0e1d20 _IO_doallocbuf (libc.so.6 + 0x82d20) > #17 0x00007f08df0e0a8c _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81a8c) > #18 0x00007f08df0d4598 __getdelim (libc.so.6 + 0x75598) > #19 0x00005632ea29f372 strbuf_getwholeline (git + 0x26f372) > #20 0x00005632ea117915 cmd_repack (git + 0xe7915) > #21 0x00005632ea0556fa handle_builtin.lto_priv.0 (git + 0x256fa) > #22 0x00005632ea050551 main (git + 0x20551) > #23 0x00007f08df082a50 __libc_start_call_main (libc.so.6 + 0x23a50) > #24 0x00007f08df082b09 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b09) > #25 0x00005632ea051555 _start (git + 0x21555) Good find. This is definitely the same issue. Unfortunately it's another hard one to fix. git-repack asks git-pack-objects to use .tmp-pack-$$ as a prefix. So I thought we might be able to get away with using register_tempfile() to just record the names of potential files it creates by appending .pack, .idx, and so forth. That code is well-tested and is careful to be signal-safe. But the actual name will include the checksum of the packfile, so .tmp-pack-$$-1234abcd.pack, etc. And repack doesn't know those until pack-objects reports the hash to us. OTOH, I think pack-objects doesn't create the files with that name until the very last minute (it has its own tempfiles[1]). So we could probably get away with expanding the filenames once it tells us the hash. Something like the patch below. diff --git a/builtin/repack.c b/builtin/repack.c index a5bacc7797..8d6121b438 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -122,13 +122,6 @@ static void remove_temporary_files(void) strbuf_release(&buf); } -static void remove_pack_on_signal(int signo) -{ - remove_temporary_files(); - sigchain_pop(signo); - raise(signo); -} - /* * Adds all packs hex strings to either fname_nonkept_list or * fname_kept_list based on whether each pack has a corresponding @@ -268,6 +261,17 @@ static unsigned populate_pack_exts(char *name) return ret; } +static void register_tmp_pack_for_cleanup(const char *name) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(exts); i++) { + char *path = xstrfmt("%s-%s%s", packtmp, name, exts[i].name); + register_tempfile(path); + free(path); + } +} + static void repack_promisor_objects(const struct pack_objects_args *args, struct string_list *names) { @@ -714,6 +718,7 @@ static int write_cruft_pack(const struct pack_objects_args *args, die(_("repack: Expecting full hex object ID lines only " "from pack-objects.")); string_list_append(names, line.buf); + register_tmp_pack_for_cleanup(line.buf); } fclose(out); @@ -859,8 +864,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix) split_pack_geometry(geometry, geometric_factor); } - sigchain_push_common(remove_pack_on_signal); - prepare_pack_objects(&cmd, &po_args); show_progress = !po_args.quiet && isatty(2); @@ -955,6 +958,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (line.len != the_hash_algo->hexsz) die(_("repack: Expecting full hex object ID lines only from pack-objects.")); string_list_append(&names, line.buf); + register_tmp_pack_for_cleanup(line.buf); } fclose(out); ret = finish_command(&cmd); @@ -1025,6 +1029,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix) else if (unlink(fname) < 0 && errno != ENOENT) die_errno(_("could not unlink: %s"), fname); + /* + * We should deregister pack tempfiles here; leaving + * them kind of works in that we'll try to unlink + * something that is already gone, but it would be + * cleaner to note that we expect them to be gone here. + * We could perhaps even just use rename_tempfile() and + * delete_tempfile() here, if we modified item->util to + * keep track of the tempfile objects (rather than + * single bits). + */ free(fname); free(fname_old); } -Peff [1] Note that to some degree, this cleanup is already a bit of a fool's errand. It is only catching the "finished" packfiles before we move them to their final names. Whereas pack-objects is perfectly happy to write bytes into tmp_pack_XXXXXX, and does not bother to clean it if it runs into an error or signal. And that is where most of the time will be spent. So one option would be to just drop this signal handling entirely. But IMHO we should go the other way, and teach pack-objects to also clean up after itself. I know GitHub has been running with patches to call register_tempfile() like this for years (but it is hacky the way the above patch is; it never unregisters, which is why we never sent them upstream). Another problem is that repack.c only catches signals, not die(), which is the more likely culprit (e.g., you generate the "main" pack, and then something goes wrong while renaming or generating an ancillary cruft-packs). Switching to register_tempfile() does fix that, though.
diff --git a/dir.c b/dir.c index 75429508200..a3183cb043f 100644 --- a/dir.c +++ b/dir.c @@ -3244,7 +3244,7 @@ void strip_dir_trailing_slashes(char *dir) static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) { - DIR *dir; + DIR *dir = NULL; struct dirent *e; int ret = 0, original_len = path->len, len, kept_down = 0; int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY); @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) } flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; - dir = opendir(path->buf); + + if ((flag & REMOVE_DIR_SIGNAL) == 0) + dir = opendir(path->buf); + if (!dir) { if (errno == ENOENT) return keep_toplevel ? -1 : 0; diff --git a/dir.h b/dir.h index 674747d93af..ba159f4abeb 100644 --- a/dir.h +++ b/dir.h @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl); /* Remove the_original_cwd too */ #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08 +/* Indicates a signal is being handled */ +#define REMOVE_DIR_SIGNAL 0x16 + /* * Remove path and its contents, recursively. flags is a combination * of the above REMOVE_DIR_* constants. Return 0 on success. diff --git a/tmp-objdir.c b/tmp-objdir.c index adf6033549e..13943098738 100644 --- a/tmp-objdir.c +++ b/tmp-objdir.c @@ -34,6 +34,7 @@ static void tmp_objdir_free(struct tmp_objdir *t) static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal) { int err; + int flags = 0; if (!t) return 0; @@ -49,7 +50,11 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal) * have pre-grown t->path sufficiently so that this * doesn't happen in practice. */ - err = remove_dir_recursively(&t->path, 0); + + if (on_signal) + flags = flags | REMOVE_DIR_SIGNAL; + + err = remove_dir_recursively(&t->path, flags); /* * When we are cleaning up due to a signal, we won't bother