Message ID | 50d872a057a558fa5519856b95abd048ddb514dc.1654625626.git.steadmon@google.com (mailing list archive) |
---|---|
State | Accepted |
Commit | ce3986bb224067ecdb372d7cb7a56d22682e9a31 |
Headers | show |
Series | [v2] run-command: don't spam trace2_child_exit() | expand |
On Tue, Jun 07 2022, Josh Steadmon wrote: > In rare cases[1], wait_or_whine() cannot determine a child process's > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child may still be > running. In pathological cases, we've seen > 80 million exit events in > our trace logs for a single child process. > > Fix this by only issuing trace2 events in finish_command_in_signal() if > we get a value other than -1 from wait_or_whine(). This can lead to > missing child_exit events in such a case, but that is preferable to > duplicating events on a scale that threatens to fill the user's > filesystem with invalid trace logs. > > [1]: This can happen when: > > * waitpid() returns -1 and errno != EINTR > * waitpid() returns an invalid PID > * the status set by waitpid() has neither the WIFEXITED() nor > WIFSIGNALED() flags > > Signed-off-by: Josh Steadmon <steadmon@google.com> > --- > Updated the commit message with more details about when wait_or_whine() > can fail. > > run-command.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/run-command.c b/run-command.c > index a8501e38ce..e0fe2418a2 100644 > --- a/run-command.c > +++ b/run-command.c > @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } This seems like a legitimate issue, but I really don't think we should sweep this under the rug like this. * Why can't we see if we logged such an event already in common_exit(), if we didn't we should trace2_child_exit() (or similar). I.e. not miss an event, ever. * Should this really be an "exit" event, aren't some of these failed signal events? Per the "should this be an exit event?" question in my related "signal on BUG" series. * We should have tests here, e.g. in t0210 to see the exact events we emit in certain cases, we really should have a test for this. Perhaps we can instrument a simulated failure with some GIT_TEST_* variables?
On 6/7/22 2:21 PM, Josh Steadmon wrote: > In rare cases[1], wait_or_whine() cannot determine a child process's > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child may still be > running. In pathological cases, we've seen > 80 million exit events in > our trace logs for a single child process. > > Fix this by only issuing trace2 events in finish_command_in_signal() if > we get a value other than -1 from wait_or_whine(). This can lead to > missing child_exit events in such a case, but that is preferable to > duplicating events on a scale that threatens to fill the user's > filesystem with invalid trace logs. > > [1]: This can happen when: > > * waitpid() returns -1 and errno != EINTR > * waitpid() returns an invalid PID > * the status set by waitpid() has neither the WIFEXITED() nor > WIFSIGNALED() flags > > Signed-off-by: Josh Steadmon <steadmon@google.com> > --- > Updated the commit message with more details about when wait_or_whine() > can fail. > > run-command.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/run-command.c b/run-command.c > index a8501e38ce..e0fe2418a2 100644 > --- a/run-command.c > +++ b/run-command.c > @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } > > > base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3 > I think there are larger bugs lurking here. finish_command_in_signal() is only called from wait_for_pager_signal(), which is loaded up by sigchain_push_common() when the pager is started. There are no loops or other callers here. We sigchain-pop our handler upon first use, so even if we were getting millions of PIPE signals from other chilld processes, we shouldn't be getting more than 1 message here. (Assuming that our pop is working and our call to raise() doesn't cause us to be called again.) The code is behaving as if the "s->n--" at the bottom of sigchain_pop() isn't being performed and our raise() is calling us again. I originally thought it might be due to unbalanced push/pops -- which could happen if the process were in a "push; do something; pop" sequence and the PIPE happened in the middle of that. But I don't think that matters. So I have to wonder if the sigchain code has a bug. The Linux man-page states that signal(2) is not thread-safe and even if it were, the sigchain code here does not have any locking on those global data structures. But that might not be the problem here and it's hard to say without any of the original trace data. Having said all of that I think it would be better to fix the original problem something like this (not tested): static void wait_for_pager_signal(int signo) { + static int handled = 0; + + if (handled) + return; + handled = 1; + close_pager_fds(); finish_command_in_signal(&pager_process); sigchain_pop(signo); raise(signo); } Jeff
diff --git a/run-command.c b/run-command.c index a8501e38ce..e0fe2418a2 100644 --- a/run-command.c +++ b/run-command.c @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) int finish_command_in_signal(struct child_process *cmd) { int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); - trace2_child_exit(cmd, ret); + if (ret != -1) + trace2_child_exit(cmd, ret); return ret; }
In rare cases[1], wait_or_whine() cannot determine a child process's status (and will return -1 in this case). This can cause Git to issue trace2 child_exit events despite the fact that the child may still be running. In pathological cases, we've seen > 80 million exit events in our trace logs for a single child process. Fix this by only issuing trace2 events in finish_command_in_signal() if we get a value other than -1 from wait_or_whine(). This can lead to missing child_exit events in such a case, but that is preferable to duplicating events on a scale that threatens to fill the user's filesystem with invalid trace logs. [1]: This can happen when: * waitpid() returns -1 and errno != EINTR * waitpid() returns an invalid PID * the status set by waitpid() has neither the WIFEXITED() nor WIFSIGNALED() flags Signed-off-by: Josh Steadmon <steadmon@google.com> --- Updated the commit message with more details about when wait_or_whine() can fail. run-command.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3