diff mbox series

[v2] run-command: don't spam trace2_child_exit()

Message ID 50d872a057a558fa5519856b95abd048ddb514dc.1654625626.git.steadmon@google.com (mailing list archive)
State Accepted
Commit ce3986bb224067ecdb372d7cb7a56d22682e9a31
Headers show
Series [v2] run-command: don't spam trace2_child_exit() | expand

Commit Message

Josh Steadmon June 7, 2022, 6:21 p.m. UTC
In rare cases[1], wait_or_whine() cannot determine a child process's
status (and will return -1 in this case). This can cause Git to issue
trace2 child_exit events despite the fact that the child may still be
running. In pathological cases, we've seen > 80 million exit events in
our trace logs for a single child process.

Fix this by only issuing trace2 events in finish_command_in_signal() if
we get a value other than -1 from wait_or_whine(). This can lead to
missing child_exit events in such a case, but that is preferable to
duplicating events on a scale that threatens to fill the user's
filesystem with invalid trace logs.

[1]: This can happen when:

* waitpid() returns -1 and errno != EINTR
* waitpid() returns an invalid PID
* the status set by waitpid() has neither the WIFEXITED() nor
  WIFSIGNALED() flags

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
Updated the commit message with more details about when wait_or_whine()
can fail.

 run-command.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3

Comments

Ævar Arnfjörð Bjarmason June 7, 2022, 10:09 p.m. UTC | #1
On Tue, Jun 07 2022, Josh Steadmon wrote:

> In rare cases[1], wait_or_whine() cannot determine a child process's
> status (and will return -1 in this case). This can cause Git to issue
> trace2 child_exit events despite the fact that the child may still be
> running. In pathological cases, we've seen > 80 million exit events in
> our trace logs for a single child process.
>
> Fix this by only issuing trace2 events in finish_command_in_signal() if
> we get a value other than -1 from wait_or_whine(). This can lead to
> missing child_exit events in such a case, but that is preferable to
> duplicating events on a scale that threatens to fill the user's
> filesystem with invalid trace logs.
>
> [1]: This can happen when:
>
> * waitpid() returns -1 and errno != EINTR
> * waitpid() returns an invalid PID
> * the status set by waitpid() has neither the WIFEXITED() nor
>   WIFSIGNALED() flags
>
> Signed-off-by: Josh Steadmon <steadmon@google.com>
> ---
> Updated the commit message with more details about when wait_or_whine()
> can fail.
>
>  run-command.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/run-command.c b/run-command.c
> index a8501e38ce..e0fe2418a2 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd)
>  int finish_command_in_signal(struct child_process *cmd)
>  {
>  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> -	trace2_child_exit(cmd, ret);
> +	if (ret != -1)
> +		trace2_child_exit(cmd, ret);
>  	return ret;
>  }

This seems like a legitimate issue, but I really don't think we should
sweep this under the rug like this.

 * Why can't we see if we logged such an event already in common_exit(),
   if we didn't we should trace2_child_exit() (or similar). I.e. not
   miss an event, ever.

 * Should this really be an "exit" event, aren't some of these failed
   signal events? Per the "should this be an exit event?" question in my
   related "signal on BUG" series.

 * We should have tests here, e.g. in t0210 to see the exact events we
   emit in certain cases, we really should have a test for this. Perhaps
   we can instrument a simulated failure with some GIT_TEST_* variables?
Jeff Hostetler June 10, 2022, 3:31 p.m. UTC | #2
On 6/7/22 2:21 PM, Josh Steadmon wrote:
> In rare cases[1], wait_or_whine() cannot determine a child process's
> status (and will return -1 in this case). This can cause Git to issue
> trace2 child_exit events despite the fact that the child may still be
> running. In pathological cases, we've seen > 80 million exit events in
> our trace logs for a single child process.
> 
> Fix this by only issuing trace2 events in finish_command_in_signal() if
> we get a value other than -1 from wait_or_whine(). This can lead to
> missing child_exit events in such a case, but that is preferable to
> duplicating events on a scale that threatens to fill the user's
> filesystem with invalid trace logs.
> 
> [1]: This can happen when:
> 
> * waitpid() returns -1 and errno != EINTR
> * waitpid() returns an invalid PID
> * the status set by waitpid() has neither the WIFEXITED() nor
>    WIFSIGNALED() flags
> 
> Signed-off-by: Josh Steadmon <steadmon@google.com>
> ---
> Updated the commit message with more details about when wait_or_whine()
> can fail.
> 
>   run-command.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/run-command.c b/run-command.c
> index a8501e38ce..e0fe2418a2 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd)
>   int finish_command_in_signal(struct child_process *cmd)
>   {
>   	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> -	trace2_child_exit(cmd, ret);
> +	if (ret != -1)
> +		trace2_child_exit(cmd, ret);
>   	return ret;
>   }
>   
> 
> base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3
> 


I think there are larger bugs lurking here.

finish_command_in_signal() is only called from
wait_for_pager_signal(), which is loaded up by
sigchain_push_common() when the pager is started.

There are no loops or other callers here.

We sigchain-pop our handler upon first use, so even if we
were getting millions of PIPE signals from other chilld
processes, we shouldn't be getting more than 1 message
here.  (Assuming that our pop is working and our call to
raise() doesn't cause us to be called again.)

The code is behaving as if the "s->n--" at the bottom of
sigchain_pop() isn't being performed and our raise() is
calling us again.

I originally thought it might be due to unbalanced push/pops -- which
could happen if the process were in a "push; do something; pop"
sequence and the PIPE happened in the middle of that.  But I don't
think that matters.

So I have to wonder if the sigchain code has a bug.

The Linux man-page states that signal(2) is not thread-safe
and even if it were, the sigchain code here does not have any
locking on those global data structures.

But that might not be the problem here and it's hard to say without
any of the original trace data.

Having said all of that I think it would be better to fix the
original problem something like this (not tested):

  static void wait_for_pager_signal(int signo)
  {
+	static int handled = 0;
+
+	if (handled)
+		return;
+	handled = 1;
+
	close_pager_fds();
	finish_command_in_signal(&pager_process);
	sigchain_pop(signo);
	raise(signo);
  }

Jeff
diff mbox series

Patch

diff --git a/run-command.c b/run-command.c
index a8501e38ce..e0fe2418a2 100644
--- a/run-command.c
+++ b/run-command.c
@@ -983,7 +983,8 @@  int finish_command(struct child_process *cmd)
 int finish_command_in_signal(struct child_process *cmd)
 {
 	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
-	trace2_child_exit(cmd, ret);
+	if (ret != -1)
+		trace2_child_exit(cmd, ret);
 	return ret;
 }