
run-command: don't spam trace2_child_exit()

Message ID 4616d09ffa632bd2c9e308a713c4bdf2a1328c3c.1651179450.git.steadmon@google.com (mailing list archive)
State Accepted
Commit ce3986bb224067ecdb372d7cb7a56d22682e9a31
Series run-command: don't spam trace2_child_exit()

Commit Message

Josh Steadmon April 28, 2022, 8:58 p.m. UTC
In rare cases, wait_or_whine() cannot determine a child process's exit
status (and will return -1 in this case). This can cause Git to issue
trace2 child_exit events despite the fact that the child is still
running. In pathological cases, we've seen > 80 million exit events in
our trace logs for a single child process.

Fix this by only issuing trace2 events in finish_command_in_signal() if
we get a value other than -1 from wait_or_whine().

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 run-command.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3

Comments

Junio C Hamano April 28, 2022, 9:46 p.m. UTC | #1
Josh Steadmon <steadmon@google.com> writes:

> In rare cases, wait_or_whine() cannot determine a child process's exit
> status (and will return -1 in this case). This can cause Git to issue
> trace2 child_exit events despite the fact that the child is still
> running.

Rather, we do not even know if the child is still running when it
happens, right?  It is curious what "rare cases" makes the symptom
appear.  Do we know?

The patch looks OK from the "we do not know the child exited in this
case, so we shouldn't be reporting the child exit" point of view, of
course.  Having one event that started a child in the log and then
having millions of events that report the exit of the (same) child
is way too broken.  With this change, we remove these phoney exit
events from the log.

Do we know, for such a child process that caused these millions of
phoney exit events, whether we got a real exit event at the end?  Otherwise,
we'd still have a similar problem in the opposite direction, i.e. a
child has a start event recorded, many exit events discarded but the
log lacks the true exit event for the child, implying that the child
is still running because we failed to log its exit?

>  int finish_command_in_signal(struct child_process *cmd)
>  {
>  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> -	trace2_child_exit(cmd, ret);
> +	if (ret != -1)
> +		trace2_child_exit(cmd, ret);
>  	return ret;
>  }

Will queue; thanks.
Jeff Hostetler May 3, 2022, 2:59 p.m. UTC | #2
On 4/28/22 5:46 PM, Junio C Hamano wrote:
> Josh Steadmon <steadmon@google.com> writes:
> 
>> In rare cases, wait_or_whine() cannot determine a child process's exit
>> status (and will return -1 in this case). This can cause Git to issue
>> trace2 child_exit events despite the fact that the child is still
>> running.

I'm curious what is causing the spurious return values.
Could you instrument wait_or_whine() and see which of the
if/else arms are causing the -1 to be returned?

That routine is rather complicated and looks like it has 3
different ways that a -1 could be returned.
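
Something like the following is what I have in mind, purely as a
sketch: the branch layout is only a guess at the general shape of
wait_or_whine(), not a copy of it, and whine_mark() is a made-up
helper.  The point is just to give each arm that can yield -1 its own
marker so the log tells us which one fired.

#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* async-signal-safe breadcrumb; fprintf() is not safe in a handler */
static void whine_mark(const char *msg)
{
	if (write(2, msg, strlen(msg)) < 0)
		; /* nothing useful to do about a failed write here */
}

/* illustrative only: not the real wait_or_whine() */
static int wait_or_whine_sketch(pid_t pid)
{
	int status, code = -1;
	pid_t waiting;

	while ((waiting = waitpid(pid, &status, 0)) < 0 && errno == EINTR)
		; /* retry interrupted waits */

	if (waiting < 0)
		whine_mark("arm 1: waitpid() itself failed\n");
	else if (waiting != pid)
		whine_mark("arm 2: reaped an unexpected pid\n");
	else if (WIFEXITED(status))
		code = WEXITSTATUS(status);
	else if (WIFSIGNALED(status))
		code = WTERMSIG(status) + 128;
	else
		whine_mark("arm 3: neither exited nor signaled\n");

	return code;
}

int main(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(42);	/* child exits with a known code */
	return wait_or_whine_sketch(pid) == 42 ? 0 : 1;
}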


> 
> Rather, we do not even know if the child is still running when it
> happens, right?  It is curious what "rare cases" makes the symptom
> appear.  Do we know?
> 
> The patch looks OK from the "we do not know the child exited in this
> case, so we shouldn't be reporting the child exit" point of view, of
> course.  Having one event that started a child in the log and then
> having millions of events that report the exit of the (same) child
> is way too broken.  With this change, we remove these phoney exit
> events from the log.
> 
> Do we know, for such a child process that caused these millions of
> phoney exit events, whether we got a real exit event at the end?  Otherwise,
> we'd still have a similar problem in the opposite direction, i.e. a
> child has a start event recorded, many exit events discarded but the
> log lacks the true exit event for the child, implying that the child
> is still running because we failed to log its exit?
> 
>>   int finish_command_in_signal(struct child_process *cmd)
>>   {
>>   	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
>> -	trace2_child_exit(cmd, ret);
>> +	if (ret != -1)
>> +		trace2_child_exit(cmd, ret);
>>   	return ret;
>>   }

Since this is only called from pager.c and is used to set up the
pager, I have to wonder if you're only getting these spurious events
for the pager process or for any of the other random child processes.

And whether they are received while the pager is alive and working
properly, or when you're trying to quit the pager or when the pager
is trying to signal eof.

> 
> Will queue; thanks.
>
Josh Steadmon May 5, 2022, 7:44 p.m. UTC | #3
On 2022.04.28 14:46, Junio C Hamano wrote:
> Josh Steadmon <steadmon@google.com> writes:
> 
> > In rare cases, wait_or_whine() cannot determine a child process's exit
> > status (and will return -1 in this case). This can cause Git to issue
> > trace2 child_exit events despite the fact that the child is still
> > running.
> 
> Rather, we do not even know if the child is still running when it
> happens, right?

Correct; if you'd like me to clarify the commit message, I'll send a V2.


> It is curious what "rare cases" makes the symptom
> appear.  Do we know?

Unfortunately, no. The case with more than 80 million exit events
mentioned in the commit message was not reproducible.

> The patch looks OK from the "we do not know the child exited in this
> case, so we shouldn't be reporting the child exit" point of view, of
> course.  Having one event that started a child in the log and then
> having millions of events that report the exit of the (same) child
> is way too broken.  With this change, we remove these phoney exit
> events from the log.
> 
> Do we know, for such a child process that caused these millions of
> phoney exit events, whether we got a real exit event at the end?

We don't know. The trace log filled up the user's disk in this case, so
the log was truncated.


> Otherwise,
> we'd still have a similar problem in the opposite direction, i.e. a
> child has a start event recorded, many exit events discarded but the
> log lacks the true exit event for the child, implying that the child
> is still running because we failed to log its exit?

Yes, that is a weakness with this approach.


> >  int finish_command_in_signal(struct child_process *cmd)
> >  {
> >  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > -	trace2_child_exit(cmd, ret);
> > +	if (ret != -1)
> > +		trace2_child_exit(cmd, ret);
> >  	return ret;
> >  }
> 
> Will queue; thanks.

Thanks, and sorry for the delayed reply; I've been out sick for a few
days.
Josh Steadmon May 5, 2022, 7:58 p.m. UTC | #4
On 2022.05.03 10:59, Jeff Hostetler wrote:
> 
> 
> On 4/28/22 5:46 PM, Junio C Hamano wrote:
> > Josh Steadmon <steadmon@google.com> writes:
> > 
> > > In rare cases, wait_or_whine() cannot determine a child process's exit
> > > status (and will return -1 in this case). This can cause Git to issue
> > > trace2 child_exit events despite the fact that the child is still
> > > running.
> 
> I'm curious what is causing the spurious return values.
> Could you instrument wait_or_whine() and see which of the
> if/else arms are causing the -1 to be returned?
> 
> That routine is rather complicated and looks like it has 3
> different ways that a -1 could be returned.

Unfortunately we haven't been able to reproduce the issue, so
instrumentation won't clarify the cause unless we randomly hit another
instance.


> > 
> > Rather, we do not even know if the child is still running when it
> > happens, right?  It is curious what "rare cases" makes the symptom
> > appear.  Do we know?
> > 
> > The patch looks OK from the "we do not know the child exited in this
> > case, so we shouldn't be reporting the child exit" point of view, of
> > course.  Having one event that started a child in the log and then
> > having millions of events that report the exit of the (same) child
> > is way too broken.  With this change, we remove these phoney exit
> > events from the log.
> > 
> > Do we know, for such a child process that caused these millions of
> > phoney exit events, whether we got a real exit event at the end?  Otherwise,
> > we'd still have a similar problem in the opposite direction, i.e. a
> > child has a start event recorded, many exit events discarded but the
> > log lacks the true exit event for the child, implying that the child
> > is still running because we failed to log its exit?
> > 
> > >   int finish_command_in_signal(struct child_process *cmd)
> > >   {
> > >   	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > > -	trace2_child_exit(cmd, ret);
> > > +	if (ret != -1)
> > > +		trace2_child_exit(cmd, ret);
> > >   	return ret;
> > >   }
> 
> Since this is only called from pager.c and is used to set up the
> pager, I have to wonder if you're only getting these spurious events
> for the pager process or for any of the other random child processes.
> 
> And whether they are received while the pager is alive and working
> properly, or when you're trying to quit the pager or when the pager
> is trying to signal eof.

It was for a pager run by `git diff`. The only other child process in
this instance was `git status`. I believe the events were generated
while the user was not interacting with the process; IIRC the pager was
left running in a terminal overnight and they discovered the next
morning that their disk was filled up by trace logs.
Jeff Hostetler May 10, 2022, 8:37 p.m. UTC | #5
On 5/5/22 3:58 PM, Josh Steadmon wrote:
> On 2022.05.03 10:59, Jeff Hostetler wrote:
>>
>>
>> On 4/28/22 5:46 PM, Junio C Hamano wrote:
>>> Josh Steadmon <steadmon@google.com> writes:
>>>
>>>> In rare cases, wait_or_whine() cannot determine a child process's exit
>>>> status (and will return -1 in this case). This can cause Git to issue
>>>> trace2 child_exit events despite the fact that the child is still
>>>> running.
>>
>> I'm curious what is causing the spurious return values.
>> Could you instrument wait_or_whine() and see which of the
>> if/else arms are causing the -1 to be returned?
>>
>> That routine is rather complicated and looks like it has 3
>> different ways that a -1 could be returned.
> 
> Unfortunately we haven't been able to reproduce the issue, so
> instrumentation won't clarify the cause unless we randomly hit another
> instance.
> 
> 
>>>
>>> Rather, we do not even know if the child is still running when it
>>> happens, right?  It is curious what "rare cases" makes the symptom
>>> appear.  Do we know?
>>>
>>> The patch looks OK from the "we do not know the child exited in this
>>> case, so we shouldn't be reporting the child exit" point of view, of
>>> course.  Having one event that started a child in the log and then
>>> having millions of events that report the exit of the (same) child
>>> is way too broken.  With this change, we remove these phoney exit
>>> events from the log.
>>>
>>> Do we know, for such a child process that caused these millions of
>>> phoney exit events, whether we got a real exit event at the end?  Otherwise,
>>> we'd still have a similar problem in the opposite direction, i.e. a
>>> child has a start event recorded, many exit events discarded but the
>>> log lacks the true exit event for the child, implying that the child
>>> is still running because we failed to log its exit?
>>>
>>>>    int finish_command_in_signal(struct child_process *cmd)
>>>>    {
>>>>    	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
>>>> -	trace2_child_exit(cmd, ret);
>>>> +	if (ret != -1)
>>>> +		trace2_child_exit(cmd, ret);
>>>>    	return ret;
>>>>    }
>>
>> Since this is only called from pager.c and is used to set up the
>> pager, I have to wonder if you're only getting these spurious events
>> for the pager process or for any of the other random child processes.
>>
>> And whether they are received while the pager is alive and working
>> properly, or when you're trying to quit the pager or when the pager
>> is trying to signal eof.
> 
> It was for a pager run by `git diff`. The only other child process in
> this instance was `git status`. I believe the events were generated
> while the user was not interacting with the process; IIRC the pager was
> left running in a terminal overnight and they discovered the next
> morning that their disk was filled up by trace logs.
> 

This worries me: it sounds like you've found a race or an
infinite loop here, and the only symptom is the child exit
messages.  Suppressing them solves the disk-fill problem, but
it might mean that there's a lurking infinite loop or something.

Can you (privately) send me the part of the log before the
infinite loop?  And maybe the parent and child processes too?
I'd kinda like to understand what's happening here.

Jeff
Josh Steadmon June 7, 2022, 6:45 p.m. UTC | #6
On 2022.05.10 16:37, Jeff Hostetler wrote:
> 
> 
> On 5/5/22 3:58 PM, Josh Steadmon wrote:
> > On 2022.05.03 10:59, Jeff Hostetler wrote:
> > > 
> > > 
> > > On 4/28/22 5:46 PM, Junio C Hamano wrote:
> > > > Josh Steadmon <steadmon@google.com> writes:
> > > > 
> > > > > In rare cases, wait_or_whine() cannot determine a child process's exit
> > > > > status (and will return -1 in this case). This can cause Git to issue
> > > > > trace2 child_exit events despite the fact that the child is still
> > > > > running.
> > > 
> > > I'm curious what is causing the spurious return values.
> > > Could you instrument wait_or_whine() and see which of the
> > > if/else arms are causing the -1 to be returned?
> > > 
> > > That routine is rather complicated and looks like it has 3
> > > different ways that a -1 could be returned.
> > 
> > Unfortunately we haven't been able to reproduce the issue, so
> > instrumentation won't clarify the cause unless we randomly hit another
> > instance.
> > 
> > 
> > > > 
> > > > Rather, we do not even know if the child is still running when it
> > > > happens, right?  It is curious what "rare cases" makes the symptom
> > > > appear.  Do we know?
> > > > 
> > > > The patch looks OK from the "we do not know the child exited in this
> > > > case, so we shouldn't be reporting the child exit" point of view, of
> > > > course.  Having one event that started a child in the log and then
> > > > having millions of events that report the exit of the (same) child
> > > > is way too broken.  With this change, we remove these phoney exit
> > > > events from the log.
> > > > 
> > > > Do we know, for such a child process that caused these millions of
> > > > phoney exit events, whether we got a real exit event at the end?  Otherwise,
> > > > we'd still have a similar problem in the opposite direction, i.e. a
> > > > child has a start event recorded, many exit events discarded but the
> > > > log lacks the true exit event for the child, implying that the child
> > > > is still running because we failed to log its exit?
> > > > 
> > > > >    int finish_command_in_signal(struct child_process *cmd)
> > > > >    {
> > > > >    	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > > > > -	trace2_child_exit(cmd, ret);
> > > > > +	if (ret != -1)
> > > > > +		trace2_child_exit(cmd, ret);
> > > > >    	return ret;
> > > > >    }
> > > 
> > > Since this is only called from pager.c and is used to set up the
> > > pager, I have to wonder if you're only getting these spurious events
> > > for the pager process or for any of the other random child processes.
> > > 
> > > And whether they are received while the pager is alive and working
> > > properly, or when you're trying to quit the pager or when the pager
> > > is trying to signal eof.
> > 
> > It was for a pager run by `git diff`. The only other child process in
> > this instance was `git status`. I believe the events were generated
> > while the user was not interacting with the process; IIRC the pager was
> > left running in a terminal overnight and they discovered the next
> > morning that their disk was filled up by trace logs.
> > 
> 
> This worries me: it sounds like you've found a race or an
> infinite loop here, and the only symptom is the child exit
> messages.  Suppressing them solves the disk-fill problem, but
> it might mean that there's a lurking infinite loop or something.
> 
> Can you (privately) send me the part of the log before the
> infinite loop?  And maybe the parent and child processes too?
> I'd kinda like to understand what's happening here.
> 
> Jeff

Unfortunately we no longer have the relevant logs. This incident
happened in 2020 and the logs have expired. I can share some general
details though:

The session started on a Friday when the user ran `git diff`. Presumably
the user left the pager open after they left for the weekend. On the
following Saturday we started getting ~30M child_exit events per day
until their disk filled up early Monday morning.

The process was running on a VM, and I have reason to believe that the
user's SSH connection to the VM was broken at some point on the Saturday
that the spammy events started appearing. I don't know whether or not
the user was using any sort of session manager / terminal multiplexer
such as screen or tmux. I am curious whether or not the broken SSH
connection could have created some kind of signal loop that caused the
pager to continuously trigger our wait_for_pager_signal() handler.
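
To make that hypothesis concrete, here is a tiny standalone toy (not
Git code; the finish_command_in_signal() stand-in, the SIGHUP choice,
and the counters are all invented for illustration).  Every re-entry
of a handler whose wait cannot determine the child's status yields
another -1; before the patch each of those became a child_exit event,
and with the `ret != -1` guard they are simply dropped:

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

static pid_t fake_child_pid = -1;	/* no such child: waitpid() keeps failing */
static volatile sig_atomic_t bogus_results, logged_exits;

/* toy stand-in for finish_command_in_signal(), not the real one */
static int finish_in_signal_sketch(void)
{
	int status;
	int ret = -1;

	if (waitpid(fake_child_pid, &status, WNOHANG) > 0 && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	if (ret != -1)
		logged_exits++;		/* patched behaviour: log real exits only */
	else
		bogus_results++;	/* each of these used to become a child_exit event */
	return ret;
}

static void on_sighup(int signo)
{
	(void)signo;
	finish_in_signal_sketch();
}

int main(void)
{
	int i;

	signal(SIGHUP, on_sighup);
	for (i = 0; i < 5; i++)		/* stand-in for repeated signal delivery */
		raise(SIGHUP);

	printf("real exits logged: %d\n", (int)logged_exits);
	printf("bogus -1 results suppressed: %d\n", (int)bogus_results);
	return 0;
}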

I don't have any useful info beyond that, sorry. Thanks for taking a
look though!

Patch

diff --git a/run-command.c b/run-command.c
index a8501e38ce..e0fe2418a2 100644
--- a/run-command.c
+++ b/run-command.c
@@ -983,7 +983,8 @@  int finish_command(struct child_process *cmd)
 int finish_command_in_signal(struct child_process *cmd)
 {
 	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
-	trace2_child_exit(cmd, ret);
+	if (ret != -1)
+		trace2_child_exit(cmd, ret);
 	return ret;
 }