diff mbox series

monitor: flush messages on abort

Message ID 1699027289-213995-1-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Headers show
Series monitor: flush messages on abort | expand

Commit Message

Steve Sistare Nov. 3, 2023, 4:01 p.m. UTC
Buffered monitor output is lost when abort() is called.  The pattern
error_report() followed by abort() occurs about 60 times, so valuable
information is being lost when the abort is called in the context of a
monitor command.

To fix, install a SIGABRT handler to flush the monitor buffer to stderr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Daniel P. Berrangé Nov. 3, 2023, 5:33 p.m. UTC | #1
On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
> Buffered monitor output is lost when abort() is called.  The pattern
> error_report() followed by abort() occurs about 60 times, so valuable
> information is being lost when the abort is called in the context of a
> monitor command.

I'm curious, was there a particular abort() scenario that you hit ?

For some crude statistics:

  $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
  abort: 47
  return: 512
  exit: 458
  goto: 177

to me those numbers say that calling "abort()" after error_report
should be considered a bug, and we can blanket replace all the
abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
special case flushing the monitor.

Also I think there's a decent case to be made for error_report()
to call monitor_flush().

> 
> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/monitor/monitor.c b/monitor/monitor.c
> index dc352f9..65dace0 100644
> --- a/monitor/monitor.c
> +++ b/monitor/monitor.c
> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>      }
>  }
>  
> +#ifdef CONFIG_LINUX
> +
> +static void monitor_abort(int signal, siginfo_t *info, void *c)
> +{
> +    Monitor *mon = monitor_cur();
> +
> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
> +        return;
> +    }
> +
> +    if (mon->outbuf && mon->outbuf->len) {
> +        fputs("SIGABRT received: ", stderr);
> +        fputs(mon->outbuf->str, stderr);
> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
> +            fputc('\n', stderr);
> +        }
> +    }
> +
> +    qemu_mutex_unlock(&mon->mon_lock);

The SIGABRT handling does not only fire in response to abort()
calls, but also in response to bad memory scenarios, so we have
to be careful what we do in signal handlers.

In particular using mutexes in signal handlers is a big red
flag generally. Mutex APIs are not declare async signal
safe, so this code is technically a POSIX compliance
violation.

So I think we'd be safer just eliminating the explicit abort()
calls and adding monitor_flush call to error_report.

> +}
> +
> +static void monitor_add_abort_handler(void)
> +{
> +    struct sigaction act;
> +
> +    memset(&act, 0, sizeof(act));
> +    act.sa_sigaction = monitor_abort;
> +    act.sa_flags = SA_SIGINFO;
> +    sigaction(SIGABRT,  &act, NULL);
> +}
> +
> +#else
> +
> +static void monitor_add_abort_handler(void) {}
> +
> +#endif
> +
>  static void monitor_qapi_event_init(void)
>  {
>      monitor_qapi_event_state = g_hash_table_new(qapi_event_throttle_hash,
> @@ -712,6 +749,7 @@ void monitor_init_globals(void)
>      monitor_qapi_event_init();
>      qemu_mutex_init(&monitor_lock);
>      coroutine_mon = g_hash_table_new(NULL, NULL);
> +    monitor_add_abort_handler();
>  
>      /*
>       * The dispatcher BH must run in the main loop thread, since we
> -- 
> 1.8.3.1
> 
> 

With regards,
Daniel
Steve Sistare Nov. 3, 2023, 7:51 p.m. UTC | #2
On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>> Buffered monitor output is lost when abort() is called.  The pattern
>> error_report() followed by abort() occurs about 60 times, so valuable
>> information is being lost when the abort is called in the context of a
>> monitor command.
> 
> I'm curious, was there a particular abort() scenario that you hit ?

Yes, while tweaking the suspended state, and forgetting to add transitions:

        error_report("invalid runstate transition: '%s' -> '%s'",
        abort();

But I have previously hit this for other errors.

> For some crude statistics:
> 
>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>   abort: 47
>   return: 512
>   exit: 458
>   goto: 177
> 
> to me those numbers say that calling "abort()" after error_report
> should be considered a bug, and we can blanket replace all the
> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
> special case flushing the monitor.

And presumably add an atexit handler to flush the monitor ala monitor_abort.
AFAICT currently no destructor is called for the monitor at exit time.

> Also I think there's a decent case to be made for error_report()
> to call monitor_flush().

A good start, but that would not help for monitors with skip_flush=true, which 
need to format the buffered string in a json response, which is the case I 
tripped over.

>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 38 insertions(+)
>>
>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>> index dc352f9..65dace0 100644
>> --- a/monitor/monitor.c
>> +++ b/monitor/monitor.c
>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>      }
>>  }
>>  
>> +#ifdef CONFIG_LINUX
>> +
>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>> +{
>> +    Monitor *mon = monitor_cur();
>> +
>> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>> +        return;
>> +    }
>> +
>> +    if (mon->outbuf && mon->outbuf->len) {
>> +        fputs("SIGABRT received: ", stderr);
>> +        fputs(mon->outbuf->str, stderr);
>> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>> +            fputc('\n', stderr);
>> +        }
>> +    }
>> +
>> +    qemu_mutex_unlock(&mon->mon_lock);
> 
> The SIGABRT handling does not only fire in response to abort()
> calls, but also in response to bad memory scenarios, so we have
> to be careful what we do in signal handlers.
> 
> In particular using mutexes in signal handlers is a big red
> flag generally. Mutex APIs are not declare async signal
> safe, so this code is technically a POSIX compliance
> violation.

Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.

> So I think we'd be safer just eliminating the explicit abort()
> calls and adding monitor_flush call to error_report.

I like adding a handler because it is future proof.  No need to play whack-a-mole when
developers re-introduce abort() calls in the future.  A minor benefit is I would not
need ack's from 50 maintainers to change 50 call sites from abort to exit.  

A slight risk of the exit solution is that something bad happened at the call site, so 
qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
exit which will call the existing atexit handlers and could have side effects.

A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.

So: atexit, signal handler, or qemu_abort?  I will go with your preference.

- Steve

>> +}
>> +
>> +static void monitor_add_abort_handler(void)
>> +{
>> +    struct sigaction act;
>> +
>> +    memset(&act, 0, sizeof(act));
>> +    act.sa_sigaction = monitor_abort;
>> +    act.sa_flags = SA_SIGINFO;
>> +    sigaction(SIGABRT,  &act, NULL);
>> +}
>> +
>> +#else
>> +
>> +static void monitor_add_abort_handler(void) {}
>> +
>> +#endif
>> +
>>  static void monitor_qapi_event_init(void)
>>  {
>>      monitor_qapi_event_state = g_hash_table_new(qapi_event_throttle_hash,
>> @@ -712,6 +749,7 @@ void monitor_init_globals(void)
>>      monitor_qapi_event_init();
>>      qemu_mutex_init(&monitor_lock);
>>      coroutine_mon = g_hash_table_new(NULL, NULL);
>> +    monitor_add_abort_handler();
>>  
>>      /*
>>       * The dispatcher BH must run in the main loop thread, since we
>> -- 
>> 1.8.3.1
>>
>>
> 
> With regards,
> Daniel
Steve Sistare Nov. 3, 2023, 8:28 p.m. UTC | #3
On 11/3/2023 3:51 PM, Steven Sistare wrote:
> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>> Buffered monitor output is lost when abort() is called.  The pattern
>>> error_report() followed by abort() occurs about 60 times, so valuable
>>> information is being lost when the abort is called in the context of a
>>> monitor command.
>>
>> I'm curious, was there a particular abort() scenario that you hit ?
> 
> Yes, while tweaking the suspended state, and forgetting to add transitions:
> 
>         error_report("invalid runstate transition: '%s' -> '%s'",
>         abort();
> 
> But I have previously hit this for other errors.
> 
>> For some crude statistics:
>>
>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>>   abort: 47
>>   return: 512
>>   exit: 458
>>   goto: 177
>>
>> to me those numbers say that calling "abort()" after error_report
>> should be considered a bug, and we can blanket replace all the
>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>> special case flushing the monitor.
> 
> And presumably add an atexit handler to flush the monitor ala monitor_abort.
> AFAICT currently no destructor is called for the monitor at exit time.
> 
>> Also I think there's a decent case to be made for error_report()
>> to call monitor_flush().
> 
> A good start, but that would not help for monitors with skip_flush=true, which 
> need to format the buffered string in a json response, which is the case I 
> tripped over.
> 
>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 38 insertions(+)
>>>
>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>> index dc352f9..65dace0 100644
>>> --- a/monitor/monitor.c
>>> +++ b/monitor/monitor.c
>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>>      }
>>>  }
>>>  
>>> +#ifdef CONFIG_LINUX
>>> +
>>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>>> +{
>>> +    Monitor *mon = monitor_cur();
>>> +
>>> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>>> +        return;
>>> +    }
>>> +
>>> +    if (mon->outbuf && mon->outbuf->len) {
>>> +        fputs("SIGABRT received: ", stderr);
>>> +        fputs(mon->outbuf->str, stderr);
>>> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>>> +            fputc('\n', stderr);
>>> +        }
>>> +    }
>>> +
>>> +    qemu_mutex_unlock(&mon->mon_lock);
>>
>> The SIGABRT handling does not only fire in response to abort()
>> calls, but also in response to bad memory scenarios, so we have
>> to be careful what we do in signal handlers.
>>
>> In particular using mutexes in signal handlers is a big red
>> flag generally. Mutex APIs are not declare async signal
>> safe, so this code is technically a POSIX compliance
>> violation.
> 
> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.
> 
>> So I think we'd be safer just eliminating the explicit abort()
>> calls and adding monitor_flush call to error_report.
> 
> I like adding a handler because it is future proof.  No need to play whack-a-mole when
> developers re-introduce abort() calls in the future.  A minor benefit is I would not
> need ack's from 50 maintainers to change 50 call sites from abort to exit.  
> 
> A slight risk of the exit solution is that something bad happened at the call site, so 
> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
> exit which will call the existing atexit handlers and could have side effects.
> 
> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
> 
> So: atexit, signal handler, or qemu_abort?  I will go with your preference.

If I go with signal handler, I would also add atexit to flush the existing call sites of
error_report() + exit().

One more tidbit: the signal handler would print pending messages if any of the 11000 assert()
calls fails, though having a message present is less likely in this case.

- Steve

>>> +}
>>> +
>>> +static void monitor_add_abort_handler(void)
>>> +{
>>> +    struct sigaction act;
>>> +
>>> +    memset(&act, 0, sizeof(act));
>>> +    act.sa_sigaction = monitor_abort;
>>> +    act.sa_flags = SA_SIGINFO;
>>> +    sigaction(SIGABRT,  &act, NULL);
>>> +}
>>> +
>>> +#else
>>> +
>>> +static void monitor_add_abort_handler(void) {}
>>> +
>>> +#endif
>>> +
>>>  static void monitor_qapi_event_init(void)
>>>  {
>>>      monitor_qapi_event_state = g_hash_table_new(qapi_event_throttle_hash,
>>> @@ -712,6 +749,7 @@ void monitor_init_globals(void)
>>>      monitor_qapi_event_init();
>>>      qemu_mutex_init(&monitor_lock);
>>>      coroutine_mon = g_hash_table_new(NULL, NULL);
>>> +    monitor_add_abort_handler();
>>>  
>>>      /*
>>>       * The dispatcher BH must run in the main loop thread, since we
>>> -- 
>>> 1.8.3.1
>>>
>>>
>>
>> With regards,
>> Daniel
Daniel P. Berrangé Nov. 6, 2023, 10:10 a.m. UTC | #4
On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
> > On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
> >> Buffered monitor output is lost when abort() is called.  The pattern
> >> error_report() followed by abort() occurs about 60 times, so valuable
> >> information is being lost when the abort is called in the context of a
> >> monitor command.
> > 
> > I'm curious, was there a particular abort() scenario that you hit ?
> 
> Yes, while tweaking the suspended state, and forgetting to add transitions:
> 
>         error_report("invalid runstate transition: '%s' -> '%s'",
>         abort();
> 
> But I have previously hit this for other errors.
> 
> > For some crude statistics:
> > 
> >   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
> >   abort: 47
> >   return: 512
> >   exit: 458
> >   goto: 177
> > 
> > to me those numbers say that calling "abort()" after error_report
> > should be considered a bug, and we can blanket replace all the
> > abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
> > special case flushing the monitor.
> 
> And presumably add an atexit handler to flush the monitor ala monitor_abort.
> AFAICT currently no destructor is called for the monitor at exit time.

The HMP monitor flushes at each newline,  and exit() will take care of
flushing stdout, so I don't think there's anything else needed.

> > Also I think there's a decent case to be made for error_report()
> > to call monitor_flush().
> 
> A good start, but that would not help for monitors with skip_flush=true, which 
> need to format the buffered string in a json response, which is the case I 
> tripped over.

'skip_flush' is only set to 'true' when using a QMP monitor and invoking
"hmp-monitor-command".

In such a case, the error message needs to be built into a JSON error
reply and sent over the socket. Your patch doesn't help this case
since you've just printed to stderr.  I don't think it is reasonable
to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't
think the skip_flush=true scenario is a problem to be concerned with.

> >> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 38 insertions(+)
> >>
> >> diff --git a/monitor/monitor.c b/monitor/monitor.c
> >> index dc352f9..65dace0 100644
> >> --- a/monitor/monitor.c
> >> +++ b/monitor/monitor.c
> >> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
> >>      }
> >>  }
> >>  
> >> +#ifdef CONFIG_LINUX
> >> +
> >> +static void monitor_abort(int signal, siginfo_t *info, void *c)
> >> +{
> >> +    Monitor *mon = monitor_cur();
> >> +
> >> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (mon->outbuf && mon->outbuf->len) {
> >> +        fputs("SIGABRT received: ", stderr);
> >> +        fputs(mon->outbuf->str, stderr);
> >> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
> >> +            fputc('\n', stderr);
> >> +        }
> >> +    }
> >> +
> >> +    qemu_mutex_unlock(&mon->mon_lock);
> > 
> > The SIGABRT handling does not only fire in response to abort()
> > calls, but also in response to bad memory scenarios, so we have
> > to be careful what we do in signal handlers.
> > 
> > In particular using mutexes in signal handlers is a big red
> > flag generally. Mutex APIs are not declare async signal
> > safe, so this code is technically a POSIX compliance
> > violation.
> 
> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.

This is still doomed, because SIGABRT could fire while 'mon_lock' is
already held, and so this code would deadlock trying to acquire the
lock.

> > So I think we'd be safer just eliminating the explicit abort()
> > calls and adding monitor_flush call to error_report.
> 
> I like adding a handler because it is future proof.  No need to play whack-a-mole when
> developers re-introduce abort() calls in the future.  A minor benefit is I would not
> need ack's from 50 maintainers to change 50 call sites from abort to exit.

That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
subsystems, and we don't require explicit acks from every subsystem maintainer
for trivial cleanups like this.

> A slight risk of the exit solution is that something bad happened at the call site, so 
> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
> exit which will call the existing atexit handlers and could have side effects.

If that was a real problem, then we already face it because we have
~500 places already calling exit() and only 50 calling abort().

> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
> 
> So: atexit, signal handler, or qemu_abort?  I will go with your preference.

Just replace abort -> exit.

I'm not seeing a need for an atexit handler on top.

With regards,
Daniel
Markus Armbruster Nov. 15, 2023, 8:41 a.m. UTC | #5
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>> > On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>> >> Buffered monitor output is lost when abort() is called.  The pattern
>> >> error_report() followed by abort() occurs about 60 times, so valuable
>> >> information is being lost when the abort is called in the context of a
>> >> monitor command.
>> > 
>> > I'm curious, was there a particular abort() scenario that you hit ?
>> 
>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>> 
>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>         abort();
>> 
>> But I have previously hit this for other errors.

Can you provide a reproducer?

>> > For some crude statistics:
>> > 
>> >   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>> >   abort: 47
>> >   return: 512
>> >   exit: 458
>> >   goto: 177
>> > 
>> > to me those numbers say that calling "abort()" after error_report
>> > should be considered a bug, and we can blanket replace all the
>> > abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>> > special case flushing the monitor.
>> 
>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>> AFAICT currently no destructor is called for the monitor at exit time.
>
> The HMP monitor flushes at each newline,  and exit() will take care of
> flushing stdout, so I don't think there's anything else needed.

Correct.

>> > Also I think there's a decent case to be made for error_report()
>> > to call monitor_flush().

No, because the messages printed by error_report() all end in newline,
and printing a newline to a monitor triggers monitor_flush_locked().

>> A good start, but that would not help for monitors with skip_flush=true, which 
>> need to format the buffered string in a json response, which is the case I 
>> tripped over.
>
> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
> "hmp-monitor-command".

Correct.

> In such a case, the error message needs to be built into a JSON error
> reply and sent over the socket. Your patch doesn't help this case
> since you've just printed to stderr.  I don't think it is reasonable
> to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't
> think the skip_flush=true scenario is a problem to be concerned with.
>
>> >> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>> >>
>> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> >> ---
>> >>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 38 insertions(+)
>> >>
>> >> diff --git a/monitor/monitor.c b/monitor/monitor.c
>> >> index dc352f9..65dace0 100644
>> >> --- a/monitor/monitor.c
>> >> +++ b/monitor/monitor.c
>> >> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>> >>      }
>> >>  }
>> >>  
>> >> +#ifdef CONFIG_LINUX
>> >> +
>> >> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>> >> +{
>> >> +    Monitor *mon = monitor_cur();
>> >> +
>> >> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    if (mon->outbuf && mon->outbuf->len) {
>> >> +        fputs("SIGABRT received: ", stderr);
>> >> +        fputs(mon->outbuf->str, stderr);
>> >> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>> >> +            fputc('\n', stderr);
>> >> +        }
>> >> +    }
>> >> +
>> >> +    qemu_mutex_unlock(&mon->mon_lock);
>> > 
>> > The SIGABRT handling does not only fire in response to abort()
>> > calls, but also in response to bad memory scenarios, so we have
>> > to be careful what we do in signal handlers.
>> > 
>> > In particular using mutexes in signal handlers is a big red
>> > flag generally. Mutex APIs are not declare async signal
>> > safe, so this code is technically a POSIX compliance
>> > violation.

"Technically a POSIX compliance violation" sounds like something only
pedants would care about.  It's actually a recipe for deadlocks and
crashes.

>> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.
>
> This is still doomed, because SIGABRT could fire while 'mon_lock' is
> already held, and so this code would deadlock trying to acquire the
> lock.

Yup.

There is no way to make async signal unsafe code safe.

>> > So I think we'd be safer just eliminating the explicit abort()
>> > calls and adding monitor_flush call to error_report.
>> 
>> I like adding a handler because it is future proof.  No need to play whack-a-mole when
>> developers re-introduce abort() calls in the future.  A minor benefit is I would not
>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
>
> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
> subsystems, and we don't require explicit acks from every subsystem maintainer
> for trivial cleanups like this.
>
>> A slight risk of the exit solution is that something bad happened at the call site, so 
>> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
>> exit which will call the existing atexit handlers and could have side effects.
>
> If that was a real problem, then we already face it because we have
> ~500 places already calling exit() and only 50 calling abort().
>
>> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
>> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
>> 
>> So: atexit, signal handler, or qemu_abort?  I will go with your preference.
>
> Just replace abort -> exit.

I disagree.

Use exit(1) for fatal errors.

Use abort() for programming errors (a.k.a. bugs).

> I'm not seeing a need for an atexit handler on top.

I'm not yet seeing a need for anything.  A reproducer may change that.
Steve Sistare Nov. 15, 2023, 3:30 p.m. UTC | #6
On 11/6/2023 5:10 AM, Daniel P. Berrangé wrote:
> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>> information is being lost when the abort is called in the context of a
>>>> monitor command.
>>>
>>> I'm curious, was there a particular abort() scenario that you hit ?
>>
>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>
>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>         abort();
>>
>> But I have previously hit this for other errors.
>>
>>> For some crude statistics:
>>>
>>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>>>   abort: 47
>>>   return: 512
>>>   exit: 458
>>>   goto: 177
>>>
>>> to me those numbers say that calling "abort()" after error_report
>>> should be considered a bug, and we can blanket replace all the
>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>>> special case flushing the monitor.
>>
>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>> AFAICT currently no destructor is called for the monitor at exit time.
> 
> The HMP monitor flushes at each newline,  and exit() will take care of
> flushing stdout, so I don't think there's anything else needed.
> 
>>> Also I think there's a decent case to be made for error_report()
>>> to call monitor_flush().
>>
>> A good start, but that would not help for monitors with skip_flush=true, which 
>> need to format the buffered string in a json response, which is the case I 
>> tripped over.
> 
> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
> "hmp-monitor-command".

OK, that is narrower than I thought.  Now I see that other QMP monitors send 
error_report() to stderr, hence it is visible after abort and exit:

int error_vprintf(const char *fmt, va_list ap) {
    if (cur_mon && !monitor_cur_is_qmp())
        return monitor_vprintf(cur_mon, fmt, ap);
    return vfprintf(stderr, fmt, ap);                <-- HERE

That surprises me, I thought that would be returned to the monitor caller in the
json response. I guess the rationale is that the "main" error, if any, will be
set and returned by the err object that is passed down the stack during command
evaluation.

> In such a case, the error message needs to be built into a JSON error
> reply and sent over the socket. Your patch doesn't help this case
> since you've just printed to stderr.  

Same as vfprintf above!

> I don't think it is reasonable
> to expect QMP monitors to send replies on SIG_ABRT anyway. 

I agree.  My patch causes the error to be seen somewhere, anywhere, instead
of being dropped on the floor.

> So I don't
> think the skip_flush=true scenario is a problem to be concerned with.

It is indeed a narrow case, and not worth much effort or code change.
I'm inclined to drop it, but I appreciate the time you have spent reviewing it.

- Steve

>>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>>> index dc352f9..65dace0 100644
>>>> --- a/monitor/monitor.c
>>>> +++ b/monitor/monitor.c
>>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>>>      }
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_LINUX
>>>> +
>>>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>>>> +{
>>>> +    Monitor *mon = monitor_cur();
>>>> +
>>>> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (mon->outbuf && mon->outbuf->len) {
>>>> +        fputs("SIGABRT received: ", stderr);
>>>> +        fputs(mon->outbuf->str, stderr);
>>>> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>>>> +            fputc('\n', stderr);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    qemu_mutex_unlock(&mon->mon_lock);
>>>
>>> The SIGABRT handling does not only fire in response to abort()
>>> calls, but also in response to bad memory scenarios, so we have
>>> to be careful what we do in signal handlers.
>>>
>>> In particular using mutexes in signal handlers is a big red
>>> flag generally. Mutex APIs are not declare async signal
>>> safe, so this code is technically a POSIX compliance
>>> violation.
>>
>> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.
> 
> This is still doomed, because SIGABRT could fire while 'mon_lock' is
> already held, and so this code would deadlock trying to acquire the
> lock.
> 
>>> So I think we'd be safer just eliminating the explicit abort()
>>> calls and adding monitor_flush call to error_report.
>>
>> I like adding a handler because it is future proof.  No need to play whack-a-mole when
>> developers re-introduce abort() calls in the future.  A minor benefit is I would not
>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
> 
> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
> subsystems, and we don't require explicit acks from every subsystem maintainer
> for trivial cleanups like this.
> 
>> A slight risk of the exit solution is that something bad happened at the call site, so 
>> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
>> exit which will call the existing atexit handlers and could have side effects.
> 
> If that was a real problem, then we already face it because we have
> ~500 places already calling exit() and only 50 calling abort().
> 
>> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
>> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
>>
>> So: atexit, signal handler, or qemu_abort?  I will go with your preference.
> 
> Just replace abort -> exit.
> 
> I'm not seeing a need for an atexit handler on top.
> 
> With regards,
> Daniel
Steve Sistare Nov. 15, 2023, 3:52 p.m. UTC | #7
On 11/15/2023 3:41 AM, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>> information is being lost when the abort is called in the context of a
>>>>> monitor command.
>>>>
>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>
>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>
>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>         abort();
>>>
>>> But I have previously hit this for other errors.
> 
> Can you provide a reproducer?

I sometimes hit this when developing new code.  I do not have a reproducer for upstream
branches. The patch is aimed at helping developers, not users.

>>>> For some crude statistics:
>>>>
>>>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>>>>   abort: 47
>>>>   return: 512
>>>>   exit: 458
>>>>   goto: 177
>>>>
>>>> to me those numbers say that calling "abort()" after error_report
>>>> should be considered a bug, and we can blanket replace all the
>>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>>>> special case flushing the monitor.
>>>
>>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>>> AFAICT currently no destructor is called for the monitor at exit time.
>>
>> The HMP monitor flushes at each newline,  and exit() will take care of
>> flushing stdout, so I don't think there's anything else needed.
> 
> Correct.
> 
>>>> Also I think there's a decent case to be made for error_report()
>>>> to call monitor_flush().
> 
> No, because the messages printed by error_report() all end in newline,
> and printing a newline to a monitor triggers monitor_flush_locked().
> 
>>> A good start, but that would not help for monitors with skip_flush=true, which 
>>> need to format the buffered string in a json response, which is the case I 
>>> tripped over.
>>
>> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
>> "hmp-monitor-command".
> 
> Correct.
> 
>> In such a case, the error message needs to be built into a JSON error
>> reply and sent over the socket. Your patch doesn't help this case
>> since you've just printed to stderr.  I don't think it is reasonable
>> to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't
>> think the skip_flush=true scenario is a problem to be concerned with.
>>
>>>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 38 insertions(+)
>>>>>
>>>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>>>> index dc352f9..65dace0 100644
>>>>> --- a/monitor/monitor.c
>>>>> +++ b/monitor/monitor.c
>>>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +#ifdef CONFIG_LINUX
>>>>> +
>>>>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>>>>> +{
>>>>> +    Monitor *mon = monitor_cur();
>>>>> +
>>>>> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    if (mon->outbuf && mon->outbuf->len) {
>>>>> +        fputs("SIGABRT received: ", stderr);
>>>>> +        fputs(mon->outbuf->str, stderr);
>>>>> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>>>>> +            fputc('\n', stderr);
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    qemu_mutex_unlock(&mon->mon_lock);
>>>>
>>>> The SIGABRT handling does not only fire in response to abort()
>>>> calls, but also in response to bad memory scenarios, so we have
>>>> to be careful what we do in signal handlers.
>>>>
>>>> In particular using mutexes in signal handlers is a big red
>>>> flag generally. Mutex APIs are not declare async signal
>>>> safe, so this code is technically a POSIX compliance
>>>> violation.
> 
> "Technically a POSIX compliance violation" sounds like something only
> pedants would care about.  It's actually a recipe for deadlocks and
> crashes.
> 
>>> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.
>>
>> This is still doomed, because SIGABRT could fire while 'mon_lock' is
>> already held, and so this code would deadlock trying to acquire the
>> lock.
> 
> Yup.
> 
> There is no way to make async signal unsafe code safe.

The handler calls trylock, not lock.  If it cannot get the lock, it bails.

However, I suppose pthread_mutex_trylock could in theory take and briefly hold
some internal lock as part of its implementation.

- Steve

>>>> So I think we'd be safer just eliminating the explicit abort()
>>>> calls and adding monitor_flush call to error_report.
>>>
>>> I like adding a handler because it is future proof.  No need to play whack-a-mole when
>>> developers re-introduce abort() calls in the future.  A minor benefit is I would not
>>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
>>
>> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
>> subsystems, and we don't require explicit acks from every subsystem maintainer
>> for trivial cleanups like this.
>>
>>> A slight risk of the exit solution is that something bad happened at the call site, so 
>>> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
>>> exit which will call the existing atexit handlers and could have side effects.
>>
>> If that was a real problem, then we already face it because we have
>> ~500 places already calling exit() and only 50 calling abort().
>>
>>> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
>>> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
>>>
>>> So: atexit, signal handler, or qemu_abort?  I will go with your preference.
>>
>> Just replace abort -> exit.
> 
> I disagree.
> 
> Use exit(1) for fatal errors.
> 
> Use abort() for programming errors (a.k.a. bugs).
> 
>> I'm not seeing a need for an atexit handler on top.
> 
> I'm not yet seeing a need for anything.  A reproducer may change that.
>
Markus Armbruster Nov. 15, 2023, 4:05 p.m. UTC | #8
Steven Sistare <steven.sistare@oracle.com> writes:

> On 11/6/2023 5:10 AM, Daniel P. Berrangé wrote:
>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>> information is being lost when the abort is called in the context of a
>>>>> monitor command.
>>>>
>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>
>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>
>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>         abort();
>>>
>>> But I have previously hit this for other errors.
>>>
>>>> For some crude statistics:
>>>>
>>>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>>>>   abort: 47
>>>>   return: 512
>>>>   exit: 458
>>>>   goto: 177
>>>>
>>>> to me those numbers say that calling "abort()" after error_report
>>>> should be considered a bug, and we can blanket replace all the
>>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>>>> special case flushing the monitor.
>>>
>>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>>> AFAICT currently no destructor is called for the monitor at exit time.
>> 
>> The HMP monitor flushes at each newline,  and exit() will take care of
>> flushing stdout, so I don't think there's anything else needed.
>> 
>>>> Also I think there's a decent case to be made for error_report()
>>>> to call monitor_flush().
>>>
>>> A good start, but that would not help for monitors with skip_flush=true, which 
>>> need to format the buffered string in a json response, which is the case I 
>>> tripped over.
>> 
>> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
>> "hmp-monitor-command".
>
> OK, that is narrower than I thought.  Now I see that other QMP monitors send 
> error_report() to stderr, hence it is visible after abort and exit:
>
> int error_vprintf(const char *fmt, va_list ap) {
>     if (cur_mon && !monitor_cur_is_qmp())
>         return monitor_vprintf(cur_mon, fmt, ap);
>     return vfprintf(stderr, fmt, ap);                <-- HERE
>
> That surprises me, I thought that would be returned to the monitor caller in the
> json response. I guess the rationale is that the "main" error, if any, will be
> set and returned by the err object that is passed down the stack during command
> evaluation.

Three cases:

1. !cur_mon

   Not executing a monitor command.  We want to report errors etc to
   stderr.

2. cur_mon && !monitor_cur_is_qmp()

   Executing an HMP command.  We want to report errors to the current
   monitor.

2. cur_mon && monitor_cur_is_qmp()

   Executing a QMP command.  What we want is less obvious.

   Somewhere up the call stack is the QMP command's handler function.
   It takes an Error **errp argument.

   Within such a function, any errors need to be passed up the call
   chain into that argument.  Reporting them with error_report() is
   *wrong*.  Reporting must be left to the function's caller.

   A QMP command handler returns it output, it doesn't print it.  So
   calling monitor_printf() is wrong, too.

   But what about warn_report()?  Is that wrong, too?  We decided it's
   not, mostly because we have nothing else to offer.

   The stupidest way to keep it useful in QMP command context is to have
   error_vprintf() print to stderr.  So that's what it does.

   We could instead accumulate error_vprintf() output in a buffer, and
   include it with the QMP reply.  However, it's not clear what a
   management application could do with it.  So we stick to stupid.

[...]
Markus Armbruster Nov. 15, 2023, 4:15 p.m. UTC | #9
Steven Sistare <steven.sistare@oracle.com> writes:

> On 11/15/2023 3:41 AM, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>>> information is being lost when the abort is called in the context of a
>>>>>> monitor command.
>>>>>
>>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>>
>>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>>
>>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>>         abort();
>>>>
>>>> But I have previously hit this for other errors.
>> 
>> Can you provide a reproducer?
>
> I sometimes hit this when developing new code.  I do not have a reproducer for upstream
> branches. The patch is aimed at helping developers, not users.

I'm asking because I can't see how the error message could be lost.  A
reproducer would let me find out.  "Apply this set of broken patches,
then do that" would serve.

[...]
Steve Sistare Nov. 15, 2023, 5:41 p.m. UTC | #10
On 11/15/2023 11:15 AM, Markus Armbruster wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 11/15/2023 3:41 AM, Markus Armbruster wrote:
>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>
>>>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>>>> information is being lost when the abort is called in the context of a
>>>>>>> monitor command.
>>>>>>
>>>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>>>
>>>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>>>
>>>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>>>         abort();
>>>>>
>>>>> But I have previously hit this for other errors.
>>>
>>> Can you provide a reproducer?
>>
>> I sometimes hit this when developing new code.  I do not have a reproducer for upstream
>> branches. The patch is aimed at helping developers, not users.
> 
> I'm asking because I can't see how the error message could be lost.  A
> reproducer would let me find out.  "Apply this set of broken patches,
> then do that" would serve.

$ patch -p1 << EOF
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index b0f948d..c9a3aee 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -47,8 +47,12 @@ void qmp_quit(Error **errp)
     qemu_system_shutdown_request(SHUTDOWN_CAUSE_HOST_QMP_QUIT);
 }

+#include "qemu/error-report.h"
+
 void qmp_stop(Error **errp)
 {
+    error_report("injected failure");
+    abort();
     /* if there is a dump in background, we should wait until the dump
      * finished */
     if (qemu_system_dump_in_progress()) {
EOF

# This example loses the error message:

$ args='-display none -chardev socket,id=mon1,server=on,path=mon1.sock,wait=off -mon mon1,mode=control'
$ qemu-system-x86_64 $args < /dev/null &
[1] 18048
$ echo '{"execute":"qmp_capabilities"} {"execute":"human-monitor-command","arguments":{"command-line":"stop"}}' | ncat -U mon1.sock
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
{"return": {}}
Ncat: Connection reset by peer.
$
[1]+  Aborted                 qemu-system-x86_64 $args < /dev/null


# This example preserves the error message. I include it to show the ncat-based test is valid.

$ qemu-system-x86_64 $args < /dev/null &
[1] 18060
$ echo '{"execute":"qmp_capabilities"} {"execute":"stop"}' | ncat -U mon1.sock
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
{"return": {}}
injected failure                <============= qemu stderr
Ncat: Connection reset by peer.
$
[1]+  Aborted                 qemu-system-x86_64 $args < /dev/null

- Steve
Daniel P. Berrangé Nov. 16, 2023, 11:21 a.m. UTC | #11
On Wed, Nov 15, 2023 at 10:30:29AM -0500, Steven Sistare wrote:
> 
> 
> On 11/6/2023 5:10 AM, Daniel P. Berrangé wrote:
> > On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
> >> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
> >>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
> >>>> Buffered monitor output is lost when abort() is called.  The pattern
> >>>> error_report() followed by abort() occurs about 60 times, so valuable
> >>>> information is being lost when the abort is called in the context of a
> >>>> monitor command.
> >>>
> >>> I'm curious, was there a particular abort() scenario that you hit ?
> >>
> >> Yes, while tweaking the suspended state, and forgetting to add transitions:
> >>
> >>         error_report("invalid runstate transition: '%s' -> '%s'",
> >>         abort();
> >>
> >> But I have previously hit this for other errors.
> >>
> >>> For some crude statistics:
> >>>
> >>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
> >>>   abort: 47
> >>>   return: 512
> >>>   exit: 458
> >>>   goto: 177
> >>>
> >>> to me those numbers say that calling "abort()" after error_report
> >>> should be considered a bug, and we can blanket replace all the
> >>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
> >>> special case flushing the monitor.
> >>
> >> And presumably add an atexit handler to flush the monitor ala monitor_abort.
> >> AFAICT currently no destructor is called for the monitor at exit time.
> > 
> > The HMP monitor flushes at each newline,  and exit() will take care of
> > flushing stdout, so I don't think there's anything else needed.
> > 
> >>> Also I think there's a decent case to be made for error_report()
> >>> to call monitor_flush().
> >>
> >> A good start, but that would not help for monitors with skip_flush=true, which 
> >> need to format the buffered string in a json response, which is the case I 
> >> tripped over.
> > 
> > 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
> > "hmp-monitor-command".
> 
> OK, that is narrower than I thought.  Now I see that other QMP monitors send 
> error_report() to stderr, hence it is visible after abort and exit:
> 
> int error_vprintf(const char *fmt, va_list ap) {
>     if (cur_mon && !monitor_cur_is_qmp())
>         return monitor_vprintf(cur_mon, fmt, ap);
>     return vfprintf(stderr, fmt, ap);                <-- HERE
> 
> That surprises me, I thought that would be returned to the monitor caller in the
> json response. I guess the rationale is that the "main" error, if any, will be
> set and returned by the err object that is passed down the stack during command
> evaluation.
> 
> > In such a case, the error message needs to be built into a JSON error
> > reply and sent over the socket. Your patch doesn't help this case
> > since you've just printed to stderr.  
> 
> Same as vfprintf above!
> 
> > I don't think it is reasonable
> > to expect QMP monitors to send replies on SIG_ABRT anyway. 
> 
> I agree.  My patch causes the error to be seen somewhere, anywhere, instead
> of being dropped on the floor.
> 
> > So I don't
> > think the skip_flush=true scenario is a problem to be concerned with.
> 
> It is indeed a narrow case, and not worth much effort or code change.
> I'm inclined to drop it, but I appreciate the time you have spent reviewing it.

If you know of scenarios that can trigger abort() as a result of
either QMP or HMP commands, then we still need to fix those. Nothing
that is reachable from the monitor commands should ever end up in
abort(), as a general rule.

With regards,
Daniel
Markus Armbruster Nov. 16, 2023, 3:14 p.m. UTC | #12
Steven Sistare <steven.sistare@oracle.com> writes:

> On 11/15/2023 11:15 AM, Markus Armbruster wrote:
>> Steven Sistare <steven.sistare@oracle.com> writes:
>>> On 11/15/2023 3:41 AM, Markus Armbruster wrote:
>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>
>>>>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>>>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>>>>> information is being lost when the abort is called in the context of a
>>>>>>>> monitor command.
>>>>>>>
>>>>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>>>>
>>>>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>>>>
>>>>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>>>>         abort();
>>>>>>
>>>>>> But I have previously hit this for other errors.
>>>>
>>>> Can you provide a reproducer?
>>>
>>> I sometimes hit this when developing new code.  I do not have a reproducer for upstream
>>> branches. The patch is aimed at helping developers, not users.
>> 
>> I'm asking because I can't see how the error message could be lost.  A
>> reproducer would let me find out.  "Apply this set of broken patches,
>> then do that" would serve.
>
> $ patch -p1 << EOF
> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
> index b0f948d..c9a3aee 100644
> --- a/monitor/qmp-cmds.c
> +++ b/monitor/qmp-cmds.c
> @@ -47,8 +47,12 @@ void qmp_quit(Error **errp)
>      qemu_system_shutdown_request(SHUTDOWN_CAUSE_HOST_QMP_QUIT);
>  }
>
> +#include "qemu/error-report.h"
> +
>  void qmp_stop(Error **errp)
>  {
> +    error_report("injected failure");
> +    abort();
>      /* if there is a dump in background, we should wait until the dump
>       * finished */
>      if (qemu_system_dump_in_progress()) {
> EOF
>
> # This example loses the error message:
>
> $ args='-display none -chardev socket,id=mon1,server=on,path=mon1.sock,wait=off -mon mon1,mode=control'
> $ qemu-system-x86_64 $args < /dev/null &
> [1] 18048
> $ echo '{"execute":"qmp_capabilities"} {"execute":"human-monitor-command","arguments":{"command-line":"stop"}}' | ncat -U mon1.sock
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
> {"return": {}}
> Ncat: Connection reset by peer.
> $
> [1]+  Aborted                 qemu-system-x86_64 $args < /dev/null
>
>
> # This example preserves the error message. I include it to show the ncat-based test is valid.
>
> $ qemu-system-x86_64 $args < /dev/null &
> [1] 18060
> $ echo '{"execute":"qmp_capabilities"} {"execute":"stop"}' | ncat -U mon1.sock
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
> {"return": {}}
> injected failure                <============= qemu stderr
> Ncat: Connection reset by peer.
> $
> [1]+  Aborted                 qemu-system-x86_64 $args < /dev/null
>
> - Steve

Reproduced, thanks!

Confirms my reading of the code.  Two cases:

1. Normal monitor output

   Flushed on newline.  A partial last line can be lost on crash or
   exit().

   Works as intended.

2. Output being captured for QMP command human-monitor-command

   Never flushed.  Instead, the entire buffer is included in the
   command's success response.  Naturally there is no response on crash.

   Works as intended.

Here's how to fish unflushed output out of a core dump:

    (gdb) bt
    #0  0x00007efeba52fecc in __pthread_kill_implementation () at /lib64/libc.so.6
    #1  0x00007efeba4dfab6 in raise () at /lib64/libc.so.6
    #2  0x00007efeba4c97fc in abort () at /lib64/libc.so.6
    #3  0x000055588dcc848c in qmp_stop (errp=0x0) at ../monitor/qmp-cmds.c:53
    #4  0x000055588dcc36ab in hmp_stop (mon=0x7ffc01d47cc0, qdict=0x555891d19200)
        at ../monitor/hmp-cmds.c:119
    #5  0x000055588dcc714e in handle_hmp_command_exec
        (mon=0x7ffc01d47cc0, cmd=0x55588f0689d0 <hmp_cmds+7760>, qdict=0x555891d19200) at ../monitor/hmp.c:1106
    #6  0x000055588dcc737b in handle_hmp_command
        (mon=0x7ffc01d47cc0, cmdline=0x7efea80036f4 "") at ../monitor/hmp.c:1158
    #7  0x000055588dcc8922 in qmp_human_monitor_command
        (command_line=0x7efea80036f0 "stop", has_cpu_index=false, cpu_index=0, errp=0x7ffc01d47dd0) at ../monitor/qmp-cmds.c:184

    [...]

    (gdb) up 4
    #4  0x000055588dcc36ab in hmp_stop (mon=0x7ffc01d47cc0, qdict=0x555891d19200)
        at ../monitor/hmp-cmds.c:119
    119	    qmp_stop(NULL);
    (gdb) p mon->outbuf
    $1 = (GString *) 0x555890a65260
    (gdb) p *$
    $2 = {str = 0x7efea8002c00 "injected failure\r\n", len = 18, 
      allocated_len = 128}
Steve Sistare Nov. 16, 2023, 5:03 p.m. UTC | #13
On 11/16/2023 10:14 AM, Markus Armbruster wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 11/15/2023 11:15 AM, Markus Armbruster wrote:
>>> Steven Sistare <steven.sistare@oracle.com> writes:
>>>> On 11/15/2023 3:41 AM, Markus Armbruster wrote:
>>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>>
>>>>>> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>>>>>>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>>>>>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>>>>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>>>>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>>>>>>> information is being lost when the abort is called in the context of a
>>>>>>>>> monitor command.
>>>>>>>>
>>>>>>>> I'm curious, was there a particular abort() scenario that you hit ?
>>>>>>>
>>>>>>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>>>>>>
>>>>>>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>>>>>>         abort();
>>>>>>>
>>>>>>> But I have previously hit this for other errors.
>>>>>
>>>>> Can you provide a reproducer?
>>>>
>>>> I sometimes hit this when developing new code.  I do not have a reproducer for upstream
>>>> branches. The patch is aimed at helping developers, not users.
>>>
>>> I'm asking because I can't see how the error message could be lost.  A
>>> reproducer would let me find out.  "Apply this set of broken patches,
>>> then do that" would serve.
>>
>> $ patch -p1 << EOF
>> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
>> index b0f948d..c9a3aee 100644
>> --- a/monitor/qmp-cmds.c
>> +++ b/monitor/qmp-cmds.c
>> @@ -47,8 +47,12 @@ void qmp_quit(Error **errp)
>>      qemu_system_shutdown_request(SHUTDOWN_CAUSE_HOST_QMP_QUIT);
>>  }
>>
>> +#include "qemu/error-report.h"
>> +
>>  void qmp_stop(Error **errp)
>>  {
>> +    error_report("injected failure");
>> +    abort();
>>      /* if there is a dump in background, we should wait until the dump
>>       * finished */
>>      if (qemu_system_dump_in_progress()) {
>> EOF
>>
>> # This example loses the error message:
>>
>> $ args='-display none -chardev socket,id=mon1,server=on,path=mon1.sock,wait=off -mon mon1,mode=control'
>> $ qemu-system-x86_64 $args < /dev/null &
>> [1] 18048
>> $ echo '{"execute":"qmp_capabilities"} {"execute":"human-monitor-command","arguments":{"command-line":"stop"}}' | ncat -U mon1.sock
>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
>> {"return": {}}
>> Ncat: Connection reset by peer.
>> $
>> [1]+  Aborted                 qemu-system-x86_64 $args < /dev/null
>>
>>
>> # This example preserves the error message. I include it to show the ncat-based test is valid.
>>
>> $ qemu-system-x86_64 $args < /dev/null &
>> [1] 18060
>> $ echo '{"execute":"qmp_capabilities"} {"execute":"stop"}' | ncat -U mon1.sock
>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 8}, "package": "v8.1.0-2976-g4025fde-dirty"}, "capabilities": ["oob"]}}
>> {"return": {}}
>> injected failure                <============= qemu stderr
>> Ncat: Connection reset by peer.
>> $
>> [1]+  Aborted                 qemu-system-x86_64 $args < /dev/null
>>
>> - Steve
> 
> Reproduced, thanks!
> 
> Confirms my reading of the code.  Two cases:
> 
> 1. Normal monitor output
> 
>    Flushed on newline.  A partial last line can be lost on crash or
>    exit().
> 
>    Works as intended.
> 
> 2. Output being captured for QMP command human-monitor-command
> 
>    Never flushed.  Instead, the entire buffer is included in the
>    command's success response.  Naturally there is no response on crash.
> 
>    Works as intended.
> 
> Here's how to fish unflushed output out of a core dump:
> 
>     (gdb) bt
>     #0  0x00007efeba52fecc in __pthread_kill_implementation () at /lib64/libc.so.6
>     #1  0x00007efeba4dfab6 in raise () at /lib64/libc.so.6
>     #2  0x00007efeba4c97fc in abort () at /lib64/libc.so.6
>     #3  0x000055588dcc848c in qmp_stop (errp=0x0) at ../monitor/qmp-cmds.c:53
>     #4  0x000055588dcc36ab in hmp_stop (mon=0x7ffc01d47cc0, qdict=0x555891d19200)
>         at ../monitor/hmp-cmds.c:119
>     #5  0x000055588dcc714e in handle_hmp_command_exec
>         (mon=0x7ffc01d47cc0, cmd=0x55588f0689d0 <hmp_cmds+7760>, qdict=0x555891d19200) at ../monitor/hmp.c:1106
>     #6  0x000055588dcc737b in handle_hmp_command
>         (mon=0x7ffc01d47cc0, cmdline=0x7efea80036f4 "") at ../monitor/hmp.c:1158
>     #7  0x000055588dcc8922 in qmp_human_monitor_command
>         (command_line=0x7efea80036f0 "stop", has_cpu_index=false, cpu_index=0, errp=0x7ffc01d47dd0) at ../monitor/qmp-cmds.c:184
> 
>     [...]
> 
>     (gdb) up 4
>     #4  0x000055588dcc36ab in hmp_stop (mon=0x7ffc01d47cc0, qdict=0x555891d19200)
>         at ../monitor/hmp-cmds.c:119
>     119	    qmp_stop(NULL);
>     (gdb) p mon->outbuf
>     $1 = (GString *) 0x555890a65260
>     (gdb) p *$
>     $2 = {str = 0x7efea8002c00 "injected failure\r\n", len = 18, 
>       allocated_len = 128}

Yes. thanks.  With the patch, the computer does the work for us, and sometimes the message
is definitive enough to fix without further investigation, eg
  invalid runstate transition: 'suspended -> 'save-vm'

But, it's no big deal, and I withdraw the patch request.  Thanks for looking at it.

- Steve
diff mbox series

Patch

diff --git a/monitor/monitor.c b/monitor/monitor.c
index dc352f9..65dace0 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -701,6 +701,43 @@  void monitor_cleanup(void)
     }
 }
 
+#ifdef CONFIG_LINUX
+
+static void monitor_abort(int signal, siginfo_t *info, void *c)
+{
+    Monitor *mon = monitor_cur();
+
+    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
+        return;
+    }
+
+    if (mon->outbuf && mon->outbuf->len) {
+        fputs("SIGABRT received: ", stderr);
+        fputs(mon->outbuf->str, stderr);
+        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
+            fputc('\n', stderr);
+        }
+    }
+
+    qemu_mutex_unlock(&mon->mon_lock);
+}
+
+static void monitor_add_abort_handler(void)
+{
+    struct sigaction act;
+
+    memset(&act, 0, sizeof(act));
+    act.sa_sigaction = monitor_abort;
+    act.sa_flags = SA_SIGINFO;
+    sigaction(SIGABRT,  &act, NULL);
+}
+
+#else
+
+static void monitor_add_abort_handler(void) {}
+
+#endif
+
 static void monitor_qapi_event_init(void)
 {
     monitor_qapi_event_state = g_hash_table_new(qapi_event_throttle_hash,
@@ -712,6 +749,7 @@  void monitor_init_globals(void)
     monitor_qapi_event_init();
     qemu_mutex_init(&monitor_lock);
     coroutine_mon = g_hash_table_new(NULL, NULL);
+    monitor_add_abort_handler();
 
     /*
      * The dispatcher BH must run in the main loop thread, since we