diff mbox

[PULL,07/11] cpu-exec: don't overwrite exception_index

Message ID 004401d3894c$b3fc90f0$1bf5b2d0$@ru (mailing list archive)
State New, archived
Headers show

Commit Message

Pavel Dovgalyuk Jan. 9, 2018, 1:21 p.m. UTC
> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> On 20 November 2017 at 11:06, Peter Maydell <peter.maydell@linaro.org> wrote:
> > On 20 November 2017 at 10:25, Pavel Dovgalyuk <dovgaluk@ispras.ru> wrote:
> >>> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> >>> On 17 November 2017 at 20:26, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> > On 17/11/2017 21:07, Peter Maydell wrote:
> >>> >> Hi. This commit breaks booting of Debian on aarch64 virt board.
> >>> >> (repro instructions for creating the image available at:
> >>> >> https://translatedcode.wordpress.com/2017/07/24/installing-debian-on-qemus-64-bit-arm-
> virt-
> >>> board/)
> >>> >> The guest kernel never prints anything to the serial port.
> >>> >>
> >>> >> Reverting this commit fixes master for me, so I plan to do
> >>> >> that on Monday.
> >>> >
> >>> > Maybe you can also test moving the atomic_set inside the "if".  It does
> >>> > seem to be a genuine bugfix.
> >>>
> >>> No, that doesn't help: guest still sits there like a lemon.
> >>
> >> Maybe this is a more complex problem?
> >> I tried removing this if and aarch64 still does not work.
> >
> > Reverting the commit fixes it for me; I have that going through
> > build tests and will push the revert later today.
> 
> Revert pushed to git master.
> 
> More generally, this commit seems to assume that QEMU always
> does:
>  * set exception_index to something
>  * handle that
>  * clear exception_index to -1
> 
> but it's not clear to me that it's actually always the case
> that it gets cleared back to -1.

I tried to get some logs with the following code.
It prints that there was an exception 5 and it was overwritten by the standard code.
Fixed code prevents this overwrite.

I guess that one of the following is true:
 - unfixed version misses some exceptions
 - fixed version processes some exceptions twice (e.g., when there is no clear exception)


Pavel Dovgalyuk

Comments

Peter Maydell Jan. 9, 2018, 1:44 p.m. UTC | #1
On 9 January 2018 at 13:21, Pavel Dovgalyuk <dovgaluk@ispras.ru> wrote:
> I tried to get some logs with the following code.
> It prints that there was an exception 5 and it was overwritten by the standard code.
> Fixed code prevents this overwrite.
>
> I guess that one of the following is true:
>  - unfixed version misses some exceptions
>  - fixed version processes some exceptions twice (e.g., when there is no clear exception)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 280200f..fa810f7 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -605,6 +605,8 @@ static inline bool cpu_handle_interrupt(CPUState *cpu,
>      /* Finally, check if we need to exit to the main loop.  */
>      if (unlikely(atomic_read(&cpu->exit_request)
>          || (use_icount && cpu->icount_decr.u16.low + cpu->icount_extra == 0)))
> +        if (cpu->exception_index != -1 && cpu->exception_index != EXCP_INTERRUP
> +            qemu_log("overwriting excp_index %x\n", cpu->exception_index);
>          atomic_set(&cpu->exit_request, 0);
>          cpu->exception_index = EXCP_INTERRUPT;
>          return true;

This looks like it's just working around whatever is going on
(why should EXCP_INTERRUPT be special?). What we need to do is
find out what's actually happening here...

thanks
-- PMM
Pavel Dovgalyuk Jan. 10, 2018, 7:04 a.m. UTC | #2
> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> On 9 January 2018 at 13:21, Pavel Dovgalyuk <dovgaluk@ispras.ru> wrote:
> > I tried to get some logs with the following code.
> > It prints that there was an exception 5 and it was overwritten by the standard code.
> > Fixed code prevents this overwrite.
> >
> > I guess that one of the following is true:
> >  - unfixed version misses some exceptions
> >  - fixed version processes some exceptions twice (e.g., when there is no clear exception)
> >
> > diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> > index 280200f..fa810f7 100644
> > --- a/accel/tcg/cpu-exec.c
> > +++ b/accel/tcg/cpu-exec.c
> > @@ -605,6 +605,8 @@ static inline bool cpu_handle_interrupt(CPUState *cpu,
> >      /* Finally, check if we need to exit to the main loop.  */
> >      if (unlikely(atomic_read(&cpu->exit_request)
> >          || (use_icount && cpu->icount_decr.u16.low + cpu->icount_extra == 0)))
> > +        if (cpu->exception_index != -1 && cpu->exception_index != EXCP_INTERRUP
> > +            qemu_log("overwriting excp_index %x\n", cpu->exception_index);
> >          atomic_set(&cpu->exit_request, 0);
> >          cpu->exception_index = EXCP_INTERRUPT;
> >          return true;
> 
> This looks like it's just working around whatever is going on
> (why should EXCP_INTERRUPT be special?). What we need to do is
> find out what's actually happening here...

The failure cause is in incorrect interrupt processing.
When ARM processes hardware interrupt in arm_cpu_exec_interrupt(),
it executes cs->exception_index = excp_idx;

This assumes, that the exception will be processed later.
But it is processed immediately by calling cc->do_interrupt(cs);
instead of leaving this job to cpu_exec.

I guess these calls should be removed to match the cpu_exec execution pattern.

Pavel Dovgalyuk
Peter Maydell Jan. 10, 2018, 10:24 a.m. UTC | #3
On 10 January 2018 at 07:04, Pavel Dovgalyuk <dovgaluk@ispras.ru> wrote:
> The failure cause is in incorrect interrupt processing.
> When ARM processes hardware interrupt in arm_cpu_exec_interrupt(),
> it executes cs->exception_index = excp_idx;
>
> This assumes, that the exception will be processed later.
> But it is processed immediately by calling cc->do_interrupt(cs);
> instead of leaving this job to cpu_exec.

That seems fine to me. The code knows it needs to take
an interrupt, it has all the information it needs to do
it, it can just go ahead and call the code that takes
the interrupt. The comment in accel/tcg/cpu-exec.c says:

        /* The target hook has 3 exit conditions:
           False when the interrupt isn't processed,
           True when it is, and we should restart on a new TB,
           and via longjmp via cpu_loop_exit.  */

and here we have processed the interrupt and returned true.
We set exception_index because that's how you tell the
do_interrupt hook which interrupt to deal with.

That is, the pattern that the arm target code assumes for
cs->exception_index is "you don't set this unless you're
about to call do_interrupt; if you do set it then definitely
call do_interrupt and don't do anything much in between".
In that view of the world there's no need to reset it or
check it because nothing is permitted to happen between
"set value" and "call do_interrupt". This is in line with
the way we handle other arm-specific bits of information
associated with the exception, like env->exception.target_el.
Having a long gap between "set value" and "do_interrupt"
is worrying because it means that maybe we might end
up doing something else in that gap that corrupts the
various bits of information associated with the exception,
or something that's not architecturally permitted to
happen at that point.

thanks
-- PMM
Pavel Dovgalyuk Jan. 10, 2018, 10:43 a.m. UTC | #4
> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> On 10 January 2018 at 07:04, Pavel Dovgalyuk <dovgaluk@ispras.ru> wrote:
> > The failure cause is in incorrect interrupt processing.
> > When ARM processes hardware interrupt in arm_cpu_exec_interrupt(),
> > it executes cs->exception_index = excp_idx;
> >
> > This assumes, that the exception will be processed later.
> > But it is processed immediately by calling cc->do_interrupt(cs);
> > instead of leaving this job to cpu_exec.
> 
> That seems fine to me. The code knows it needs to take
> an interrupt, it has all the information it needs to do
> it, it can just go ahead and call the code that takes
> the interrupt. The comment in accel/tcg/cpu-exec.c says:
> 
>         /* The target hook has 3 exit conditions:
>            False when the interrupt isn't processed,
>            True when it is, and we should restart on a new TB,
>            and via longjmp via cpu_loop_exit.  */
> 
> and here we have processed the interrupt and returned true.
> We set exception_index because that's how you tell the
> do_interrupt hook which interrupt to deal with.
> 
> That is, the pattern that the arm target code assumes for
> cs->exception_index is "you don't set this unless you're
> about to call do_interrupt; if you do set it then definitely
> call do_interrupt and don't do anything much in between".
> In that view of the world there's no need to reset it or
> check it because nothing is permitted to happen between
> "set value" and "call do_interrupt". This is in line with
> the way we handle other arm-specific bits of information
> associated with the exception, like env->exception.target_el.
> Having a long gap between "set value" and "do_interrupt"
> is worrying because it means that maybe we might end
> up doing something else in that gap that corrupts the
> various bits of information associated with the exception,
> or something that's not architecturally permitted to
> happen at that point.

I see.
I found the same pattern in other targets.
But only MIPS resets exception_index after processing
the interrupt. Others do not bother.
Then the following change of cpu_exec should be correct?

            if (cc->cpu_exec_interrupt(cpu, interrupt_request)) {
                replay_interrupt();
+               cpu->exception_index = -1;
                *last_tb = NULL;
            }

Pavel Dovgalyuk
diff mbox

Patch

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 280200f..fa810f7 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -605,6 +605,8 @@  static inline bool cpu_handle_interrupt(CPUState *cpu,
     /* Finally, check if we need to exit to the main loop.  */
     if (unlikely(atomic_read(&cpu->exit_request)
         || (use_icount && cpu->icount_decr.u16.low + cpu->icount_extra == 0))) 
+        if (cpu->exception_index != -1 && cpu->exception_index != EXCP_INTERRUP
+            qemu_log("overwriting excp_index %x\n", cpu->exception_index);
         atomic_set(&cpu->exit_request, 0);
         cpu->exception_index = EXCP_INTERRUPT;
         return true;