diff mbox series

accel/tcg: Complete cpu initialization before registration

Message ID 20221031054105.3552-1-richard.henderson@linaro.org (mailing list archive)
State New, archived
Headers show
Series accel/tcg: Complete cpu initialization before registration | expand

Commit Message

Richard Henderson Oct. 31, 2022, 5:41 a.m. UTC
Delay cpu_list_add until realize is complete, so that cross-cpu
interaction does not happen with incomplete cpu state.  For this,
we must delay plugin initialization out of tcg_exec_realizefn,
because no cpu_index has been assigned.

Fixes a problem with cross-cpu jump cache flushing, when the
jump cache has not yet been allocated.

Fixes: a976a99a2975 ("include/hw/core: Create struct CPUJumpCache")
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cpu-exec.c      |  8 +++++---
 accel/tcg/translate-all.c | 16 +++++++---------
 cpu.c                     | 10 +++++++++-
 3 files changed, 21 insertions(+), 13 deletions(-)

Comments

Ilya Leoshkevich Oct. 31, 2022, 11:07 a.m. UTC | #1
On Mon, Oct 31, 2022 at 04:41:05PM +1100, Richard Henderson wrote:
> Delay cpu_list_add until realize is complete, so that cross-cpu
> interaction does not happen with incomplete cpu state.  For this,
> we must delay plugin initialization out of tcg_exec_realizefn,
> because no cpu_index has been assigned.
> 
> Fixes a problem with cross-cpu jump cache flushing, when the
> jump cache has not yet been allocated.
> 
> Fixes: a976a99a2975 ("include/hw/core: Create struct CPUJumpCache")
> Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/cpu-exec.c      |  8 +++++---
>  accel/tcg/translate-all.c | 16 +++++++---------
>  cpu.c                     | 10 +++++++++-
>  3 files changed, 21 insertions(+), 13 deletions(-)

This fixes the issue, thanks!
The code looks correct to me as well.
I have one question though, see below.

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>

> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 82b06c1824..356fe348de 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -1052,23 +1052,25 @@ void tcg_exec_realizefn(CPUState *cpu, Error **errp)
>          cc->tcg_ops->initialize();
>          tcg_target_initialized = true;
>      }
> -    tlb_init(cpu);
> -    qemu_plugin_vcpu_init_hook(cpu);
>  
> +    cpu->tb_jmp_cache = g_new0(CPUJumpCache, 1);
> +    tlb_init(cpu);
>  #ifndef CONFIG_USER_ONLY
>      tcg_iommu_init_notifier_list(cpu);
>  #endif /* !CONFIG_USER_ONLY */
> +    /* qemu_plugin_vcpu_init_hook delayed until cpu_index assigned. */
>  }
>  
>  /* undo the initializations in reverse order */
>  void tcg_exec_unrealizefn(CPUState *cpu)
>  {
> +    qemu_plugin_vcpu_exit_hook(cpu);
>  #ifndef CONFIG_USER_ONLY
>      tcg_iommu_free_notifier_list(cpu);
>  #endif /* !CONFIG_USER_ONLY */
>  
> -    qemu_plugin_vcpu_exit_hook(cpu);
>      tlb_destroy(cpu);
> +    g_free(cpu->tb_jmp_cache);
>  }
>  
>  #ifndef CONFIG_USER_ONLY
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 0089578f8f..921944a5ab 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1580,15 +1580,13 @@ void tcg_flush_jmp_cache(CPUState *cpu)
>  {
>      CPUJumpCache *jc = cpu->tb_jmp_cache;
>  
> -    if (likely(jc)) {
> -        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
> -            qatomic_set(&jc->array[i].tb, NULL);
> -        }
> -    } else {
> -        /* This should happen once during realize, and thus never race. */
> -        jc = g_new0(CPUJumpCache, 1);
> -        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
> -        assert(jc == NULL);
> +    /* During early initialization, the cache may not yet be allocated. */
> +    if (unlikely(jc == NULL)) {
> +        return;
> +    }

We can hit this condition in qemu-system with the following call
chain:

    tcg_flush_jmp_cache
    tlb_flush_by_mmuidx_async_work
    listener_add_address_space
    memory_listener_register
    cpu_address_space_init
    tcg_cpu_realizefn
    cpu_exec_realizefn
    x86_cpu_realizefn

I'm wondering if we can avoid having to think of early initialization
when dealing with tb_jmp_cache by initializing it as early as possible?
I don't think swapping accel_cpu_realizefn() and tcg_exec_realizefn()
is going to work (though I haven't tried it), but what about splitting
tcg_exec_realizefn() in two and calling the half that initializes
tb_jmp_cache before accel_cpu_realizefn()?

> +
> +    for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
> +        qatomic_set(&jc->array[i].tb, NULL);
>      }
>  }
>  
> diff --git a/cpu.c b/cpu.c
> index 2a09b05205..4a7d865427 100644
> --- a/cpu.c
> +++ b/cpu.c
> @@ -134,15 +134,23 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
>      /* cache the cpu class for the hotpath */
>      cpu->cc = CPU_GET_CLASS(cpu);
>  
> -    cpu_list_add(cpu);
>      if (!accel_cpu_realizefn(cpu, errp)) {
>          return;
>      }
> +
>      /* NB: errp parameter is unused currently */
>      if (tcg_enabled()) {
>          tcg_exec_realizefn(cpu, errp);
>      }
>  
> +    /* Wait until cpu initialization complete before exposing cpu. */
> +    cpu_list_add(cpu);
> +
> +    /* Plugin initialization must wait until cpu_index assigned. */
> +    if (tcg_enabled()) {
> +        qemu_plugin_vcpu_init_hook(cpu);
> +    }
> +
>  #ifdef CONFIG_USER_ONLY
>      assert(qdev_get_vmsd(DEVICE(cpu)) == NULL ||
>             qdev_get_vmsd(DEVICE(cpu))->unmigratable);
> -- 
> 2.34.1
> 
>
Richard Henderson Oct. 31, 2022, 9 p.m. UTC | #2
On 10/31/22 22:07, Ilya Leoshkevich wrote:
>> @@ -1580,15 +1580,13 @@ void tcg_flush_jmp_cache(CPUState *cpu)
>>   {
>>       CPUJumpCache *jc = cpu->tb_jmp_cache;
>>   
>> -    if (likely(jc)) {
>> -        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
>> -            qatomic_set(&jc->array[i].tb, NULL);
>> -        }
>> -    } else {
>> -        /* This should happen once during realize, and thus never race. */
>> -        jc = g_new0(CPUJumpCache, 1);
>> -        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
>> -        assert(jc == NULL);
>> +    /* During early initialization, the cache may not yet be allocated. */
>> +    if (unlikely(jc == NULL)) {
>> +        return;
>> +    }
> 
> We can hit this condition in qemu-system with the following call
> chain:
> 
>      tcg_flush_jmp_cache
>      tlb_flush_by_mmuidx_async_work
>      listener_add_address_space
>      memory_listener_register
>      cpu_address_space_init
>      tcg_cpu_realizefn
>      cpu_exec_realizefn
>      x86_cpu_realizefn
> 
> I'm wondering if we can avoid having to think of early initialization
> when dealing with tb_jmp_cache by initializing it as early as possible?
> I don't think swapping accel_cpu_realizefn() and tcg_exec_realizefn()
> is going to work (though I haven't tried it), but what about splitting
> tcg_exec_realizefn() in two and calling the half that initializes
> tb_jmp_cache before accel_cpu_realizefn()?

I thought about that, but couldn't bring myself to split out a third tcg piece of init.


r~
Eric Auger Feb. 1, 2023, 2:20 p.m. UTC | #3
Hi Richard,

Commit 4e4fa6c12d ("accel/tcg: Complete cpu initialization before
registration") seems to cause a regression on one kvm unit test:

FAIL debug-wp-migration (terminated on SIGSEGV)

This can be reproduced with upstream kernel, qemu and kvm unit test.

Seems the change in accel/tcg/translate-all.c is the cause of the SIGSEGV
(the removal of the allocation of jc (CPUJumpCache)).

If I restore
    if (unlikely(jc == NULL)) {
        jc = g_new0(CPUJumpCache, 1);
        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
        assert(jc == NULL);
        return;
    }
I don't get the crash anymore.

What I fail to understand is why this code is called with a kvm
accelerated qemu (the test runs by default with kvm).


#0  0x000002aaab41ee94 in tcg_flush_jmp_cache (cpu=cpu@entry=0x2aaac391910)
at ../accel/tcg/translate-all.c:1581
#1  0x000002aaab423458 in tlb_flush_by_mmuidx_async_work
(cpu=0x2aaac391910, data=...)
at ../accel/tcg/cputlb.c:360
#2  0x000002aaaae0b1d0 in process_queued_cpu_work
(cpu=cpu@entry=0x2aaac391910) at ../cpus-common.c:351
<augere> (cpu=cpu@entry=0x2aaac391910)

#0  0x000002aaab423658 in tlb_flush_by_mmuidx (cpu=0x2aaac391910,
idxmap=4095)
at ../accel/tcg/cputlb.c:377
#1  0x000002aaab4236e8 in tlb_flush (cpu=cpu@entry=0x2aaac391910) at
../accel/tcg/cputlb.c:391
#2  0x000002aaab1500f0 in vmsa_ttbr_write
(env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
../target/arm/helper.c:3784
#3  0x000002aaab14e5a8 in write_raw_cp_reg
(env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
v=v@entry=2154950976315703518)
at ../target/arm/helper.c:96
#4  0x000002aaab153f1c in write_list_to_cpustate
(cpu=cpu@entry=0x2aaac391910)
at ../target/arm/helper.c:191
#5  0x000002aaab20f24c in kvm_arm_reset_vcpu
(cpu=cpu@entry=0x2aaac391910) at ../target/arm/kvm.c:634
#6  0x000002aaab147cbc in arm_cpu_reset (dev=0x2aaac391910) at
../target/arm/cpu.c:522

Thanks

Eric
Richard Henderson Feb. 1, 2023, 8:37 p.m. UTC | #4
On 2/1/23 04:20, Eric Auger wrote:
> What I fail to understand is why this code is called with a kvm
> accelerated qemu (the test runs by default with kvm).
...
> #2  0x000002aaab1500f0 in vmsa_ttbr_write
> (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> ../target/arm/helper.c:3784
> #3  0x000002aaab14e5a8 in write_raw_cp_reg
> (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> v=v@entry=2154950976315703518)

This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn, 
not the .raw_writefn.  We're not supposed to arrive here at all.


r~
Peter Maydell Feb. 2, 2023, 10:53 a.m. UTC | #5
On Wed, 1 Feb 2023 at 20:37, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/1/23 04:20, Eric Auger wrote:
> > What I fail to understand is why this code is called with a kvm
> > accelerated qemu (the test runs by default with kvm).
> ...
> > #2  0x000002aaab1500f0 in vmsa_ttbr_write
> > (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> > ../target/arm/helper.c:3784
> > #3  0x000002aaab14e5a8 in write_raw_cp_reg
> > (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> > v=v@entry=2154950976315703518)
>
> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
> not the .raw_writefn.  We're not supposed to arrive here at all.

If you only provide a cooked .writefn and no .raw_writefn,
the default is to assume that the cooked function will also
work as the raw one. None of the ARMCPRegInfo structs that
use vmsa_ttbr_write specify a raw_writefn...

thanks
-- PMM
Eric Auger Feb. 3, 2023, 10:29 a.m. UTC | #6
Hi Peter,
On 2/2/23 11:53, Peter Maydell wrote:
> On Wed, 1 Feb 2023 at 20:37, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 2/1/23 04:20, Eric Auger wrote:
>>> What I fail to understand is why this code is called with a kvm
>>> accelerated qemu (the test runs by default with kvm).
>> ...
>>> #2  0x000002aaab1500f0 in vmsa_ttbr_write
>>> (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
>>> ../target/arm/helper.c:3784
>>> #3  0x000002aaab14e5a8 in write_raw_cp_reg
>>> (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
>>> v=v@entry=2154950976315703518)
>>
>> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
>> not the .raw_writefn.  We're not supposed to arrive here at all.
> 
> If you only provide a cooked .writefn and no .raw_writefn,
> the default is to assume that the cooked function will also
> work as the raw one. None of the ARMCPRegInfo structs that
> use vmsa_ttbr_write specify a raw_writefn...
I fail to understand. Are you suggesting that we are missing an explicit
.raw_writefn = raw_write in many places, and that's the source of our
trouble? Indeed, entering the TCG code in KVM mode looks weird.

Or is that supposed to work, and we have a bug introduced by the above
commit?

The backtrace of the SIGSEGV shows:

Stack trace of thread 64909:
#0  0x0000aaaadb43ee4c tlb_flush_page_by_mmuidx_async_0 (qemu-kvm +
0x6aee4c)
#1  0x0000aaaadb0076a4 process_queued_cpu_work (qemu-kvm + 0x2776a4)
#2  0x0000aaaadb452ff8 kvm_vcpu_thread_fn (qemu-kvm + 0x6c2ff8)
#3  0x0000aaaadb591bf0 qemu_thread_start (qemu-kvm + 0x801bf0)
#4  0x0000ffff86382a28 start_thread (libc.so.6 + 0x82a28)
#5  0x0000ffff8632bb9c thread_start (libc.so.6 + 0x2bb9c)

Thanks

Eric

> 
> thanks
> -- PMM
>
Peter Maydell Feb. 3, 2023, 10:41 a.m. UTC | #7
On Fri, 3 Feb 2023 at 10:29, Eric Auger <eauger@redhat.com> wrote:
>
> Hi Peter,
> On 2/2/23 11:53, Peter Maydell wrote:
> > On Wed, 1 Feb 2023 at 20:37, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> On 2/1/23 04:20, Eric Auger wrote:
> >>> What I fail to understand is why this code is called with a kvm
> >>> accelerated qemu (the test runs by default with kvm).
> >> ...
> >>> #2  0x000002aaab1500f0 in vmsa_ttbr_write
> >>> (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> >>> ../target/arm/helper.c:3784
> >>> #3  0x000002aaab14e5a8 in write_raw_cp_reg
> >>> (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> >>> v=v@entry=2154950976315703518)
> >>
> >> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" .writefn,
> >> not the .raw_writefn.  We're not supposed to arrive here at all.
> >
> > If you only provide a cooked .writefn and no .raw_writefn,
> > the default is to assume that the cooked function will also
> > work as the raw one. None of the ARMCPRegInfo structs that
> > use vmsa_ttbr_write specify a raw_writefn...
> I fail to understand. Are you suggesting that we are missing an explicit
> .raw_writefn = raw_write in many places, and that's the source of our
> trouble? Indeed, entering the TCG code in KVM mode looks weird.
>
> Or is that supposed to work, and we have a bug introduced by the above
> commit?

I don't know why the above commit specifically has caused a
problem, but yes, the registers which do TLB maintenance calls
in their writefns should set '.raw_writefn = raw_write'.

-- PMM
diff mbox series

Patch

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 82b06c1824..356fe348de 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -1052,23 +1052,25 @@  void tcg_exec_realizefn(CPUState *cpu, Error **errp)
         cc->tcg_ops->initialize();
         tcg_target_initialized = true;
     }
-    tlb_init(cpu);
-    qemu_plugin_vcpu_init_hook(cpu);
 
+    cpu->tb_jmp_cache = g_new0(CPUJumpCache, 1);
+    tlb_init(cpu);
 #ifndef CONFIG_USER_ONLY
     tcg_iommu_init_notifier_list(cpu);
 #endif /* !CONFIG_USER_ONLY */
+    /* qemu_plugin_vcpu_init_hook delayed until cpu_index assigned. */
 }
 
 /* undo the initializations in reverse order */
 void tcg_exec_unrealizefn(CPUState *cpu)
 {
+    qemu_plugin_vcpu_exit_hook(cpu);
 #ifndef CONFIG_USER_ONLY
     tcg_iommu_free_notifier_list(cpu);
 #endif /* !CONFIG_USER_ONLY */
 
-    qemu_plugin_vcpu_exit_hook(cpu);
     tlb_destroy(cpu);
+    g_free(cpu->tb_jmp_cache);
 }
 
 #ifndef CONFIG_USER_ONLY
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 0089578f8f..921944a5ab 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1580,15 +1580,13 @@  void tcg_flush_jmp_cache(CPUState *cpu)
 {
     CPUJumpCache *jc = cpu->tb_jmp_cache;
 
-    if (likely(jc)) {
-        for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
-            qatomic_set(&jc->array[i].tb, NULL);
-        }
-    } else {
-        /* This should happen once during realize, and thus never race. */
-        jc = g_new0(CPUJumpCache, 1);
-        jc = qatomic_xchg(&cpu->tb_jmp_cache, jc);
-        assert(jc == NULL);
+    /* During early initialization, the cache may not yet be allocated. */
+    if (unlikely(jc == NULL)) {
+        return;
+    }
+
+    for (int i = 0; i < TB_JMP_CACHE_SIZE; i++) {
+        qatomic_set(&jc->array[i].tb, NULL);
     }
 }
 
diff --git a/cpu.c b/cpu.c
index 2a09b05205..4a7d865427 100644
--- a/cpu.c
+++ b/cpu.c
@@ -134,15 +134,23 @@  void cpu_exec_realizefn(CPUState *cpu, Error **errp)
     /* cache the cpu class for the hotpath */
     cpu->cc = CPU_GET_CLASS(cpu);
 
-    cpu_list_add(cpu);
     if (!accel_cpu_realizefn(cpu, errp)) {
         return;
     }
+
     /* NB: errp parameter is unused currently */
     if (tcg_enabled()) {
         tcg_exec_realizefn(cpu, errp);
     }
 
+    /* Wait until cpu initialization complete before exposing cpu. */
+    cpu_list_add(cpu);
+
+    /* Plugin initialization must wait until cpu_index assigned. */
+    if (tcg_enabled()) {
+        qemu_plugin_vcpu_init_hook(cpu);
+    }
+
 #ifdef CONFIG_USER_ONLY
     assert(qdev_get_vmsd(DEVICE(cpu)) == NULL ||
            qdev_get_vmsd(DEVICE(cpu))->unmigratable);