[mttcg] cputlb: Use async tlb_flush_by_mmuidx

Message ID 1456751763-24244-1-git-send-email-a.rigo@virtualopensystems.com (mailing list archive)
State New, archived

Commit Message

alvise rigo Feb. 29, 2016, 1:16 p.m. UTC
As in the case of tlb_flush(), tlb_flush_by_mmuidx also has to queue the
TLB flush if it targets another VCPU. To accomplish this, a new async
work item has been added, together with a new TLBFlushByMMUIdxParams. A
bitmap is used to track the MMU indexes to flush.
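
For illustration, callers keep using the same variadic convention, with the
list of MMU indexes terminated by a negative value; only the cross-VCPU case
changes behaviour and is now queued as async work. The function name and the
index values below are made up for the example:

/* Illustrative caller: 0 and 1 stand in for real target mmu_idx values. */
static void flush_other_vcpu_mmu_indexes(CPUState *other_cs)
{
    /* With this patch, if other_cs is not the calling vCPU the flush is
     * queued with async_run_on_cpu() instead of being performed in place. */
    tlb_flush_by_mmuidx(other_cs, 0, 1, -1);
}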

This patch applies to the multi_tcg_v8 branch.

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 53 insertions(+), 12 deletions(-)

Comments

Peter Maydell Feb. 29, 2016, 1:21 p.m. UTC | #1
On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
> As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
> TLB flush if it targets another VCPU. To accomplish this, a new async
> work has been added, together with a new TLBFlushByMMUIdxParams. A
> bitmap is used to track the MMU indexes to flush.
>
> This patch applies to the multi_tcg_v8 branch.

What's the API for a target CPU emulation to say "and now I must
wait for the TLB op to finish" before completing this guest
instruction?

thanks
-- PMM
Paolo Bonzini Feb. 29, 2016, 1:50 p.m. UTC | #2
On 29/02/2016 14:21, Peter Maydell wrote:
> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>> > As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>> > TLB flush if it targets another VCPU. To accomplish this, a new async
>> > work has been added, together with a new TLBFlushByMMUIdxParams. A
>> > bitmap is used to track the MMU indexes to flush.
>> >
>> > This patch applies to the multi_tcg_v8 branch.
> What's the API for a target CPU emulation to say "and now I must
> wait for the TLB op to finish" before completing this guest
> instruction?

My proposal has been for a while for DMB to put the CPU in a halted
state (remote TLB callbacks then can decrement a counter and signal
cpu_halt_cond when it's zero), but no one has implemented this.

Paolo
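
A minimal sketch of the counter-plus-halted-state idea described above. The
pending_tlb_flushes field and the atomic helpers used on it are assumed for
the example; cpu->halted, cpu->exception_index, cpu_loop_exit() and
qemu_cpu_kick() are existing interfaces. This is not code from any posted
series:

/* Requester side: the vCPU emulating DMB schedules the remote flushes,
 * then enters a halted state until the counter drops back to zero. */
static void dmb_wait_for_remote_flushes(CPUState *cpu, int nr_requests)
{
    atomic_set(&cpu->pending_tlb_flushes, nr_requests); /* hypothetical field */
    cpu->halted = 1;
    cpu->exception_index = EXCP_HALTED;
    cpu_loop_exit(cpu);                /* back to the cpus.c loop */
}

/* Remote side: called at the end of the async flush work on each target
 * vCPU.  When the last outstanding flush completes, wake the requester. */
static void remote_flush_done(CPUState *requester)
{
    if (atomic_fetch_dec(&requester->pending_tlb_flushes) == 1) {
        requester->halted = 0;
        qemu_cpu_kick(requester);      /* signals its halt condition variable */
    }
}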
Peter Maydell Feb. 29, 2016, 1:55 p.m. UTC | #3
On 29 February 2016 at 13:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 29/02/2016 14:21, Peter Maydell wrote:
>> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>> > As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>>> > TLB flush if it targets another VCPU. To accomplish this, a new async
>>> > work has been added, together with a new TLBFlushByMMUIdxParams. A
>>> > bitmap is used to track the MMU indexes to flush.
>>> >
>>> > This patch applies to the multi_tcg_v8 branch.
>> What's the API for a target CPU emulation to say "and now I must
>> wait for the TLB op to finish" before completing this guest
>> instruction?
>
> My proposal has been for a while for DMB to put the CPU in a halted
> state (remote TLB callbacks then can decrement a counter and signal
> cpu_halt_cond when it's zero), but no one has implemented this.

Yeah, that's the other approach -- really split the things that can
be async and do real "wait for completion" at points which must
synchronize. (Needs a little care since DMB is not the only such point.)
An initial implementation that does an immediate wait-for-completion
is probably simpler to review though, and add the real asynchrony
later. And either way you need an API for the target to wait for
completion.

thanks
-- PMM
alvise rigo Feb. 29, 2016, 2:02 p.m. UTC | #4
On Mon, Feb 29, 2016 at 2:55 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 29 February 2016 at 13:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 29/02/2016 14:21, Peter Maydell wrote:
>>> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>>> > As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>>>> > TLB flush if it targets another VCPU. To accomplish this, a new async
>>>> > work has been added, together with a new TLBFlushByMMUIdxParams. A
>>>> > bitmap is used to track the MMU indexes to flush.
>>>> >
>>>> > This patch applies to the multi_tcg_v8 branch.
>>> What's the API for a target CPU emulation to say "and now I must
>>> wait for the TLB op to finish" before completing this guest
>>> instruction?
>>
>> My proposal has been for a while for DMB to put the CPU in a halted
>> state (remote TLB callbacks then can decrement a counter and signal
>> cpu_halt_cond when it's zero), but no one has implemented this.
>
> Yeah, that's the other approach -- really split the things that can
> be async and do real "wait for completion" at points which must
> synchronize. (Needs a little care since DMB is not the only such point.)
> An initial implementation that does an immediate wait-for-completion
> is probably simpler to review though, and add the real asynchrony
> later. And either way you need an API for the target to wait for
> completion.

OK, so basically being sure that the target CPU performs the flush
before executing the next TB is not enough. We need a sort of feedback
that the flush has been done before emulating the next guest
instruction. Did I get it right?

Thank you,
alvise

>
> thanks
> -- PMM
Paolo Bonzini Feb. 29, 2016, 2:06 p.m. UTC | #5
On 29/02/2016 15:02, alvise rigo wrote:
> > Yeah, that's the other approach -- really split the things that can
> > be async and do real "wait for completion" at points which must
> > synchronize. (Needs a little care since DMB is not the only such point.)
> > An initial implementation that does an immediate wait-for-completion
> > is probably simpler to review though, and add the real asynchrony
> > later. And either way you need an API for the target to wait for
> > completion.
> OK, so basically being sure that the target CPU performs the flush
> before executing the next TB is not enough. We need a sort of feedback
> that the flush has been done before emulating the next guest
> instruction. Did I get it right?

That risks getting deadlocks if CPU A asks B to flush the TLB and vice
versa.  Using a halted state means that the VCPU thread goes through the
cpus.c loop and can for example service other CPUs' TLB flush requests.

Paolo
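
To illustrate the difference, a rough sketch (not from any posted patch) of a
blind wait versus the halted-state approach; pending_tlb_flushes is a
hypothetical per-CPU counter of outstanding remote flushes:

/* Deadlock-prone: CPU A spins here waiting for B while B spins waiting
 * for A; neither thread ever gets to run its queued flush work. */
static void wait_for_flushes_blocking(CPUState *cpu)
{
    while (atomic_read(&cpu->pending_tlb_flushes)) {
        /* spin */
    }
}

/* Deadlock-free: mark the vCPU halted and exit to the cpus.c loop, which
 * keeps servicing queued async work (including the other CPU's flush
 * request) and only resumes TCG execution once the counter is zero. */
static void wait_for_flushes_halted(CPUState *cpu)
{
    if (atomic_read(&cpu->pending_tlb_flushes)) {
        cpu->halted = 1;
        cpu->exception_index = EXCP_HALTED;
        cpu_loop_exit(cpu);            /* back to qemu_tcg_cpu_thread_fn */
    }
}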
alvise rigo Feb. 29, 2016, 2:18 p.m. UTC | #6
I see the risk. I will come back with something and let you know.

Thank you,
alvise

On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 29/02/2016 15:02, alvise rigo wrote:
>> > Yeah, that's the other approach -- really split the things that can
>> > be async and do real "wait for completion" at points which must
>> > synchronize. (Needs a little care since DMB is not the only such point.)
>> > An initial implementation that does an immediate wait-for-completion
>> > is probably simpler to review though, and add the real asynchrony
>> > later. And either way you need an API for the target to wait for
>> > completion.
>> OK, so basically being sure that the target CPU performs the flush
>> before executing the next TB is not enough. We need a sort of feedback
>> that the flush has been done before emulating the next guest
>> instruction. Did I get it right?
>
> That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> versa.  Using a halted state means that the VCPU thread goes through the
> cpus.c loop and can for example service other CPUs' TLB flush requests.
>
> Paolo
alvise rigo March 4, 2016, 2:28 p.m. UTC | #7
A small update on this. I have a working implementation of the "halted
state" mechanism for waiting for all the pending flushes to complete.
However, the way I go back to the cpus.c loop (the while(1) in
qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
that always end the TB, a simple cpu_exit() allows me to go back to the
main loop. I think in this case we could also use cpu_loop_exit(), though
that makes the code a bit more complicated since the PC would require some
adjustments.

I then wanted to apply the same "halted state" to the LoadLink helper,
since this one might also cause some flush requests. In this case, we
cannot just call cpu_loop_exit(), since the guest code would miss the
returned value. Forcing the LDREX instruction to also end the TB through an
empty 'is_jmp' condition did the trick, allowing cpu_exit() to be used once
again. Is there another, better solution?

Thank you,
alvise
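
A hedged sketch of the pattern described above (the helper name and
queue_remote_tlb_flushes() are made up; cpu_exit() and the assumption that
the TB ends right after the instruction are the actual mechanism relied on):

/* Sketch only: a TLB-maintenance helper queues the remote flushes and
 * then asks to leave the TCG execution loop with cpu_exit().  Because
 * the TB ends right after this instruction, execution resumes at the
 * next guest instruction once the vCPU leaves its halted state. */
void HELPER(tlb_op_with_remote_flush)(CPUARMState *env)
{
    CPUState *cs = CPU(arm_env_get_cpu(env));

    queue_remote_tlb_flushes(cs);  /* hypothetical: async_run_on_cpu() per vCPU */
    cs->halted = 1;                /* cleared once the remote flushes complete */
    cpu_exit(cs);                  /* go back to the while(1) loop in cpus.c */
}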

On Mon, Feb 29, 2016 at 3:18 PM, alvise rigo <a.rigo@virtualopensystems.com>
wrote:

> I see the risk. I will come back with something and let you know.
>
> Thank you,
> alvise
>
> On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> >
> >
> > On 29/02/2016 15:02, alvise rigo wrote:
> >> > Yeah, that's the other approach -- really split the things that can
> >> > be async and do real "wait for completion" at points which must
> >> > synchronize. (Needs a little care since DMB is not the only such
> point.)
> >> > An initial implementation that does an immediate wait-for-completion
> >> > is probably simpler to review though, and add the real asynchrony
> >> > later. And either way you need an API for the target to wait for
> >> > completion.
> >> OK, so basically being sure that the target CPU performs the flush
> >> before executing the next TB is not enough. We need a sort of feedback
> >> that the flush has been done before emulating the next guest
> >> instruction. Did I get it right?
> >
> > That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> > versa.  Using a halted state means that the VCPU thread goes through the
> > cpus.c loop and can for example service other CPUs' TLB flush requests.
> >
> > Paolo
>
Alex Bennée March 7, 2016, 8:18 p.m. UTC | #8
alvise rigo <a.rigo@virtualopensystems.com> writes:

> A small update on this. I have a working implementation of the "halted
> state" mechanism for waiting all the pending flushes to be completed.
> However, the way I'm going back to the cpus.c loop (the while(1) in
> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
> that always end the TB, a simple cpu_exit() allows me to go back to the
> main loop. I think in this case we can also use the cpu_loop_exit(), though
> making the code a bit more complicated since the PC would require some
> adjustments.
>
> I wanted then to apply the same "halted state" to the LoadLink helper,
> since also this one might cause some flush requests. In this case, we can
> not just call cpu_loop_exit() in that the guest code would miss the
> returned value. Forcing the LDREX instruction to also end the TB through an
> empty 'is_jmp' condition did the trick allowing once again to use
> cpu_exit(). Is there another better solution?

Have you looked at Emilio's tree where he replaced the async_safe_work
mechanism with a mechanism to do the work in the vCPU run loop but halt
all other vCPUs first?

>
> Thank you,
> alvise
>
> On Mon, Feb 29, 2016 at 3:18 PM, alvise rigo <a.rigo@virtualopensystems.com>
> wrote:
>
>> I see the risk. I will come back with something and let you know.
>>
>> Thank you,
>> alvise
>>
>> On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com>
>> wrote:
>> >
>> >
>> > On 29/02/2016 15:02, alvise rigo wrote:
>> >> > Yeah, that's the other approach -- really split the things that can
>> >> > be async and do real "wait for completion" at points which must
>> >> > synchronize. (Needs a little care since DMB is not the only such
>> point.)
>> >> > An initial implementation that does an immediate wait-for-completion
>> >> > is probably simpler to review though, and add the real asynchrony
>> >> > later. And either way you need an API for the target to wait for
>> >> > completion.
>> >> OK, so basically being sure that the target CPU performs the flush
>> >> before executing the next TB is not enough. We need a sort of feedback
>> >> that the flush has been done before emulating the next guest
>> >> instruction. Did I get it right?
>> >
>> > That risks getting deadlocks if CPU A asks B to flush the TLB and vice
>> > versa.  Using a halted state means that the VCPU thread goes through the
>> > cpus.c loop and can for example service other CPUs' TLB flush requests.
>> >
>> > Paolo
>>


--
Alex Bennée
Paolo Bonzini March 7, 2016, 9:18 p.m. UTC | #9
On 04/03/2016 15:28, alvise rigo wrote:
> A small update on this. I have a working implementation of the "halted
> state" mechanism for waiting all the pending flushes to be completed.
> However, the way I'm going back to the cpus.c loop (the while(1) in
> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
> that always end the TB, a simple cpu_exit() allows me to go back to the
> main loop. I think in this case we can also use the cpu_loop_exit(),
> though making the code a bit more complicated since the PC would require
> some adjustments.

I think in both cases the function to use is cpu_loop_exit_restore.  It
will restart execution of the current instruction so it should be fine
as long as you don't call it unconditionally.

If you're not calling it directly from the helper, you need to save
GETPC() in the helper and propagate it down to the call site.  Then the
call site can use it as the last argument.  For an example see
helper_ljmp_protected's call to switch_tss_ra in target-i386/seg_helper.c.
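
A minimal sketch of that pattern with made-up helper names; only GETPC() and
cpu_loop_exit_restore() are existing interfaces, and
must_wait_for_remote_flushes() is a placeholder for whatever condition
requires unwinding:

/* The callee receives the return address and can unwind with it; the
 * guest instruction is restarted, so the wait condition must have made
 * progress before the helper runs again. */
static void do_tlb_op(CPUArchState *env, uintptr_t retaddr)
{
    CPUState *cs = ENV_GET_CPU(env);

    if (must_wait_for_remote_flushes(cs)) {   /* hypothetical condition */
        cpu_loop_exit_restore(cs, retaddr);
    }
    /* ... the actual TLB operation ... */
}

/* GETPC() is only valid in the top-level helper, so capture it there and
 * propagate it down to the code that may need to unwind. */
void HELPER(some_tlb_op)(CPUArchState *env)
{
    do_tlb_op(env, GETPC());
}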

> I wanted then to apply the same "halted state" to the LoadLink helper,
> since also this one might cause some flush requests.

Interesting, where is this documented in the ARM ARM?

> In this case, we
> can not just call cpu_loop_exit() in that the guest code would miss the
> returned value. Forcing the LDREX instruction to also end the TB through
> an empty 'is_jmp' condition did the trick allowing once again to use
> cpu_exit(). Is there another better solution?

Perhaps cpu_loop_exit_restore()?

Paolo
alvise rigo March 11, 2016, 11:08 a.m. UTC | #10
Hi Paolo,

On Mon, Mar 7, 2016 at 10:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 04/03/2016 15:28, alvise rigo wrote:
>> A small update on this. I have a working implementation of the "halted
>> state" mechanism for waiting all the pending flushes to be completed.
>> However, the way I'm going back to the cpus.c loop (the while(1) in
>> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
>> that always end the TB, a simple cpu_exit() allows me to go back to the
>> main loop. I think in this case we can also use the cpu_loop_exit(),
>> though making the code a bit more complicated since the PC would require
>> some adjustments.
>
> I think in both cases the function to use is cpu_loop_exit_restore.  It
> will restart execution of the current instruction so it should be fine
> as long as you don't call it unconditionally.

Indeed, cpu_loop_exit_restore() works just fine for those helpers that
do not return any value, thank you.

>
> If you're not calling it directly from the helper, you need to save
> GETPC() in the helper and propagate it down to the call site.  Then the
> call site can use it as the last argument.  For an example see
> helper_ljmp_protected's call to switch_tss_ra in target-i386/seg_helper.c.
>
>> I wanted then to apply the same "halted state" to the LoadLink helper,
>> since also this one might cause some flush requests.
>
> Interesting, where is this documented in the ARM ARM?

I'm referring to the usual flush requests that an LL(x) operation might
issue in order to have all the VCPUs agree that "x is an exclusive
address". By adding the halted state we ensure that the calling VCPU
resumes its execution only after all the other VCPUs have set the TLB_EXCL
flag (this should also fix the race condition you were worried about).

>
>> In this case, we
>> can not just call cpu_loop_exit() in that the guest code would miss the
>> returned value. Forcing the LDREX instruction to also end the TB through
>> an empty 'is_jmp' condition did the trick allowing once again to use
>> cpu_exit(). Is there another better solution?
>
> Perhaps cpu_loop_exit_restore()?

For some reason this does not work for exiting from helper_ldlink_name in
softmmu_llsc_template.h (the helper returns a "WORD_TYPE"). Execution in
the guest ends up in an infinite loop, most likely because of a deadlock
caused by improper emulation of LDREX and STREX. In any case, the
cpu_exit() solution still works, with the downside of slightly higher
overhead in exiting/entering the TB.

Thank you,
alvise

>
> Paolo
Patch

diff --git a/cputlb.c b/cputlb.c
index 29252d1..1eeeccb 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -103,9 +103,11 @@  void tlb_flush(CPUState *cpu, int flush_global)
     }
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+/* Flush tlb_table[] and tlb_v_table[] of @cpu at MMU indexes given by @bitmap.
+ * Flush also tb_jmp_cache. */
+static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long *bitmap)
 {
-    CPUArchState *env = cpu->env_ptr;
+    int mmu_idx;
 
 #if defined(DEBUG_TLB)
     printf("tlb_flush_by_mmuidx:");
@@ -114,6 +116,41 @@  static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
        links while we are modifying them */
     cpu->current_tb = NULL;
 
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        if (test_bit(mmu_idx, bitmap)) {
+            CPUArchState *env = cpu->env_ptr;
+#if defined(DEBUG_TLB)
+            printf(" %d", mmu_idx);
+#endif
+            memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+            memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        }
+    }
+    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+
+#if defined(DEBUG_TLB)
+    printf("\n");
+#endif
+}
+
+struct TLBFlushByMMUIdxParams {
+    CPUState *cpu;
+    DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+};
+
+static void tlb_flush_by_mmuidx_async_work(void *opaque)
+{
+    struct TLBFlushByMMUIdxParams *params = opaque;
+
+    tlb_tables_flush_bitmap(params->cpu, params->idx_to_flush);
+
+    g_free(params);
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+{
+    DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+
     for (;;) {
         int mmu_idx = va_arg(argp, int);
 
@@ -121,19 +158,23 @@  static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
             break;
         }
 
-#if defined(DEBUG_TLB)
-        printf(" %d", mmu_idx);
-#endif
-
-        memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
-        memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        set_bit(mmu_idx, idxmap);
     }
 
-#if defined(DEBUG_TLB)
-    printf("\n");
-#endif
+    if (!qemu_cpu_is_self(cpu)) {
+        /* We do not set the pending_tlb_flush bit, only a global flush
+         * does that. */
+        if (!atomic_read(&cpu->pending_tlb_flush)) {
+             struct TLBFlushByMMUIdxParams *params;
 
-    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+             params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+             params->cpu = cpu;
+             memcpy(params->idx_to_flush, idxmap, sizeof(idxmap));
+             async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work, params);
+         }
+    } else {
+        tlb_tables_flush_bitmap(cpu, idxmap);
+    }
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, ...)