
[3/3] tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()

Message ID 1467909880-18834-4-git-send-email-sergey.fedorov@linaro.org (mailing list archive)
State New, archived

Commit Message

sergey.fedorov@linaro.org July 7, 2016, 4:44 p.m. UTC
From: Sergey Fedorov <serge.fdrv@gmail.com>

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
---
 cpu-exec.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Comments

Alex Bennée July 7, 2016, 7:36 p.m. UTC | #1
Sergey Fedorov <sergey.fedorov@linaro.org> writes:

> From: Sergey Fedorov <serge.fdrv@gmail.com>
>
> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> ---
>  cpu-exec.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>
>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>           * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs.
> +         * single threaded the locks are NOPs. Both locks are to be
> +         * released at the end of tb_find_fast().
>           */
>          mmap_lock();
>          tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>              /* if no translated code available, then translate it now */
>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>          }
> -
> -        tb_unlock();
> -        mmap_unlock();

Hmm, pushing these outside of tb_find_slow() makes me uncomfortable. I
guess tb_find_fast/slow are intimately tied together, but the idea of
taking locks that it is the calling function's responsibility to
release seems ugly to me.
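
For reference, the tb_lock_locked()/tb_lock_reset()/mmap_lock_reset()
helpers this patch relies on are introduced earlier in the series; a
rough sketch of what they presumably look like, based on the existing
per-thread have_tb_lock flag in translate-all.c (the actual patches may
differ), is:

    /* Sketch only: a per-thread flag records whether tb_lock is held,
     * so a caller can query it and release the lock without knowing
     * who originally took it.  mmap_lock_reset() would be analogous. */
    static __thread bool have_tb_lock;

    static inline bool tb_lock_locked(void)
    {
        return have_tb_lock;
    }

    static inline void tb_lock_reset(void)
    {
        if (have_tb_lock) {
            tb_unlock();    /* tb_unlock() clears have_tb_lock */
        }
    }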

>      }
>
>      /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
>  }


--
Alex Bennée
Paolo Bonzini July 8, 2016, 8:40 a.m. UTC | #2
> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>  
>          /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>           * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs.
> +         * single threaded the locks are NOPs. Both locks are to be
> +         * released at the end of tb_find_fast().
>           */

Even better: add a "bool *tb_locked" argument to tb_find_slow, and
don't move the mmap_lock release.  Then tb_find_fast knows directly
whether tb_lock is taken, and you don't need any of tb_lock_reset
or mmap_lock_reset.
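
A rough sketch of the caller side under that convention (the bool
plumbing below is purely illustrative, not an actual patch):

    bool tb_locked = false;

    tb = tb_find_slow(cpu, pc, cs_base, flags, &tb_locked);
    /* ... jmp_cache update etc. unchanged ... */

    /* See if we can patch the calling TB. */
    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
        if (!tb_locked) {
            tb_lock();
            tb_locked = true;
        }
        tb_add_jump(*last_tb, tb_exit, tb);
    }
    if (tb_locked) {
        tb_unlock();
    }
    return tb;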

Thanks,

Paolo

>          mmap_lock();
>          tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>              /* if no translated code available, then translate it now */
>              tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>          }
> -
> -        tb_unlock();
> -        mmap_unlock();
>      }
>  
>      /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
>  }
>  
> --
> 1.9.1
> 
>
Sergey Fedorov July 8, 2016, 10:25 a.m. UTC | #3
On 08/07/16 11:40, Paolo Bonzini wrote:
> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> don't move the mmap_lock release.  Then tb_find_fast knows directly
> whether tb_lock is taken, and you don't need any of tb_lock_reset
> or mmap_lock_reset.

I think we can do even better. One option is using a separate tiny lock
to protect direct jump set/reset instead of tb_lock.

Another option, which I've had in mind for some time, is to make direct
jump set/reset thread-safe. We already have thread-safe TB patching; the
only question is the right order of operations and handling
jmp_list_next/jmp_list_first safely. I think that could be done by
removing tb_remove_from_jmp_list() and manipulating
jmp_list_next/jmp_list_first in an RCU-like fashion. What do you think?

Kind regards,
Sergey
Paolo Bonzini July 8, 2016, 11:02 a.m. UTC | #4
> On 08/07/16 11:40, Paolo Bonzini wrote:
> > Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> > don't move the mmap_lock release.  Then tb_find_fast knows directly
> > whether tb_lock is taken, and you don't need any of tb_lock_reset
> > or mmap_lock_reset.
> 
> I think we can do even better. One option is using a separate tiny lock
> to protect direct jump set/reset instead of tb_lock.

If you have to use a separate tiny lock, you don't gain anything compared
to the two critical sections, do you?

In any case, this seems to be more complex than necessary.  The "bool *"
convention is pretty common in Linux for example---it works well.

The one below is even more complicated.  I'm all for simple lock-free
stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
lists are too much, especially with the complicated first/next mechanism
of TCG's chained block lists.

Paolo

> Another option which I've had in my mind for some time is to make direct
> jump set/reset thread-safe. We already have thread-safe TB patching. The
> only question is the right order of operations and handling
> jmp_list_next/jmp_list_first safely. I think that could be done by
> removing tb_remove_from_jmp_list() and making RCU-like manipulation with
> jmp_list_next/jmp_list_first. What do you think?
> 
> Kind regards,
> Sergey
>
Sergey Fedorov July 8, 2016, 12:32 p.m. UTC | #5
On 08/07/16 14:02, Paolo Bonzini wrote:
>> On 08/07/16 11:40, Paolo Bonzini wrote:
>>> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
>>> don't move the mmap_lock release.  Then tb_find_fast knows directly
>>> whether tb_lock is taken, and you don't need any of tb_lock_reset
>>> or mmap_lock_reset.
>> I think we can do even better. One option is using a separate tiny lock
>> to protect direct jump set/reset instead of tb_lock.
> If you have to use a separate tiny lock, you don't gain anything compared
> to the two critical sections, do you?

If we have a separate lock for direct jump set/reset then we can do fast
TB lookup + direct jump patching without taking tb_lock at all. How much
this would reduce lock contention largely depends on the workload we use.

>
> In any case, this seems to be more complex than necessary.  The "bool *"
> convention is pretty common in Linux for example---it works well.

It could work, no doubts.

>
> The one below is even more complicated.  I'm all for simple lock-free
> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
> lists are too much, especially if with the complicated first/next mechanism
> of TCG's chained block lists.

Direct jump handling code is pretty isolated and self-contained. It
would require backing out tb_remove_from_jmp_list() and sprinkling a
couple of atomic_rcu_read()/atomic_rcu_set() calls with some comments, I
think. Maybe it would be easier to justify by looking at actual patches.

Thanks,
Sergey

>
> Paolo
>
>> Another option which I've had in my mind for some time is to make direct
>> jump set/reset thread-safe. We already have thread-safe TB patching. The
>> only question is the right order of operations and handling
>> jmp_list_next/jmp_list_first safely. I think that could be done by
>> removing tb_remove_from_jmp_list() and making RCU-like manipulation with
>> jmp_list_next/jmp_list_first. What do you think?
>>
>> Kind regards,
>> Sergey
>>
Paolo Bonzini July 8, 2016, 2:07 p.m. UTC | #6
On 08/07/2016 14:32, Sergey Fedorov wrote:
>>> >> I think we can do even better. One option is using a separate tiny lock
>>> >> to protect direct jump set/reset instead of tb_lock.
>> > If you have to use a separate tiny lock, you don't gain anything compared
>> > to the two critical sections, do you?
> If we have a separate lock for direct jump set/reset then we can do fast
> TB lookup + direct jump patching without taking tb_lock at all. How much
> this would reduce lock contention largely depends on the workload we use.

Yeah, it probably would be easy enough that it's hard to object to it
(unlike the other idea below, which I'm not very comfortable with, at
least without seeing patches).

The main advantage would be that this tiny lock could be a spinlock
rather than a mutex.
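
Something along these lines, for illustration (the tb_jmp_lock name is
an assumption, and a plain mutex is shown where the eventual code could
use a spinlock as discussed):

    /* A dedicated lock protecting only direct-jump patching, so that
     * tb_add_jump() no longer needs tb_lock.  Illustrative sketch;
     * initialised once with qemu_mutex_init() at startup. */
    static QemuMutex tb_jmp_lock;

    /* in tb_find_fast(): */
    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
        qemu_mutex_lock(&tb_jmp_lock);
        tb_add_jump(*last_tb, tb_exit, tb);
        qemu_mutex_unlock(&tb_jmp_lock);
    }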

Paolo

>> The one below is even more complicated.  I'm all for simple lock-free
>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>> lists are too much, especially if with the complicated first/next mechanism
>> of TCG's chained block lists.
> 
> Direct jump handling code is pretty isolated and self-contained. It
> would require to back out of tb_remove_from_jmp_list() and sprinkle a
> couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
> think. Maybe it could be easier to justify looking at actual patches.
Sergey Fedorov July 8, 2016, 7:55 p.m. UTC | #7
On 08/07/16 17:07, Paolo Bonzini wrote:
>
> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>> to the two critical sections, do you?
>> If we have a separate lock for direct jump set/reset then we can do fast
>> TB lookup + direct jump patching without taking tb_lock at all. How much
>> this would reduce lock contention largely depends on the workload we use.
> Yeah, it probably would be easy enough that it's hard to object to it
> (unlike the other idea below, which I'm not very comfortable with, at
> least without seeing patches).
>
> The main advantage would be that this tiny lock could be a spinlock
> rather than a mutex.

Well, the problem is more subtle than we thought: tb_find_fast() can
race with tb_phys_invalidate(). The first tb_find_phys() outside the
lock can return a TB which is being invalidated. A direct jump can then
be set up to this TB after the concurrent tb_phys_invalidate() has
already reset all direct jumps to it, so we end up with a direct jump
to an invalidated TB. Even extending the tb_lock critical section
wouldn't help if at least one tb_find_phys() is performed outside the
lock.

Kind regards,
Sergey

>
>>> The one below is even more complicated.  I'm all for simple lock-free
>>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>>> lists are too much, especially if with the complicated first/next mechanism
>>> of TCG's chained block lists.
>> Direct jump handling code is pretty isolated and self-contained. It
>> would require to back out of tb_remove_from_jmp_list() and sprinkle a
>> couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
>> think. Maybe it could be easier to justify looking at actual patches.
Paolo Bonzini July 8, 2016, 8:18 p.m. UTC | #8
On 08/07/2016 21:55, Sergey Fedorov wrote:
> On 08/07/16 17:07, Paolo Bonzini wrote:
>>
>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>> to the two critical sections, do you?
>>> If we have a separate lock for direct jump set/reset then we can do fast
>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>> this would reduce lock contention largely depends on the workload we use.
>> Yeah, it probably would be easy enough that it's hard to object to it
>> (unlike the other idea below, which I'm not very comfortable with, at
>> least without seeing patches).
>>
>> The main advantage would be that this tiny lock could be a spinlock
>> rather than a mutex.
> 
> Well, the problem is more subtle than we thought: tb_find_fast() can
> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
> can return a TB which is being invalidated. Then a direct jump can be
> set up to this TB. It can happen after concurrent tb_phys_invalidate()
> resets all the direct jumps to the TB. Thus we can end up with a direct
> jump to an invalidated TB. Even extending tb_lock critical section
> wouldn't help if at least one tb_find_phys() is performed out of the lock.

Ahem, isn't this exactly why tb_find_phys was invalidating the PC in my
patches, as the very first step?...  (The smp_wmb after invalidating the
PC paired with an atomic_rcu_read in tb_find_fast; now we could do it
after computing the hash and before calling qht_remove).

It turned out that invalidating the PC wasn't as easy as writing -1 to
the pc, but it's possible to do one of these:

1) set cs_base to an invalid value (all-ones works for everything except
x86; alternatively, anything nonzero is enough, except on x86 and SPARC)

2) set the flags to an invalid combination (x86 can use all-ones, or
rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).
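
A sketch of what such invalidation might look like (the helper name is
hypothetical; the field names follow TranslationBlock):

    /* Hypothetical helper: make the TB unmatchable by concurrent
     * lookups before it is removed from the hash table. */
    static inline void tb_mark_invalid(TranslationBlock *tb)
    {
        tb->cs_base = -1;      /* option 1, with the caveats above */
        /* tb->flags = -1; */  /* option 2: impossible flags combination */
        smp_wmb();             /* pairs with the reader's atomic_rcu_read() */
    }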

Paolo
Sergey Fedorov July 8, 2016, 8:24 p.m. UTC | #9
On 08/07/16 23:18, Paolo Bonzini wrote:
>
> On 08/07/2016 21:55, Sergey Fedorov wrote:
>> On 08/07/16 17:07, Paolo Bonzini wrote:
>>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>>> to the two critical sections, do you?
>>>> If we have a separate lock for direct jump set/reset then we can do fast
>>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>>> this would reduce lock contention largely depends on the workload we use.
>>> Yeah, it probably would be easy enough that it's hard to object to it
>>> (unlike the other idea below, which I'm not very comfortable with, at
>>> least without seeing patches).
>>>
>>> The main advantage would be that this tiny lock could be a spinlock
>>> rather than a mutex.
>> Well, the problem is more subtle than we thought: tb_find_fast() can
>> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
>> can return a TB which is being invalidated. Then a direct jump can be
>> set up to this TB. It can happen after concurrent tb_phys_invalidate()
>> resets all the direct jumps to the TB. Thus we can end up with a direct
>> jump to an invalidated TB. Even extending tb_lock critical section
>> wouldn't help if at least one tb_find_phys() is performed out of the lock.
> Ahem, isn't this exactly why tb_find_phys was invalidating the PC in my
> patches, as the very first step?...  (The smp_wmb after invalidating the
> PC paired with an atomic_rcu_read in tb_find_fast; now we could do it
> after computing the hash and before calling qht_remove).
>
> It turned out that invalidating the PC wasn't as easy as writing -1 to
> the pc, but it's possible to do one of these:
>
> 1) set cs_base to an invalid value (all-ones works for everything except
> x86---instead anything nonzero is enough except on
> x86 and SPARC)
>
> 2) set the flags to an invalid combination (x86 can use all ones or
> rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).

I remember, I've just found that we discussed it in this thread:

http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852

I was thinking of just doing the 'tb_jmp_cache' lookup outside the
lock, not tb_find_physical(). Now, thanks to QHT, we can do
tb_find_physical() outside the lock too. This changes things.

Kind regards,
Sergey
Paolo Bonzini July 8, 2016, 8:52 p.m. UTC | #10
On 08/07/2016 22:24, Sergey Fedorov wrote:
> I remember, I've just found that we discussed it in this thread:
> 
> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
> 
> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
> out of the lock, too. This changes things.

But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
section", which originally was written by Fred---most of my contribution
was getting the invalidation right, not the lock-free lookup)
tb_find_physical was also done out of the lock.  It was then retried
inside the lock, if it failed.

This is why I needed to fail all concurrent lookups as the first step in
the invalidation.

Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
concepts are the same.

Paolo
Sergey Fedorov July 11, 2016, 1:06 p.m. UTC | #11
On 08/07/16 23:52, Paolo Bonzini wrote:
>
> On 08/07/2016 22:24, Sergey Fedorov wrote:
>> I remember, I've just found that we discussed it in this thread:
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>
>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>> out of the lock, too. This changes things.
> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
> section", which originally was written by Fred---most of my contribution
> was getting the invalidation right, not the lock-free lookup)
> tb_find_physical was also done out of the lock.  It was then retried
> inside the lock, if it failed.
>
> This is why I needed to fail all concurrent lookups as the first step in
> the invalidation.
>
> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
> concepts are the same.

That could work, I think, if we re-check under tb_lock whether the TB is
still valid before adding a direct jump to it.

Kind regards,
Sergey
Paolo Bonzini July 11, 2016, 2:03 p.m. UTC | #12
On 11/07/2016 15:06, Sergey Fedorov wrote:
> On 08/07/16 23:52, Paolo Bonzini wrote:
>>
>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>> I remember, I've just found that we discussed it in this thread:
>>>
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>
>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>> out of the lock, too. This changes things.
>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>> section", which originally was written by Fred---most of my contribution
>> was getting the invalidation right, not the lock-free lookup)
>> tb_find_physical was also done out of the lock.  It was then retried
>> inside the lock, if it failed.
>>
>> This is why I needed to fail all concurrent lookups as the first step in
>> the invalidation.
>>
>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>> concepts are the same.
> 
> That could work, I think, if we re-check under tb_lock whether the TB is
> still valid before adding a direct jump to it.

Right, this can still happen:

	tb_find_fast			tb_phys_invalidate
					 tb_lock
	 jmp_cache miss
	 -> tb_find_slow
	  -> tb_find_physical
	   QHT hit
	 tb_lock
					 invalidate tb->pc
					 remove from lists
					 tb_unlock
	 tb_add_jump
	 tb_unlock

I seem to recall that Emilio added a seqlock for this purpose, but
adding a tb_check_invalidated(TranslationBlock *tb) inline function will
also do.
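
In tb_find_fast() that would look roughly like this
(tb_check_invalidated() as named above; how the TB is marked invalid is
left open):

    /* Sketch: re-validate the TB under tb_lock before patching a
     * direct jump, closing the window shown in the diagram above. */
    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
        tb_lock();
        if (!tb_check_invalidated(tb)) {
            tb_add_jump(*last_tb, tb_exit, tb);
        }
        tb_unlock();
    }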

Paolo
Sergey Fedorov July 11, 2016, 2:27 p.m. UTC | #13
On 11/07/16 17:03, Paolo Bonzini wrote:
>
> On 11/07/2016 15:06, Sergey Fedorov wrote:
>> On 08/07/16 23:52, Paolo Bonzini wrote:
>>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>>> I remember, I've just found that we discussed it in this thread:
>>>>
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>>
>>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>>> out of the lock, too. This changes things.
>>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>>> section", which originally was written by Fred---most of my contribution
>>> was getting the invalidation right, not the lock-free lookup)
>>> tb_find_physical was also done out of the lock.  It was then retried
>>> inside the lock, if it failed.
>>>
>>> This is why I needed to fail all concurrent lookups as the first step in
>>> the invalidation.
>>>
>>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>>> concepts are the same.
>> That could work, I think, if we re-check under tb_lock whether the TB is
>> still valid before adding a direct jump to it.
> Right, this can still happen:
>
> 	tb_find_fast			tb_phys_invalidate
> 					 tb_lock
> 	 jmp_cache miss
> 	 -> tb_find_slow
> 	  -> tb_find_physical
> 	   QHT hit
> 	 tb_lock
> 					 invalidate tb->pc
> 					 remove from lists
> 					 tb_unlock
> 	 tb_add_jump
> 	 tb_unlock
>
> I seem to recall that Emilio added a seqlock for this purpose, but
> adding a tb_check_invalidated(TranslationBlock *tb) inline function will
> also do.

He used a seqlock for 'tb_jmp_cache' only:
http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356774

He also added a dedicated field to the TranslationBlock structure to
mark it invalid:
http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356785

Kind regards,
Sergey

Patch

diff --git a/cpu-exec.c b/cpu-exec.c
index dd0bd5007701..54c935039592 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -295,7 +295,8 @@  static TranslationBlock *tb_find_slow(CPUState *cpu,
 
         /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
          * taken outside tb_lock. As system emulation is currently
-         * single threaded the locks are NOPs.
+         * single threaded the locks are NOPs. Both locks are to be
+         * released at the end of tb_find_fast().
          */
         mmap_lock();
         tb_lock();
@@ -308,9 +309,6 @@  static TranslationBlock *tb_find_slow(CPUState *cpu,
             /* if no translated code available, then translate it now */
             tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
         }
-
-        tb_unlock();
-        mmap_unlock();
     }
 
     /* We add the TB in the virtual pc hash table for the fast lookup */
@@ -354,10 +352,15 @@  static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        tb_lock();
+        if (!tb_lock_locked()) {
+            tb_lock();
+        }
         tb_add_jump(*last_tb, tb_exit, tb);
-        tb_unlock();
     }
+
+    tb_lock_reset();
+    mmap_lock_reset();
+
     return tb;
 }