Message ID | 1467909880-18834-4-git-send-email-sergey.fedorov@linaro.org (mailing list archive)
State | New, archived
Sergey Fedorov <sergey.fedorov@linaro.org> writes:

> From: Sergey Fedorov <serge.fdrv@gmail.com>
>
> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> ---
>  cpu-exec.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>
>      /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>       * taken outside tb_lock. As system emulation is currently
> -     * single threaded the locks are NOPs.
> +     * single threaded the locks are NOPs. Both locks are to be
> +     * released at the end of tb_find_fast().
>       */
>      mmap_lock();
>      tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>          /* if no translated code available, then translate it now */
>          tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>      }
> -
> -    tb_unlock();
> -    mmap_unlock();

Hmm pushing these outside of tb_find_slow() makes me uncomfortable. I
guess tb_find_fast/slow are intimately tied together but the idea of
taking locks which are the responsibility of the calling function to
clear seems ugly to me.

> }
>
> /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
> }

--
Alex Bennée
> diff --git a/cpu-exec.c b/cpu-exec.c
> index dd0bd5007701..54c935039592 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>
>      /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
>       * taken outside tb_lock. As system emulation is currently
> -     * single threaded the locks are NOPs.
> +     * single threaded the locks are NOPs. Both locks are to be
> +     * released at the end of tb_find_fast().
>       */

Even better: add a "bool *tb_locked" argument to tb_find_slow, and
don't move the mmap_lock release. Then tb_find_fast knows directly
whether tb_lock is taken, and you don't need any of tb_lock_reset or
mmap_lock_reset.

Thanks,

Paolo

>      mmap_lock();
>      tb_lock();
> @@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
>          /* if no translated code available, then translate it now */
>          tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
>      }
> -
> -    tb_unlock();
> -    mmap_unlock();
> }
>
> /* We add the TB in the virtual pc hash table for the fast lookup */
> @@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
> #endif
>      /* See if we can patch the calling TB. */
>      if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        tb_lock();
> +        if (!tb_lock_locked()) {
> +            tb_lock();
> +        }
>          tb_add_jump(*last_tb, tb_exit, tb);
> -        tb_unlock();
>      }
> +
> +    tb_lock_reset();
> +    mmap_lock_reset();
> +
>      return tb;
> }
>
> --
> 1.9.1
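The out-parameter convention Paolo suggests can be sketched in isolation. The following is a minimal, self-contained mock: the pthread mutex and the `tb_find_slow`/`tb_find_fast` bodies here are toy stand-ins for the real cpu-exec.c code, not the actual QEMU implementation.

```c
#include <stdbool.h>
#include <pthread.h>

/* Mock stand-in for QEMU's tb_lock; names are illustrative only. */
static pthread_mutex_t tb_lock_impl = PTHREAD_MUTEX_INITIALIZER;
static void tb_lock(void)   { pthread_mutex_lock(&tb_lock_impl); }
static void tb_unlock(void) { pthread_mutex_unlock(&tb_lock_impl); }

/* The slow path takes tb_lock only when it has to translate, and
 * reports that fact to the caller instead of leaving a dangling lock
 * with no visible owner. */
static int tb_find_slow(int pc, bool *tb_locked)
{
    *tb_locked = false;
    /* pretend the lock-free hash lookup misses for odd pc values */
    if (pc & 1) {
        tb_lock();
        *tb_locked = true;   /* caller now knows it must unlock */
        /* ... tb_gen_code() would run here, under the lock ... */
    }
    return pc;               /* stand-in for the TranslationBlock */
}

static int tb_find_fast(int pc)
{
    bool tb_locked;
    int tb = tb_find_slow(pc, &tb_locked);

    /* patch the previous TB; take the lock only if not already held */
    if (!tb_locked) {
        tb_lock();
    }
    /* ... tb_add_jump() would run here ... */
    tb_unlock();
    return tb;
}
```

The point of the pattern is that lock ownership is always visible at the call site: no `tb_lock_reset()`-style cleanup is needed because the caller is told explicitly whether the callee left the lock held.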
On 08/07/16 11:40, Paolo Bonzini wrote:
> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> don't move the mmap_lock release. Then tb_find_fast knows directly
> whether tb_lock is taken, and you don't need any of tb_lock_reset
> or mmap_lock_reset.

I think we can do even better. One option is using a separate tiny lock
to protect direct jump set/reset instead of tb_lock.

Another option which I've had in my mind for some time is to make direct
jump set/reset thread-safe. We already have thread-safe TB patching. The
only question is the right order of operations and handling
jmp_list_next/jmp_list_first safely. I think that could be done by
removing tb_remove_from_jmp_list() and making RCU-like manipulation with
jmp_list_next/jmp_list_first.

What do you think?

Kind regards,
Sergey
> On 08/07/16 11:40, Paolo Bonzini wrote:
> > Even better: add a "bool *tb_locked" argument to tb_find_slow, and
> > don't move the mmap_lock release. Then tb_find_fast knows directly
> > whether tb_lock is taken, and you don't need any of tb_lock_reset
> > or mmap_lock_reset.
>
> I think we can do even better. One option is using a separate tiny lock
> to protect direct jump set/reset instead of tb_lock.

If you have to use a separate tiny lock, you don't gain anything compared
to the two critical sections, do you?

In any case, this seems to be more complex than necessary. The "bool *"
convention is pretty common in Linux for example---it works well.

The one below is even more complicated. I'm all for simple lock-free
stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but
lock-free lists are too much, especially with the complicated first/next
mechanism of TCG's chained block lists.

Paolo

> Another option which I've had in my mind for some time is to make direct
> jump set/reset thread-safe. We already have thread-safe TB patching. The
> only question is the right order of operations and handling
> jmp_list_next/jmp_list_first safely. I think that could be done by
> removing tb_remove_from_jmp_list() and making RCU-like manipulation with
> jmp_list_next/jmp_list_first. What do you think?
>
> Kind regards,
> Sergey
On 08/07/16 14:02, Paolo Bonzini wrote:
>> On 08/07/16 11:40, Paolo Bonzini wrote:
>>> Even better: add a "bool *tb_locked" argument to tb_find_slow, and
>>> don't move the mmap_lock release. Then tb_find_fast knows directly
>>> whether tb_lock is taken, and you don't need any of tb_lock_reset
>>> or mmap_lock_reset.
>> I think we can do even better. One option is using a separate tiny lock
>> to protect direct jump set/reset instead of tb_lock.
> If you have to use a separate tiny lock, you don't gain anything compared
> to the two critical sections, do you?

If we have a separate lock for direct jump set/reset then we can do fast
TB lookup + direct jump patching without taking tb_lock at all. How much
this would reduce lock contention largely depends on the workload we use.

> In any case, this seems to be more complex than necessary. The "bool *"
> convention is pretty common in Linux for example---it works well.

It could work, no doubts.

> The one below is even more complicated. I'm all for simple lock-free
> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
> lists are too much, especially if with the complicated first/next mechanism
> of TCG's chained block lists.

Direct jump handling code is pretty isolated and self-contained. It
would require to back out of tb_remove_from_jmp_list() and sprinkle a
couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
think. Maybe it could be easier to justify looking at actual patches.

Thanks,
Sergey

> Paolo
>
>> Another option which I've had in my mind for some time is to make direct
>> jump set/reset thread-safe. We already have thread-safe TB patching. The
>> only question is the right order of operations and handling
>> jmp_list_next/jmp_list_first safely. I think that could be done by
>> removing tb_remove_from_jmp_list() and making RCU-like manipulation with
>> jmp_list_next/jmp_list_first. What do you think?
>>
>> Kind regards,
>> Sergey
On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>> to protect direct jump set/reset instead of tb_lock.
>>> If you have to use a separate tiny lock, you don't gain anything compared
>>> to the two critical sections, do you?
> If we have a separate lock for direct jump set/reset then we can do fast
> TB lookup + direct jump patching without taking tb_lock at all. How much
> this would reduce lock contention largely depends on the workload we use.

Yeah, it probably would be easy enough that it's hard to object to it
(unlike the other idea below, which I'm not very comfortable with, at
least without seeing patches).

The main advantage would be that this tiny lock could be a spinlock
rather than a mutex.

Paolo

>> The one below is even more complicated. I'm all for simple lock-free
>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>> lists are too much, especially if with the complicated first/next mechanism
>> of TCG's chained block lists.
>
> Direct jump handling code is pretty isolated and self-contained. It
> would require to back out of tb_remove_from_jmp_list() and sprinkle a
> couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
> think. Maybe it could be easier to justify looking at actual patches.
On 08/07/16 17:07, Paolo Bonzini wrote:
>
> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>> to the two critical sections, do you?
>> If we have a separate lock for direct jump set/reset then we can do fast
>> TB lookup + direct jump patching without taking tb_lock at all. How much
>> this would reduce lock contention largely depends on the workload we use.
> Yeah, it probably would be easy enough that it's hard to object to it
> (unlike the other idea below, which I'm not very comfortable with, at
> least without seeing patches).
>
> The main advantage would be that this tiny lock could be a spinlock
> rather than a mutex.

Well, the problem is more subtle than we thought: tb_find_fast() can
race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
can return a TB which is being invalidated. Then a direct jump can be
set up to this TB. It can happen after concurrent tb_phys_invalidate()
resets all the direct jumps to the TB. Thus we can end up with a direct
jump to an invalidated TB. Even extending the tb_lock critical section
wouldn't help if at least one tb_find_phys() is performed out of the lock.

Kind regards,
Sergey

>>> The one below is even more complicated. I'm all for simple lock-free
>>> stacks (e.g. QSLIST_INSERT_HEAD_ATOMIC and QSLIST_MOVE_ATOMIC), but lock-free
>>> lists are too much, especially if with the complicated first/next mechanism
>>> of TCG's chained block lists.
>> Direct jump handling code is pretty isolated and self-contained. It
>> would require to back out of tb_remove_from_jmp_list() and sprinkle a
>> couple of atomic_rcu_read()/atomic_rcu_set() with some comments, I
>> think. Maybe it could be easier to justify looking at actual patches.
On 08/07/2016 21:55, Sergey Fedorov wrote:
> On 08/07/16 17:07, Paolo Bonzini wrote:
>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>> to the two critical sections, do you?
>>> If we have a separate lock for direct jump set/reset then we can do fast
>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>> this would reduce lock contention largely depends on the workload we use.
>> Yeah, it probably would be easy enough that it's hard to object to it
>> (unlike the other idea below, which I'm not very comfortable with, at
>> least without seeing patches).
>>
>> The main advantage would be that this tiny lock could be a spinlock
>> rather than a mutex.
>
> Well, the problem is more subtle than we thought: tb_find_fast() can
> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
> can return a TB which is being invalidated. Then a direct jump can be
> set up to this TB. It can happen after concurrent tb_phys_invalidate()
> resets all the direct jumps to the TB. Thus we can end up with a direct
> jump to an invalidated TB. Even extending tb_lock critical section
> wouldn't help if at least one tb_find_phys() is performed out of the lock.

Ahem, isn't this exactly why tb_find_phys was invalidating the PC in my
patches, as the very first step?... (The smp_wmb after invalidating the
PC paired with an atomic_rcu_read in tb_find_fast; now we could do it
after computing the hash and before calling qht_remove.)
It turned out that invalidating the PC wasn't as easy as writing -1 to
the pc, but it's possible to do one of these:

1) set cs_base to an invalid value (all-ones works for everything except
   x86---instead anything nonzero is enough except on x86 and SPARC)

2) set the flags to an invalid combination (x86 can use all ones or
   rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).

Paolo
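The cs_base trick works because lookups match a TB on the whole (pc, cs_base, flags) tuple, so clobbering any one field makes every subsequent lookup fail. A minimal sketch of the idea, with a simplified `mock_tb` that merely mirrors the hashed tuple (the all-ones sentinel is assumed to be an impossible cs_base, per option 1 above):

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified TB descriptor; the fields mirror the tuple that
 * tb_find_physical() matches on. Not the real TranslationBlock. */
struct mock_tb {
    uint64_t pc;
    uint64_t cs_base;
    uint32_t flags;
};

/* A lookup succeeds only if every field of the tuple matches. */
static bool tb_match(const struct mock_tb *tb, uint64_t pc,
                     uint64_t cs_base, uint32_t flags)
{
    return tb->pc == pc && tb->cs_base == cs_base && tb->flags == flags;
}

/* First step of invalidation: clobber cs_base so that any concurrent
 * lookup for the real (pc, cs_base, flags) tuple fails from now on.
 * A real implementation would issue smp_wmb() after the store. */
static void tb_mark_invalid(struct mock_tb *tb)
{
    tb->cs_base = UINT64_MAX;   /* assumed-impossible sentinel */
}
```

Once this store is visible, a racing tb_find_physical() can no longer return the dying TB, which is what closes the window described earlier in the thread.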
On 08/07/16 23:18, Paolo Bonzini wrote:
>
> On 08/07/2016 21:55, Sergey Fedorov wrote:
>> On 08/07/16 17:07, Paolo Bonzini wrote:
>>> On 08/07/2016 14:32, Sergey Fedorov wrote:
>>>>>>>> I think we can do even better. One option is using a separate tiny lock
>>>>>>>> to protect direct jump set/reset instead of tb_lock.
>>>>>> If you have to use a separate tiny lock, you don't gain anything compared
>>>>>> to the two critical sections, do you?
>>>> If we have a separate lock for direct jump set/reset then we can do fast
>>>> TB lookup + direct jump patching without taking tb_lock at all. How much
>>>> this would reduce lock contention largely depends on the workload we use.
>>> Yeah, it probably would be easy enough that it's hard to object to it
>>> (unlike the other idea below, which I'm not very comfortable with, at
>>> least without seeing patches).
>>>
>>> The main advantage would be that this tiny lock could be a spinlock
>>> rather than a mutex.
>> Well, the problem is more subtle than we thought: tb_find_fast() can
>> race with tb_phys_invalidate(). The first tb_find_phys() out of the lock
>> can return a TB which is being invalidated. Then a direct jump can be
>> set up to this TB. It can happen after concurrent tb_phys_invalidate()
>> resets all the direct jumps to the TB. Thus we can end up with a direct
>> jump to an invalidated TB. Even extending tb_lock critical section
>> wouldn't help if at least one tb_find_phys() is performed out of the lock.
> Ahem, isn't this exactly why tb_find_phys was invalidating the PC in my
> patches, as the very first step?... (The smp_wmb after invalidating the
> PC paired with an atomic_rcu_read in tb_find_fast; now we could do it
> after computing the hash and before calling qht_remove.)
>
> It turned out that invalidating the PC wasn't as easy as writing -1 to
> the pc, but it's possible to do one of these:
>
> 1) set cs_base to an invalid value (all-ones works for everything except
> x86---instead anything nonzero is enough except on x86 and SPARC)
>
> 2) set the flags to an invalid combination (x86 can use all ones or
> rename the useless HF_SOFTMMU_MASK to HF_INVALID_MASK).

I remember, I've just found that we discussed it in this thread:

http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852

I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
out of the lock, too. This changes things.

Kind regards,
Sergey
On 08/07/2016 22:24, Sergey Fedorov wrote:
> I remember, I've just found that we discussed it in this thread:
>
> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>
> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
> out of the lock, too. This changes things.

But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
section", which originally was written by Fred---most of my contribution
was getting the invalidation right, not the lock-free lookup)
tb_find_physical was also done out of the lock. It was then retried
inside the lock, if it failed.

This is why I needed to fail all concurrent lookups as the first step in
the invalidation.

Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
concepts are the same.

Paolo
On 08/07/16 23:52, Paolo Bonzini wrote:
>
> On 08/07/2016 22:24, Sergey Fedorov wrote:
>> I remember, I've just found that we discussed it in this thread:
>>
>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>
>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>> out of the lock, too. This changes things.
> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
> section", which originally was written by Fred---most of my contribution
> was getting the invalidation right, not the lock-free lookup)
> tb_find_physical was also done out of the lock. It was then retried
> inside the lock, if it failed.
>
> This is why I needed to fail all concurrent lookups as the first step in
> the invalidation.
>
> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
> concepts are the same.

That could work, I think, if we re-check under tb_lock whether the TB is
still valid before adding a direct jump to it.

Kind regards,
Sergey
On 11/07/2016 15:06, Sergey Fedorov wrote:
> On 08/07/16 23:52, Paolo Bonzini wrote:
>>
>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>> I remember, I've just found that we discussed it in this thread:
>>>
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>
>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>> out of the lock, too. This changes things.
>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>> section", which originally was written by Fred---most of my contribution
>> was getting the invalidation right, not the lock-free lookup)
>> tb_find_physical was also done out of the lock. It was then retried
>> inside the lock, if it failed.
>>
>> This is why I needed to fail all concurrent lookups as the first step in
>> the invalidation.
>>
>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>> concepts are the same.
>
> That could work, I think, if we re-check under tb_lock whether the TB is
> still valid before adding a direct jump to it.

Right, this can still happen:

    tb_find_fast                     tb_phys_invalidate
                                       tb_lock
    jmp_cache miss
    -> tb_find_slow
      -> tb_find_physical
         QHT hit
    tb_lock
                                       invalidate tb->pc
                                       remove from lists
                                       tb_unlock
    tb_add_jump
    tb_unlock

I seem to recall that Emilio added a seqlock for this purpose, but
adding a tb_check_invalidated(TranslationBlock *tb) inline function will
also do.

Paolo
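The re-check Paolo and Sergey converge on can be sketched as follows. This is a toy model, not QEMU code: `mock_tb`, its `invalid` field, and `tb_try_add_jump` are invented names, and a plain mutex stands in for tb_lock. The `tb_check_invalidated` helper matches the name Paolo floats above, but its body here is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>

static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;

struct mock_tb {
    bool invalid;                 /* set by the invalidation path */
    struct mock_tb *jmp_target;   /* simplified direct-jump slot */
};

/* Has this TB been invalidated since the lock-free lookup returned it?
 * Must be called with tb_lock held so the answer cannot change under us. */
static bool tb_check_invalidated(const struct mock_tb *tb)
{
    return tb->invalid;
}

/* Patch last_tb -> tb only if tb is still valid under the lock, so a
 * concurrent tb_phys_invalidate() that has already reset all jumps to
 * tb cannot be followed by us re-adding a jump to a dead TB. */
static bool tb_try_add_jump(struct mock_tb *last_tb, struct mock_tb *tb)
{
    bool added = false;

    pthread_mutex_lock(&tb_lock);
    if (!tb_check_invalidated(tb)) {
        last_tb->jmp_target = tb;   /* tb_add_jump() equivalent */
        added = true;
    }
    pthread_mutex_unlock(&tb_lock);
    return added;
}
```

The key ordering property: the invalidator marks the TB invalid and resets existing jumps while holding tb_lock, so once `tb_check_invalidated` returns false under the same lock, the patch cannot race with that reset.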
On 11/07/16 17:03, Paolo Bonzini wrote:
>
> On 11/07/2016 15:06, Sergey Fedorov wrote:
>> On 08/07/16 23:52, Paolo Bonzini wrote:
>>> On 08/07/2016 22:24, Sergey Fedorov wrote:
>>>> I remember, I've just found that we discussed it in this thread:
>>>>
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/401723/focus=406852
>>>>
>>>> I was thinking of just doing 'tb_jmp_cache' lookup out of the lock, not
>>>> tb_find_physical(). Now thanks to QHT, we could do tb_find_physical()
>>>> out of the lock, too. This changes things.
>>> But in my patch ("tcg: move tb_find_fast outside the tb_lock critical
>>> section", which originally was written by Fred---most of my contribution
>>> was getting the invalidation right, not the lock-free lookup)
>>> tb_find_physical was also done out of the lock. It was then retried
>>> inside the lock, if it failed.
>>>
>>> This is why I needed to fail all concurrent lookups as the first step in
>>> the invalidation.
>>>
>>> Emilio's QHT resulted in a rewrite of tb_find_physical, but the basic
>>> concepts are the same.
>> That could work, I think, if we re-check under tb_lock whether the TB is
>> still valid before adding a direct jump to it.
> Right, this can still happen:
>
>     tb_find_fast                     tb_phys_invalidate
>                                        tb_lock
>     jmp_cache miss
>     -> tb_find_slow
>       -> tb_find_physical
>          QHT hit
>     tb_lock
>                                        invalidate tb->pc
>                                        remove from lists
>                                        tb_unlock
>     tb_add_jump
>     tb_unlock
>
> I seem to recall that Emilio added a seqlock for this purpose, but
> adding a tb_check_invalidated(TranslationBlock *tb) inline function will
> also do.

He used seqlock for 'tb_jmp_cache' only:

http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356774

He also added a dedicated field into the TranslationBlock structure to
mark it invalid:

http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356785

Kind regards,
Sergey
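For readers unfamiliar with the seqlock idea mentioned here: writers bump a sequence counter to an odd value before updating and to an even value after, and readers retry whenever they observe an odd or changed sequence. A minimal single-entry sketch, in the spirit of protecting a tb_jmp_cache slot (this is a didactic model, not Emilio's actual patch; real multi-threaded use needs stricter memory barriers than shown):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic unsigned seq;     /* even = stable, odd = write in progress */
static uint64_t cached_tb;       /* stands in for one tb_jmp_cache entry */

static void cache_write(uint64_t tb)
{
    /* odd: signal readers that an update is in progress */
    atomic_fetch_add_explicit(&seq, 1, memory_order_seq_cst);
    cached_tb = tb;
    /* even again: update complete */
    atomic_fetch_add_explicit(&seq, 1, memory_order_seq_cst);
}

static uint64_t cache_read(void)
{
    unsigned s1, s2;
    uint64_t tb;

    do {
        s1 = atomic_load_explicit(&seq, memory_order_acquire);
        tb = cached_tb;
        s2 = atomic_load_explicit(&seq, memory_order_acquire);
    } while (s1 != s2 || (s1 & 1));   /* retry on a torn or in-flight read */
    return tb;
}
```

The appeal for a per-CPU jump cache is that readers never block: a lookup that races with an update simply retries, and the dedicated "invalid" field handles the separate problem of dead TBs returned by the physical lookup.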
diff --git a/cpu-exec.c b/cpu-exec.c
index dd0bd5007701..54c935039592 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -295,7 +295,8 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 
     /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
      * taken outside tb_lock. As system emulation is currently
-     * single threaded the locks are NOPs.
+     * single threaded the locks are NOPs. Both locks are to be
+     * released at the end of tb_find_fast().
      */
     mmap_lock();
     tb_lock();
@@ -308,9 +309,6 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
         /* if no translated code available, then translate it now */
         tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
     }
-
-    tb_unlock();
-    mmap_unlock();
 }
 
 /* We add the TB in the virtual pc hash table for the fast lookup */
@@ -354,10 +352,15 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        tb_lock();
+        if (!tb_lock_locked()) {
+            tb_lock();
+        }
         tb_add_jump(*last_tb, tb_exit, tb);
-        tb_unlock();
     }
+
+    tb_lock_reset();
+    mmap_lock_reset();
+
     return tb;
 }