Message ID | 1464986428-6739-5-git-send-email-alex.bennee@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 03/06/16 23:40, Alex Bennée wrote: > This is a current DRAFT of a design proposal for upgrading TCG emulation > to take advantage of modern CPUs by running a thread-per-CPU. The > document goes through the various areas of the code affected by such a > change and proposes design requirements for each part of the solution. > > It has been written *without* explicit reference to the current ongoing > efforts to introduce this feature. The hope being we can review and > discuss the design choices without assuming the current choices taken by > the implementation are correct. > > Signed-off-by: Alex Bennée <alex.bennee@linaro.org> > > --- > v1 > - initial version > v2 > - update discussion on locks > - bit more detail on vCPU scheduling > - explicitly mention Translation Blocks > - emulated hardware state already covered by iomutex > - a few minor rewords > v3 > - mention this covers system-mode > - describe main main-loop and lookup hot-path > - mention multi-concurrent-reader lookups > - enumerate reasons for invalidation > - add more details on lookup structures > - describe the softmmu hot-path better > - mention store-after-load barrier problem > --- > docs/multi-thread-tcg.txt | 225 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 225 insertions(+) > create mode 100644 docs/multi-thread-tcg.txt > > diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt > new file mode 100644 > index 0000000..5c88c99 > --- /dev/null > +++ b/docs/multi-thread-tcg.txt > @@ -0,0 +1,225 @@ > +Copyright (c) 2015 Linaro Ltd. > + > +This work is licensed under the terms of the GNU GPL, version 2 or later. See > +the COPYING file in the top-level directory. > + > +STATUS: DRAFTING > + > +Introduction > +============ > + > +This document outlines the design for multi-threaded TCG system-mode > +emulation. The current user-mode emulation mirrors the thread > +structure of the translated executable. > + > +The original system-mode TCG implementation was single threaded and > +dealt with multiple CPUs by with simple round-robin scheduling. This > +simplified a lot of things but became increasingly limited as systems > +being emulated gained additional cores and per-core performance gains > +for host systems started to level off. > + > +vCPU Scheduling > +=============== > + > +We introduce a new running mode where each vCPU will run on its own > +user-space thread. This will be enabled by default for all > +FE/BE combinations that have had the required work done to support > +this safely. > + > +In the general case of running translated code there should be no > +inter-vCPU dependencies and all vCPUs should be able to run at full > +speed. Synchronisation will only be required while accessing internal > +shared data structures or when the emulated architecture requires a > +coherent representation of the emulated machine state. > + > +Shared Data Structures > +====================== > + > +Main Run Loop > +------------- > + > +Even when there is no code being generated there are a number of > +structures associated with the hot-path through the main run-loop. > +These are associated with looking up the next translation block to > +execute. These include: > + > + tb_jmp_cache (per-vCPU, cache of recent jumps) > + tb_phys_hash (global, phys address->tb lookup) > + > +As TB linking only occurs when blocks are in the same page this code > +is critical to performance as looking up the next TB to execute is the > +most common reason to exit the generated code. > + > +DESIGN REQUIREMENT: Make access to lookup structures safe with > +multiple reader/writer threads. Minimise any lock contention to do it. > + > +Global TCG State > +---------------- > + > +We need to protect the entire code generation cycle including any post > +generation patching of the translated code. This also implies a shared > +translation buffer which contains code running on all cores. Any > +execution path that comes to the main run loop will need to hold a > +mutex for code generation. This also includes times when we need flush > +code or entries from any shared lookups/caches. Structures held on a > +per-vCPU basis won't need locking unless other vCPUs will need to > +modify them. > + > +DESIGN REQUIREMENT: Add locking around all code generation and TB > +patching. If possible make shared lookup/caches able to handle multiple > +readers without locks otherwise protect them with locks as well. > + > +Translation Blocks > +------------------ > + > +Currently the whole system shares a single code generation buffer > +which when full will force a flush of all translations and start from > +scratch again. > + > +Once a basic block has been translated it will continue to be used > +until it is invalidated. These invalidation events are typically due > +a change to the state of a physical page: > + - code modification (self modify code, patching code) > + - page changes (new mapping to physical page) Mapping changes does invalidate translation blocks in user-mode emulation only. In system-mode emulation, we just do TLB flush and clean CPU's 'tb_jmp_cache' but don't invalidate any translation block, just follow tlb_flush(). That is why in system-mode emulation we can't do direct jumps to another page or to a TB which span a page boundary. > + - debugging operations (breakpoint insertion/removal) > + > +There exist several places reference to TBs exist which need to be > +cleared in a safe way. You mean: "There are several places where references to TBs exist which ..."? > + > +The main reference is a global page table (l1_map) which provides a 2 > +level look-up for PageDesc structures which contain pointers to the > +start of a linked list of all Translation Blocks in that page (see > +page_next). Actually, 'l1_map' is multi-level, see the comment above 'V_L2_BITS' macro definition. > + > +When a block is invalidated any blocks which directly jump to it need > +to have those jumps removed. This requires navigating the tb_jump_list > +linked list as well as patching the jump code in a safe way. > + > +Finally there are a number of look-up mechanisms for accelerating > +lookup of the next TB. These cache and hashed tables need to have > +references removed in a safe way. > + > +DESIGN REQUIREMENT: Safely handle invalidation of TBs > + - safely patch direct jumps > + - remove central PageDesc lookup entries > + - ensure lookup caches/hashes are safely updated > + > +Memory maps and TLBs > +-------------------- > + > +The memory handling code is fairly critical to the speed of memory > +access in the emulated system. The SoftMMU code is designed so the > +hot-path can be handled entirely within translated code. This is > +handled with a per-vCPU TLB structure which once populated will allow > +a series of accesses to the page to occur without exiting the > +translated code. It is possible to set flags in the TLB address which > +will ensure the slow-path is taken for each access. This can be done > +to support: > + > + - Memory regions (dividing up access to PIO, MMIO and RAM) > + - Dirty page tracking (for code gen, migration and display) > + - Virtual TLB (for translating guest address->real address) > + > +When the TLB tables are updated we need to ensure they are done in a > +safe way by bringing all executing threads to a halt before making the > +modifications. Actually, we just need to be thread-safe when modifying vCPU TLB entries from other thread. If it is possible to do using (relaxed) atomic memory access, we could obviously benefit from it. > + > +DESIGN REQUIREMENTS: > + > + - TLB Flush All/Page > + - can be across-CPUs > + - will need all other CPUs brought to a halt Only *cross-CPU* TLB flush *may* need other CPU brought to halt. > + - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs) > + - This is a per-CPU table - by definition can't race > + - updated by its own thread when the slow-path is forced > + > +Emulated hardware state > +----------------------- > + > +Currently thanks to KVM work any access to IO memory is automatically > +protected by the global iothread mutex. Any IO region that doesn't use > +global mutex is expected to do its own locking. Worth it mentioning that we're going to push iothread locking down in TCG as much as possible? > + > +Memory Consistency > +================== > + > +Between emulated guests and host systems there are a range of memory > +consistency models. Even emulating weakly ordered systems on strongly > +ordered hosts needs to ensure things like store-after-load re-ordering > +can be prevented when the guest wants to. > + > +Memory Barriers > +--------------- > + > +Barriers (sometimes known as fences) provide a mechanism for software > +to enforce a particular ordering of memory operations from the point > +of view of external observers (e.g. another processor core). They can > +apply to any memory operations as well as just loads or stores. > + > +The Linux kernel has an excellent write-up on the various forms of > +memory barrier and the guarantees they can provide [1]. > + > +Barriers are often wrapped around synchronisation primitives to > +provide explicit memory ordering semantics. However they can be used > +by themselves to provide safe lockless access by ensuring for example > +a signal flag will always be set after a payload. > + > +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op > + > +This would enforce a strong load/store ordering so all loads/stores > +complete at the memory barrier. On single-core non-SMP strongly > +ordered backends this could become a NOP. > + > +There may be a case for further refinement if this causes performance > +bottlenecks. I think it's worth mentioning that aside from explicit standalone memory barrier instructions there's also implicit memory ordering semantics which comes with each guest memory access instruction, e.g. relaxed, acquire/release, sequentially consistent. > + > +Memory Control and Maintenance > +------------------------------ > + > +This includes a class of instructions for controlling system cache > +behaviour. While QEMU doesn't model cache behaviour these instructions > +are often seen when code modification has taken place to ensure the > +changes take effect. > + > +Synchronisation Primitives > +-------------------------- > + > +There are two broad types of synchronisation primitives found in > +modern ISAs: atomic instructions and exclusive regions. > + > +The first type offer a simple atomic instruction which will guarantee > +some sort of test and conditional store will be truly atomic w.r.t. > +other cores sharing access to the memory. The classic example is the > +x86 cmpxchg instruction. > + > +The second type offer a pair of load/store instructions which offer a > +guarantee that an region of memory has not been touched between the > +load and store instructions. An example of this is ARM's ldrex/strex > +pair where the strex instruction will return a flag indicating a > +successful store only if no other CPU has accessed the memory region > +since the ldrex. > + > +Traditionally TCG has generated a series of operations that work > +because they are within the context of a single translation block so > +will have completed before another CPU is scheduled. However with > +the ability to have multiple threads running to emulate multiple CPUs > +we will need to explicitly expose these semantics. > + > +DESIGN REQUIREMENTS: > + - atomics What "atomics" means here ... > + - Introduce some atomic TCG ops for the common semantics > + - The default fallback helper function will use qemu_atomics > + - Each backend can then add a more efficient implementation > + - load/store exclusive ... and how they relates to "load/store exclusive"? > + [AJB: > + There are currently a number proposals of interest: > + - Greensocs tweaks to ldst ex (using locks) > + - Slow-path for atomic instruction translation [2] > + - Helper-based Atomic Instruction Emulation (AIE) [3] > + ] We also have a problem emulating 64-bit guest on 32-bit host: while a naturally aligned machine-word memory access is usually atomic, a single 64-bit guest memory access instruction is translated into a series of two 32-bit host memory access instruction. That can (will) break guest code assumptions about memory access atomicity. Kind regards, Sergey > + > +========== > + > +[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt > +[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561 > +[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt new file mode 100644 index 0000000..5c88c99 --- /dev/null +++ b/docs/multi-thread-tcg.txt @@ -0,0 +1,225 @@ +Copyright (c) 2015 Linaro Ltd. + +This work is licensed under the terms of the GNU GPL, version 2 or later. See +the COPYING file in the top-level directory. + +STATUS: DRAFTING + +Introduction +============ + +This document outlines the design for multi-threaded TCG system-mode +emulation. The current user-mode emulation mirrors the thread +structure of the translated executable. + +The original system-mode TCG implementation was single threaded and +dealt with multiple CPUs by with simple round-robin scheduling. This +simplified a lot of things but became increasingly limited as systems +being emulated gained additional cores and per-core performance gains +for host systems started to level off. + +vCPU Scheduling +=============== + +We introduce a new running mode where each vCPU will run on its own +user-space thread. This will be enabled by default for all +FE/BE combinations that have had the required work done to support +this safely. + +In the general case of running translated code there should be no +inter-vCPU dependencies and all vCPUs should be able to run at full +speed. Synchronisation will only be required while accessing internal +shared data structures or when the emulated architecture requires a +coherent representation of the emulated machine state. + +Shared Data Structures +====================== + +Main Run Loop +------------- + +Even when there is no code being generated there are a number of +structures associated with the hot-path through the main run-loop. +These are associated with looking up the next translation block to +execute. These include: + + tb_jmp_cache (per-vCPU, cache of recent jumps) + tb_phys_hash (global, phys address->tb lookup) + +As TB linking only occurs when blocks are in the same page this code +is critical to performance as looking up the next TB to execute is the +most common reason to exit the generated code. + +DESIGN REQUIREMENT: Make access to lookup structures safe with +multiple reader/writer threads. Minimise any lock contention to do it. + +Global TCG State +---------------- + +We need to protect the entire code generation cycle including any post +generation patching of the translated code. This also implies a shared +translation buffer which contains code running on all cores. Any +execution path that comes to the main run loop will need to hold a +mutex for code generation. This also includes times when we need flush +code or entries from any shared lookups/caches. Structures held on a +per-vCPU basis won't need locking unless other vCPUs will need to +modify them. + +DESIGN REQUIREMENT: Add locking around all code generation and TB +patching. If possible make shared lookup/caches able to handle multiple +readers without locks otherwise protect them with locks as well. + +Translation Blocks +------------------ + +Currently the whole system shares a single code generation buffer +which when full will force a flush of all translations and start from +scratch again. + +Once a basic block has been translated it will continue to be used +until it is invalidated. These invalidation events are typically due +a change to the state of a physical page: + - code modification (self modify code, patching code) + - page changes (new mapping to physical page) + - debugging operations (breakpoint insertion/removal) + +There exist several places reference to TBs exist which need to be +cleared in a safe way. + +The main reference is a global page table (l1_map) which provides a 2 +level look-up for PageDesc structures which contain pointers to the +start of a linked list of all Translation Blocks in that page (see +page_next). + +When a block is invalidated any blocks which directly jump to it need +to have those jumps removed. This requires navigating the tb_jump_list +linked list as well as patching the jump code in a safe way. + +Finally there are a number of look-up mechanisms for accelerating +lookup of the next TB. These cache and hashed tables need to have +references removed in a safe way. + +DESIGN REQUIREMENT: Safely handle invalidation of TBs + - safely patch direct jumps + - remove central PageDesc lookup entries + - ensure lookup caches/hashes are safely updated + +Memory maps and TLBs +-------------------- + +The memory handling code is fairly critical to the speed of memory +access in the emulated system. The SoftMMU code is designed so the +hot-path can be handled entirely within translated code. This is +handled with a per-vCPU TLB structure which once populated will allow +a series of accesses to the page to occur without exiting the +translated code. It is possible to set flags in the TLB address which +will ensure the slow-path is taken for each access. This can be done +to support: + + - Memory regions (dividing up access to PIO, MMIO and RAM) + - Dirty page tracking (for code gen, migration and display) + - Virtual TLB (for translating guest address->real address) + +When the TLB tables are updated we need to ensure they are done in a +safe way by bringing all executing threads to a halt before making the +modifications. + +DESIGN REQUIREMENTS: + + - TLB Flush All/Page + - can be across-CPUs + - will need all other CPUs brought to a halt + - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs) + - This is a per-CPU table - by definition can't race + - updated by its own thread when the slow-path is forced + +Emulated hardware state +----------------------- + +Currently thanks to KVM work any access to IO memory is automatically +protected by the global iothread mutex. Any IO region that doesn't use +global mutex is expected to do its own locking. + +Memory Consistency +================== + +Between emulated guests and host systems there are a range of memory +consistency models. Even emulating weakly ordered systems on strongly +ordered hosts needs to ensure things like store-after-load re-ordering +can be prevented when the guest wants to. + +Memory Barriers +--------------- + +Barriers (sometimes known as fences) provide a mechanism for software +to enforce a particular ordering of memory operations from the point +of view of external observers (e.g. another processor core). They can +apply to any memory operations as well as just loads or stores. + +The Linux kernel has an excellent write-up on the various forms of +memory barrier and the guarantees they can provide [1]. + +Barriers are often wrapped around synchronisation primitives to +provide explicit memory ordering semantics. However they can be used +by themselves to provide safe lockless access by ensuring for example +a signal flag will always be set after a payload. + +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op + +This would enforce a strong load/store ordering so all loads/stores +complete at the memory barrier. On single-core non-SMP strongly +ordered backends this could become a NOP. + +There may be a case for further refinement if this causes performance +bottlenecks. + +Memory Control and Maintenance +------------------------------ + +This includes a class of instructions for controlling system cache +behaviour. While QEMU doesn't model cache behaviour these instructions +are often seen when code modification has taken place to ensure the +changes take effect. + +Synchronisation Primitives +-------------------------- + +There are two broad types of synchronisation primitives found in +modern ISAs: atomic instructions and exclusive regions. + +The first type offer a simple atomic instruction which will guarantee +some sort of test and conditional store will be truly atomic w.r.t. +other cores sharing access to the memory. The classic example is the +x86 cmpxchg instruction. + +The second type offer a pair of load/store instructions which offer a +guarantee that an region of memory has not been touched between the +load and store instructions. An example of this is ARM's ldrex/strex +pair where the strex instruction will return a flag indicating a +successful store only if no other CPU has accessed the memory region +since the ldrex. + +Traditionally TCG has generated a series of operations that work +because they are within the context of a single translation block so +will have completed before another CPU is scheduled. However with +the ability to have multiple threads running to emulate multiple CPUs +we will need to explicitly expose these semantics. + +DESIGN REQUIREMENTS: + - atomics + - Introduce some atomic TCG ops for the common semantics + - The default fallback helper function will use qemu_atomics + - Each backend can then add a more efficient implementation + - load/store exclusive + [AJB: + There are currently a number proposals of interest: + - Greensocs tweaks to ldst ex (using locks) + - Slow-path for atomic instruction translation [2] + - Helper-based Atomic Instruction Emulation (AIE) [3] + ] + +========== + +[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt +[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561 +[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
This is a current DRAFT of a design proposal for upgrading TCG emulation to take advantage of modern CPUs by running a thread-per-CPU. The document goes through the various areas of the code affected by such a change and proposes design requirements for each part of the solution. It has been written *without* explicit reference to the current ongoing efforts to introduce this feature. The hope being we can review and discuss the design choices without assuming the current choices taken by the implementation are correct. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> --- v1 - initial version v2 - update discussion on locks - bit more detail on vCPU scheduling - explicitly mention Translation Blocks - emulated hardware state already covered by iomutex - a few minor rewords v3 - mention this covers system-mode - describe main main-loop and lookup hot-path - mention multi-concurrent-reader lookups - enumerate reasons for invalidation - add more details on lookup structures - describe the softmmu hot-path better - mention store-after-load barrier problem --- docs/multi-thread-tcg.txt | 225 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 225 insertions(+) create mode 100644 docs/multi-thread-tcg.txt