Message ID: 20200929091727.8692-1-wangjingyi11@huawei.com (mailing list archive)
Series: Add support for ARMv8.6 TWED feature
On 2020-09-29 10:17, Jingyi Wang wrote:
> TWE Delay is an optional feature in the ARMv8.6 extensions. There is
> a performance benefit in waiting for a period of time for an event to
> arrive before taking the trap, as it is common that the event will
> arrive "quite soon" after executing the WFE instruction.

Define "quite soon". Quantify "performance benefits". Which are the
workloads that actually benefit from this imitation of the x86 PLE?

I was opposed to this when the spec was drafted, and I still am, given
that there is zero supporting evidence that it brings any gain over
immediate trapping in an oversubscribed environment (which is the only
case where it matters).

Thanks,

M.

>
> This series adds support for the TWED feature and implements dynamic
> adjustment of the TWE Delay value.
>
> Thanks for Shameer's advice on this series. The functionality of this
> series has been tested on TWED-capable hardware; its performance is
> still under test, and any advice is welcome.
>
> Jingyi Wang (2):
>   KVM: arm64: Make use of TWED feature
>   KVM: arm64: Use dynamic TWE Delay value
>
> Zengruan Ye (2):
>   arm64: cpufeature: TWED support detection
>   KVM: arm64: Add trace for TWED update
>
>  arch/arm64/Kconfig                   | 10 +++++
>  arch/arm64/include/asm/cpucaps.h     |  3 +-
>  arch/arm64/include/asm/kvm_arm.h     |  5 +++
>  arch/arm64/include/asm/kvm_emulate.h | 38 ++++++++++++++++++
>  arch/arm64/include/asm/kvm_host.h    | 19 ++++++++-
>  arch/arm64/include/asm/virt.h        |  8 ++++
>  arch/arm64/kernel/cpufeature.c       | 12 ++++++
>  arch/arm64/kvm/arm.c                 | 58 ++++++++++++++++++++++++++++
>  arch/arm64/kvm/handle_exit.c         |  2 +
>  arch/arm64/kvm/trace_arm.h           | 21 ++++++++++
>  10 files changed, 174 insertions(+), 2 deletions(-)
Hi Marc,

On 9/29/2020 6:50 PM, Marc Zyngier wrote:
> On 2020-09-29 10:17, Jingyi Wang wrote:
>> TWE Delay is an optional feature in the ARMv8.6 extensions. There is
>> a performance benefit in waiting for a period of time for an event
>> to arrive before taking the trap, as it is common that the event
>> will arrive "quite soon" after executing the WFE instruction.
>
> Define "quite soon". Quantify "performance benefits". Which are the
> workloads that actually benefit from this imitation of the x86 PLE?
>
> I was opposed to this when the spec was drafted, and I still am,
> given that there is zero supporting evidence that it brings any gain
> over immediate trapping in an oversubscribed environment (which is
> the only case where it matters).
>
> Thanks,
>
> M.

Sure, I will run more performance tests and post the results as soon
as possible.

Thanks,
Jingyi
Hi all,

Sorry for the delay. I have been testing the performance of the TWED
feature lately. We selected unixbench as the benchmark because some of
its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran unixbench
in a 4-vCPU VM, binding every two vCPUs to one pCPU. A fixed TWED value
was used; here are the results.

twed_value | fstime | fsbuffer | fsdisk
-----------+--------+----------+--------
 disable   |  16.0  |   14.1   |  18.0
 0         |  16.3  |   13.5   |  17.2
 1         |  17.5  |   14.7   |  17.4
 2         |  17.3  |   15.3   |  18.0
 3         |  17.7  |   15.2   |  18.9
 4         |  17.9  |   14.3   |  18.2
 5         |  17.2  |   14.1   |  19.0
 6         |   5.8  |    4.2   |   5.7
 7         |   6.2  |    5.6   |  12.8

Note:
fstime:   File Copy 1024 bufsize 2000 maxblocks
fsbuffer: File Copy 256 bufsize 500 maxblocks
fsdisk:   File Copy 4096 bufsize 8000 maxblocks
These are unixbench index scores; higher is better.

The results show that, compared to running with TWED disabled, the
lock-intensive test items perform better when an appropriate TWED value
is set (gains of 5.6%~11.9%). Meanwhile, the complete unixbench suite
was run to confirm that the other test items are not sensitive to this
parameter.

Thanks,
Jingyi

On 9/29/2020 5:17 PM, Jingyi Wang wrote:
> TWE Delay is an optional feature in the ARMv8.6 extensions. There is
> a performance benefit in waiting for a period of time for an event to
> arrive before taking the trap, as it is common that the event will
> arrive "quite soon" after executing the WFE instruction.
>
> This series adds support for the TWED feature and implements dynamic
> adjustment of the TWE Delay value.
>
> Thanks for Shameer's advice on this series. The functionality of this
> series has been tested on TWED-capable hardware; its performance is
> still under test, and any advice is welcome.
> [...]
Hi Marc,

Gentle ping: could you please comment on this series, or on the test
results so far? Thanks in advance.

Thanks,
Jingyi

On 11/13/2020 3:54 PM, Jingyi Wang wrote:
> Hi all,
>
> Sorry for the delay. I have been testing the performance of the TWED
> feature lately. We selected unixbench as the benchmark because some
> of its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran
> unixbench in a 4-vCPU VM, binding every two vCPUs to one pCPU. A
> fixed TWED value was used; here are the results.
>
> [...]
>
> The results show that, compared to running with TWED disabled, the
> lock-intensive test items perform better when an appropriate TWED
> value is set (gains of 5.6%~11.9%). Meanwhile, the complete unixbench
> suite was run to confirm that the other test items are not sensitive
> to this parameter.
>
> [...]
>> [...]
On 2020-11-13 07:54, Jingyi Wang wrote:
> Hi all,
>
> Sorry for the delay. I have been testing the performance of the TWED
> feature lately. We selected unixbench as the benchmark because some
> of its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran
> unixbench in a 4-vCPU VM, binding every two vCPUs to one pCPU. A
> fixed TWED value was used; here are the results.

How representative is this?

TBH, I only know of two real-world configurations: one where the vCPUs
are pinned to different physical CPUs (in which case your patch has
absolutely no effect as long as there are no concurrent tasks), and one
where there is oversubscription and the scheduler moves things around
as it sees fit, depending on the load.

Having two vCPUs pinned per CPU feels like a test that has been picked
to give the result you wanted. I'd like to see the full picture,
including the case that matters for current use cases. I'm especially
interested in the cases where the system is oversubscribed, because
TWED is definitely going to screw with the scheduler latency.

Thanks,

M.
Hi Marc,

I will cover more circumstances in the later tests. Thanks for the
advice.

Thanks,
Jingyi

On 11/24/2020 7:02 PM, Marc Zyngier wrote:
> On 2020-11-13 07:54, Jingyi Wang wrote:
>> Sorry for the delay. I have been testing the performance of the TWED
>> feature lately. We selected unixbench as the benchmark because some
>> of its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran
>> unixbench in a 4-vCPU VM, binding every two vCPUs to one pCPU. A
>> fixed TWED value was used; here are the results.
>>
>> [...]
>
> How representative is this?
>
> TBH, I only know of two real-world configurations: one where the
> vCPUs are pinned to different physical CPUs (in which case your patch
> has absolutely no effect as long as there are no concurrent tasks),
> and one where there is oversubscription and the scheduler moves
> things around as it sees fit, depending on the load.
>
> Having two vCPUs pinned per CPU feels like a test that has been
> picked to give the result you wanted. I'd like to see the full
> picture, including the case that matters for current use cases. I'm
> especially interested in the cases where the system is oversubscribed,
> because TWED is definitely going to screw with the scheduler latency.
>
> Thanks,
>
> M.