[RFC,0/4] Add support for ARMv8.6 TWED feature

Message ID 20200929091727.8692-1-wangjingyi11@huawei.com (mailing list archive)
Series: Add support for ARMv8.6 TWED feature

Message

Jingyi Wang Sept. 29, 2020, 9:17 a.m. UTC
TWE Delay is an optional feature in the ARMv8.6 extensions. There is a
performance benefit in waiting for a period of time for an event to
arrive before taking the trap, as it is common that the event will
arrive “quite soon” after executing the WFE instruction.

This series adds support for the TWED feature and implements dynamic
adjustment of the TWE Delay value.
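
For background, FEAT_TWED is controlled by two HCR_EL2 fields: TWEDEn
(bit 59) enables the delayed trap and TWEDEL (bits 63:60) selects the
delay, which is at least 2^(TWEDEL + 8) CPU cycles. As a rough sketch
(the helper and macro names here are illustrative and may not match the
patches), programming the delay for a vCPU could look like:

/*
 * Illustrative sketch, not the exact code in this series: the macro
 * names below are made up, but the bit positions follow the
 * architecture (HCR_EL2.TWEDEn is bit 59, HCR_EL2.TWEDEL is bits 63:60).
 */
#define HCR_TWEDEN		(1UL << 59)
#define HCR_TWEDEL_SHIFT	60
#define HCR_TWEDEL_MASK		(0xfUL << HCR_TWEDEL_SHIFT)

static void vcpu_set_twe_delay(struct kvm_vcpu *vcpu, u8 twedel)
{
	u64 hcr = vcpu->arch.hcr_el2;

	/* Enable the delayed WFE trap and select the delay exponent. */
	hcr &= ~HCR_TWEDEL_MASK;
	hcr |= HCR_TWEDEN | ((u64)(twedel & 0xf) << HCR_TWEDEL_SHIFT);

	vcpu->arch.hcr_el2 = hcr;
}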

Thanks to Shameer for his advice on this series. The functionality of
this patch set has been tested on TWED-capable hardware; performance
testing is still in progress. Any advice is welcome.

Jingyi Wang (2):
  KVM: arm64: Make use of TWED feature
  KVM: arm64: Use dynamic TWE Delay value

Zengruan Ye (2):
  arm64: cpufeature: TWED support detection
  KVM: arm64: Add trace for TWED update

 arch/arm64/Kconfig                   | 10 +++++
 arch/arm64/include/asm/cpucaps.h     |  3 +-
 arch/arm64/include/asm/kvm_arm.h     |  5 +++
 arch/arm64/include/asm/kvm_emulate.h | 38 ++++++++++++++++++
 arch/arm64/include/asm/kvm_host.h    | 19 ++++++++-
 arch/arm64/include/asm/virt.h        |  8 ++++
 arch/arm64/kernel/cpufeature.c       | 12 ++++++
 arch/arm64/kvm/arm.c                 | 58 ++++++++++++++++++++++++++++
 arch/arm64/kvm/handle_exit.c         |  2 +
 arch/arm64/kvm/trace_arm.h           | 21 ++++++++++
 10 files changed, 174 insertions(+), 2 deletions(-)
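
For the dynamic adjustment, one plausible policy, loosely modelled on
how x86 KVM resizes its PLE window, is sketched below. This is an
assumption for illustration only; the vcpu->arch.twed field and the
helper are hypothetical, and the actual heuristic in the series may
differ:

#define TWED_MIN	0
#define TWED_MAX	15	/* TWEDEL is a 4-bit field */

/*
 * Hypothetical policy: if the guest's WFE still trapped, the delay was
 * not long enough, so grow it; otherwise shrink it again so the vCPU
 * does not burn cycles waiting on the physical CPU.
 */
static void update_twe_delay(struct kvm_vcpu *vcpu, bool wfe_trapped)
{
	u8 twed = vcpu->arch.twed;

	if (wfe_trapped && twed < TWED_MAX)
		twed++;
	else if (!wfe_trapped && twed > TWED_MIN)
		twed--;

	vcpu->arch.twed = twed;
}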

Comments

Marc Zyngier Sept. 29, 2020, 10:50 a.m. UTC | #1
On 2020-09-29 10:17, Jingyi Wang wrote:
> TWE Delay is an optional feature in the ARMv8.6 extensions. There is a
> performance benefit in waiting for a period of time for an event to
> arrive before taking the trap, as it is common that the event will
> arrive “quite soon” after executing the WFE instruction.

Define "quite soon". Quantify "performance benefits". Which are the
workloads that actually benefit from this imitation of the x86 PLE?

I was opposed to this when the spec was drafted, and I still am, given
that there is zero supporting evidence that it brings any gain over
immediate trapping in an oversubscribed environment (which is the only
case where it matters).

Thanks,

         M.

> 
> This series adds support for the TWED feature and implements dynamic
> adjustment of the TWE Delay value.
> 
> Thanks to Shameer for his advice on this series. The functionality of
> this patch set has been tested on TWED-capable hardware; performance
> testing is still in progress. Any advice is welcome.
> 
> Jingyi Wang (2):
>   KVM: arm64: Make use of TWED feature
>   KVM: arm64: Use dynamic TWE Delay value
> 
> Zengruan Ye (2):
>   arm64: cpufeature: TWED support detection
>   KVM: arm64: Add trace for TWED update
> 
>  arch/arm64/Kconfig                   | 10 +++++
>  arch/arm64/include/asm/cpucaps.h     |  3 +-
>  arch/arm64/include/asm/kvm_arm.h     |  5 +++
>  arch/arm64/include/asm/kvm_emulate.h | 38 ++++++++++++++++++
>  arch/arm64/include/asm/kvm_host.h    | 19 ++++++++-
>  arch/arm64/include/asm/virt.h        |  8 ++++
>  arch/arm64/kernel/cpufeature.c       | 12 ++++++
>  arch/arm64/kvm/arm.c                 | 58 ++++++++++++++++++++++++++++
>  arch/arm64/kvm/handle_exit.c         |  2 +
>  arch/arm64/kvm/trace_arm.h           | 21 ++++++++++
>  10 files changed, 174 insertions(+), 2 deletions(-)
Jingyi Wang Sept. 30, 2020, 1:21 a.m. UTC | #2
Hi Marc,

On 9/29/2020 6:50 PM, Marc Zyngier wrote:
> On 2020-09-29 10:17, Jingyi Wang wrote:
>> TWE Delay is an optional feature in the ARMv8.6 extensions. There is a
>> performance benefit in waiting for a period of time for an event to
>> arrive before taking the trap, as it is common that the event will
>> arrive “quite soon” after executing the WFE instruction.
> 
> Define "quite soon". Quantify "performance benefits". Which workloads
> actually benefit from this imitation of the x86 PLE?
> 
> I was opposed to this when the spec was drafted, and I still am, given
> that there is zero supporting evidence that it brings any gain over
> immediate trapping in an oversubscribed environment (which is the only
> case where it matters).
> 
> Thanks,
> 
>          M.

Sure, I will do more performance tests and post the results as soon as
possible.

Thanks,
Jingyi
Jingyi Wang Nov. 13, 2020, 7:54 a.m. UTC | #3
Hi all,

Sorry for the delay. I have been testing the performance of the TWED
feature lately. We selected UnixBench as the benchmark because some of
its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran UnixBench
on a 4-vCPU VM, binding every two vCPUs to one pCPU. A fixed TWED value
was used for each run; here are the results.

      twed_value   | fstime        | fsbuffer   | fsdisk
     --------------+---------------+------------+------------
      disable      | 16.0          | 14.1       | 18.0
      0            | 16.3          | 13.5       | 17.2
      1            | 17.5          | 14.7       | 17.4
      2            | 17.3          | 15.3       | 18.0
      3            | 17.7          | 15.2       | 18.9
      4            | 17.9          | 14.3       | 18.2
      5            | 17.2          | 14.1       | 19.0
      6            | 5.8           | 4.2        | 5.7
      7            | 6.2           | 5.6        | 12.8

Note:
fstime: File Copy 1024 bufsize 2000 maxblocks
fsbuffer: File Copy 256 bufsize 500 maxblocks
fsdisk: File Copy 4096 bufsize 8000 maxblocks
Values are UnixBench index scores; higher is better.

Compared to TWED being disabled, the lock-intensive test items perform
better when an appropriate TWED value is set (gains of 5.6% to 11.9%).
We also ran the complete UnixBench suite to confirm that the other test
items are not sensitive to this parameter.
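
For reference, assuming the architectural minimum delay of
2^(TWEDEL + 8) CPU cycles, the tested values correspond to at least:

      twed_value 0 -> 2^8  = 256 cycles
      twed_value 3 -> 2^11 = 2048 cycles
      twed_value 5 -> 2^13 = 8192 cycles
      twed_value 7 -> 2^15 = 32768 cycles

One possible reading of the collapse at values 6 and 7 is that the vCPU
then sits in WFE on the physical CPU for tens of thousands of cycles
before trapping, which delays switching to the vCPU that holds the lock.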

Thanks
Jingyi

On 9/29/2020 5:17 PM, Jingyi Wang wrote:
> TWE Delay is an optional feature in the ARMv8.6 extensions. There is a
> performance benefit in waiting for a period of time for an event to
> arrive before taking the trap, as it is common that the event will
> arrive “quite soon” after executing the WFE instruction.
> 
> This series adds support for the TWED feature and implements dynamic
> adjustment of the TWE Delay value.
> 
> Thanks to Shameer for his advice on this series. The functionality of
> this patch set has been tested on TWED-capable hardware; performance
> testing is still in progress. Any advice is welcome.
> 
> Jingyi Wang (2):
>    KVM: arm64: Make use of TWED feature
>    KVM: arm64: Use dynamic TWE Delay value
> 
> Zengruan Ye (2):
>    arm64: cpufeature: TWED support detection
>    KVM: arm64: Add trace for TWED update
> 
>   arch/arm64/Kconfig                   | 10 +++++
>   arch/arm64/include/asm/cpucaps.h     |  3 +-
>   arch/arm64/include/asm/kvm_arm.h     |  5 +++
>   arch/arm64/include/asm/kvm_emulate.h | 38 ++++++++++++++++++
>   arch/arm64/include/asm/kvm_host.h    | 19 ++++++++-
>   arch/arm64/include/asm/virt.h        |  8 ++++
>   arch/arm64/kernel/cpufeature.c       | 12 ++++++
>   arch/arm64/kvm/arm.c                 | 58 ++++++++++++++++++++++++++++
>   arch/arm64/kvm/handle_exit.c         |  2 +
>   arch/arm64/kvm/trace_arm.h           | 21 ++++++++++
>   10 files changed, 174 insertions(+), 2 deletions(-)
>
Jingyi Wang Nov. 24, 2020, 3:19 a.m. UTC | #4
Hi Marc,

Gentle ping, could you please give some comments on this patch or the
current test results? Thanks in advance.

Thanks,
Jingyi

On 11/13/2020 3:54 PM, Jingyi Wang wrote:
> Hi all,
> 
> Sorry for the delay. I have been testing the performance of the TWED
> feature lately. We selected UnixBench as the benchmark because some of
> its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran UnixBench
> on a 4-vCPU VM, binding every two vCPUs to one pCPU. A fixed TWED value
> was used for each run; here are the results.
> 
>       twed_value   | fstime        | fsbuffer   | fsdisk
>      --------------+---------------+------------+------------
>       disable      | 16.0          | 14.1       | 18.0
>       0            | 16.3          | 13.5       | 17.2
>       1            | 17.5          | 14.7       | 17.4
>       2            | 17.3          | 15.3       | 18.0
>       3            | 17.7          | 15.2       | 18.9
>       4            | 17.9          | 14.3       | 18.2
>       5            | 17.2          | 14.1       | 19.0
>       6            | 5.8           | 4.2        | 5.7
>       7            | 6.2           | 5.6        | 12.8
> 
> Note:
> fstime: File Copy 1024 bufsize 2000 maxblocks
> fsbuffer: File Copy 256 bufsize 500 maxblocks
> fsdisk: File Copy 4096 bufsize 8000 maxblocks
> Values are UnixBench index scores; higher is better.
> 
> Compared to TWED being disabled, the lock-intensive test items perform
> better when an appropriate TWED value is set (gains of 5.6% to 11.9%).
> We also ran the complete UnixBench suite to confirm that the other test
> items are not sensitive to this parameter.
> 
> Thanks
> Jingyi
> 
> On 9/29/2020 5:17 PM, Jingyi Wang wrote:
>> TWE Delay is an optional feature in the ARMv8.6 extensions. There is a
>> performance benefit in waiting for a period of time for an event to
>> arrive before taking the trap, as it is common that the event will
>> arrive “quite soon” after executing the WFE instruction.
>>
>> This series adds support for the TWED feature and implements dynamic
>> adjustment of the TWE Delay value.
>>
>> Thanks to Shameer for his advice on this series. The functionality of
>> this patch set has been tested on TWED-capable hardware; performance
>> testing is still in progress. Any advice is welcome.
>>
>> Jingyi Wang (2):
>>    KVM: arm64: Make use of TWED feature
>>    KVM: arm64: Use dynamic TWE Delay value
>>
>> Zengruan Ye (2):
>>    arm64: cpufeature: TWED support detection
>>    KVM: arm64: Add trace for TWED update
>>
>>   arch/arm64/Kconfig                   | 10 +++++
>>   arch/arm64/include/asm/cpucaps.h     |  3 +-
>>   arch/arm64/include/asm/kvm_arm.h     |  5 +++
>>   arch/arm64/include/asm/kvm_emulate.h | 38 ++++++++++++++++++
>>   arch/arm64/include/asm/kvm_host.h    | 19 ++++++++-
>>   arch/arm64/include/asm/virt.h        |  8 ++++
>>   arch/arm64/kernel/cpufeature.c       | 12 ++++++
>>   arch/arm64/kvm/arm.c                 | 58 ++++++++++++++++++++++++++++
>>   arch/arm64/kvm/handle_exit.c         |  2 +
>>   arch/arm64/kvm/trace_arm.h           | 21 ++++++++++
>>   10 files changed, 174 insertions(+), 2 deletions(-)
>>
Marc Zyngier Nov. 24, 2020, 11:02 a.m. UTC | #5
On 2020-11-13 07:54, Jingyi Wang wrote:
> Hi all,
> 
> Sorry for the delay. I have been testing the performance of the TWED
> feature lately. We selected UnixBench as the benchmark because some of
> its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran UnixBench
> on a 4-vCPU VM, binding every two vCPUs to one pCPU. A fixed TWED value
> was used for each run; here are the results.

How representative is this?

TBH, I only know of two real world configurations: one where
the vCPUs are pinned to different physical CPUs (and in this
case your patch has absolutely no effect as long as there are
no concurrent tasks), and one where there is oversubscription,
and the scheduler moves things around as it sees fit, depending
on the load.

Having two vCPUs pinned per CPU feels like a test that has been
picked to give the result you wanted. I'd like to see the full
picture, including the case that matters for current use cases.
I'm especially interested in the cases where the system is
oversubscribed, because TWED is definitely going to screw with
the scheduler latency.

Thanks,

         M.
Jingyi Wang Nov. 26, 2020, 2:31 a.m. UTC | #6
Hi Marc,

I will take more scenarios into account in later tests. Thanks for the
advice.

Thanks,
Jingyi


On 11/24/2020 7:02 PM, Marc Zyngier wrote:
> On 2020-11-13 07:54, Jingyi Wang wrote:
>> Hi all,
>>
>> Sorry for the delay. I have been testing the performance of the TWED
>> feature lately. We selected UnixBench as the benchmark because some of
>> its items are lock-intensive (fstime/fsbuffer/fsdisk). We ran UnixBench
>> on a 4-vCPU VM, binding every two vCPUs to one pCPU. A fixed TWED value
>> was used for each run; here are the results.
> 
> How representative is this?
> 
> TBH, I only know of two real world configurations: one where
> the vCPUs are pinned to different physical CPUs (and in this
> case your patch has absolutely no effect as long as there is
> no concurrent tasks), and one where there is oversubscription,
> and the scheduler moves things around as it sees fit, depending
> on the load.
> 
> Having two vCPUs pinned per CPU feels like a test that has been
> picked to give the result you wanted. I'd like to see the full
> picture, including the case that matters for current use cases.
> I'm especially interested in the cases where the system is
> oversubscribed, because TWED is definitely going to screw with
> the scheduler latency.
> 
> Thanks,
> 
>          M.