
[v3] target-i386: present virtual L3 cache info for vcpus

Message ID 1472782975-20056-1-git-send-email-longpeng2@huawei.com (mailing list archive)
State New, archived

Commit Message

Longpeng(Mike) Sept. 2, 2016, 2:22 a.m. UTC
From: "Longpeng(Mike)" <longpeng@huawei2.com>

Some software algorithms depend on the hardware's cache info. For example,
in the x86 Linux kernel, when cpu1 wants to wake up a task on cpu2, cpu1
triggers a resched IPI and tells cpu2 to do the wakeup if the two do not share
a last-level cache (LLC). Conversely, cpu1 accesses cpu2's runqueue directly
if they share the LLC. The relevant Linux kernel code is below:

	static void ttwu_queue(struct task_struct *p, int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		......
		if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
			......
			ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
			return;
		}
		......
		ttwu_do_activate(rq, p, 0); /* access target's rq directly */
		......
	}

On real hardware, the cpus on the same socket share an L3 cache, so one cpu
does not trigger a resched IPI when waking up a task on another. But QEMU does
not present virtual L3 cache info to the VM, so a Linux guest triggers lots of
RES IPIs under some workloads even if the virtual cpus belong to the same
virtual socket.
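
The behaviour described above can be illustrated with a small model. This is
only a sketch, not kernel code: `llc_id[]`, `model_cpus_share_cache()` and
`count_remote_wakeups()` are hypothetical names, and it assumes each cpu is
tagged with an identifier of its last-level cache domain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: llc_id[i] identifies the LLC domain of cpu i. */
static bool model_cpus_share_cache(const int *llc_id, int this_cpu, int that_cpu)
{
    return llc_id[this_cpu] == llc_id[that_cpu];
}

/*
 * Count the ordered (waker, target) pairs, waker != target, that would take
 * the remote path (resched IPI) instead of touching the runqueue directly.
 */
static size_t count_remote_wakeups(const int *llc_id, int nr_cpus)
{
    size_t remote = 0;

    for (int waker = 0; waker < nr_cpus; waker++) {
        for (int target = 0; target < nr_cpus; target++) {
            if (waker != target &&
                !model_cpus_share_cache(llc_id, waker, target)) {
                remote++;
            }
        }
    }
    return remote;
}
```

With one LLC domain per core (for example `{0, 0, 1, 1}` for two cores with
two threads each), 8 of the 12 ordered cpu pairs take the IPI path; with a
single socket-wide domain (`{0, 0, 0, 0}`) none do.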

For KVM, this degrades performance, because the IPIs sent by the guest cause
lots of vmexits.

The workload is a SAP HANA test suite. We ran it for one round (about 40
minutes) and counted the RES IPIs triggered in the (Suse11sp3) guest during
that period:

        No-L3           With-L3(applied this patch)
cpu0:	363890		44582
cpu1:	373405		43109
cpu2:	340783		43797
cpu3:	333854		43409
cpu4:	327170		40038
cpu5:	325491		39922
cpu6:	319129		42391
cpu7:	306480		41035
cpu8:	161139		32188
cpu9:	164649		31024
cpu10:	149823		30398
cpu11:	149823		32455
cpu12:	164830		35143
cpu13:	172269		35805
cpu14:	179979		33898
cpu15:	194505		32754
avg:	268963.6	40129.8

The VM's topology is "1*socket 8*cores 2*threads".
After presenting virtual L3 cache info to the VM, the number of RES IPIs in the
guest drops by 85%.
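
As a quick check of the 85% figure, the relative reduction can be computed
from the two averages in the table. The helper name `reduction_percent()` is
only illustrative.

```c
/* Relative reduction, in percent, going from `before` to `after`. */
static double reduction_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

`reduction_percent(268963.6, 40129.8)` gives roughly 85.1, which matches the
rounded 85% quoted above.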

What's more, for KVM, vcpus sending IPIs causes vmexits, which are expensive.
We also tested the overall system performance when the vcpus actually run on
separate physical sockets. With the L3 cache, performance improves by
7.2%~33.1% (avg: 15.7%).

Signed-off-by: Longpeng(Mike) <longpeng@huawei2.com>
---
Changes since v2:
  - add a more useful commit message.
  - rename "compat-cache" to "l3-cache-shared".

Changes since v1:
  - fix the compat problem: set compat_props on PC_COMPAT_2_7.
  - fix an "intentionally introduced bug": make the Intel and AMD info consistent.
  - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
  - test the performance when vcpus run on separate sockets: with L3 cache,
    the performance improves by 7.2%~33.1% (avg: 15.7%).
---
 include/hw/i386/pc.h |  8 ++++++++
 target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
 target-i386/cpu.h    |  5 +++++
 3 files changed, 57 insertions(+), 5 deletions(-)
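
For reference, the CPUID.(EAX=4, ECX=3):EAX[25:14] field mentioned in the v1
changelog holds the maximum number of addressable logical-processor IDs
sharing the cache, minus one. The sketch below shows one way to derive it from
the topology, modeled on the `((1 << pkg_offset) - 1) << 14` expression in
this patch. The helpers `bits_for_count()` and `l3_sharing_field()` are
hypothetical names, and the sketch assumes the package offset is the sum of
the thread and core ID bit widths.

```c
#include <stdint.h>

/* Number of bits needed to encode IDs 0..count-1 (assumes count >= 1). */
static unsigned bits_for_count(unsigned count)
{
    unsigned bits = 0;

    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/*
 * Value of EAX bits 25:14 for an L3 shared by a whole package.
 * No overflow check: assumes the package offset fits the 12-bit field.
 */
static uint32_t l3_sharing_field(unsigned nr_cores, unsigned nr_threads)
{
    unsigned pkg_offset = bits_for_count(nr_cores) + bits_for_count(nr_threads);

    return ((1u << pkg_offset) - 1) << 14;
}
```

For the "1*socket 8*cores 2*threads" topology used in the test, this yields 15
in bits 25:14, i.e. 16 logical-processor IDs sharing the L3.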

Comments

Gonglei (Arei) Sept. 2, 2016, 2:41 a.m. UTC | #1
> -----Original Message-----
> From: longpeng
> Sent: Friday, September 02, 2016 10:23 AM
> To: ehabkost@redhat.com; rth@twiddle.net; pbonzini@redhat.com;
> mst@redhat.com
> Cc: Zhaoshenglong; Gonglei (Arei); Huangpeng (Peter); Herongguang (Stephen);
> qemu-devel@nongnu.org; Longpeng(Mike)
> Subject: [PATCH v3] target-i386: present virtual L3 cache info for vcpus
> 
> From: "Longpeng(Mike)" <longpeng@huawei2.com>
> 

There is a typo in the email address; please resend the v3.

> Some software algorithms are based on the hardware's cache info, for
> example,
> for x86 linux kernel, when cpu1 want to wakeup a task on cpu2, cpu1 will
> trigger
> a resched IPI and told cpu2 to do the wakeup if they don't share low level
> cache. Oppositely, cpu1 will access cpu2's runqueue directly if they share llc.
> The relevant linux-kernel code as bellow:
> 
> 	static void ttwu_queue(struct task_struct *p, int cpu)
> 	{
> 		struct rq *rq = cpu_rq(cpu);
> 		......
> 		if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
> 			......
> 			ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
> 			return;
> 		}
> 		......
> 		ttwu_do_activate(rq, p, 0); /* access target's rq directly */
> 		......
> 	}
> 
> In real hardware, the cpus on the same socket share L3 cache, so one won't
> trigger a resched IPIs when wakeup a task on others. But QEMU doesn't
> present a
> virtual L3 cache info for VM, then the linux guest will trigger lots of RES IPIs
> under some workloads even if the virtual cpus belongs to the same virtual
> socket.
> 
> For KVM, this degrades performance, because there will be lots of vmexit due
> to
> guest send IPIs.
> 
> The workload is a SAP HANA's testsuite, we run it one round(about 40
> minuates)
> and observe the (Suse11sp3)Guest's amounts of RES IPIs which triggering
> during
> the period:
> 
>         No-L3           With-L3(applied this patch)
> cpu0:	363890		44582
> cpu1:	373405		43109
> cpu2:	340783		43797
> cpu3:	333854		43409
> cpu4:	327170		40038
> cpu5:	325491		39922
> cpu6:	319129		42391
> cpu7:	306480		41035
> cpu8:	161139		32188
> cpu9:	164649		31024
> cpu10:	149823		30398
> cpu11:	149823		32455
> cpu12:	164830		35143
> cpu13:	172269		35805
> cpu14:	179979		33898
> cpu15:	194505		32754
> avg:	268963.6	40129.8
> 
> The VM's topology is "1*socket 8*cores 2*threads".
> After present virtual L3 cache info for VM, the amounts of RES IPIs in guest
> reduce 85%.
> 
> What's more, for KVM, vcpus send IPIs will cause vmexit which is expensive.
> We had tested the overall system performance if vcpus actually run on sparate
> physical socket. With L3 cache, the performance improves
> 7.2%~33.1%(avg:15.7%).
> 
> Signed-off-by: Longpeng(Mike) <longpeng@huawei2.com>
>
Here as well.

Regards,
-Gonglei
Michael S. Tsirkin Sept. 2, 2016, 10:52 p.m. UTC | #2
On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
> From: "Longpeng(Mike)" <longpeng@huawei2.com>
> 
> Some software algorithms are based on the hardware's cache info, for example,
> for x86 linux kernel, when cpu1 want to wakeup a task on cpu2, cpu1 will trigger
> a resched IPI and told cpu2 to do the wakeup if they don't share low level
> cache. Oppositely, cpu1 will access cpu2's runqueue directly if they share llc.
> The relevant linux-kernel code as bellow:
> 
> 	static void ttwu_queue(struct task_struct *p, int cpu)
> 	{
> 		struct rq *rq = cpu_rq(cpu);
> 		......
> 		if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
> 			......
> 			ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
> 			return;
> 		}
> 		......
> 		ttwu_do_activate(rq, p, 0); /* access target's rq directly */
> 		......
> 	}
> 
> In real hardware, the cpus on the same socket share L3 cache, so one won't
> trigger a resched IPIs when wakeup a task on others. But QEMU doesn't present a
> virtual L3 cache info for VM, then the linux guest will trigger lots of RES IPIs
> under some workloads even if the virtual cpus belongs to the same virtual socket.
> 
> For KVM, this degrades performance, because there will be lots of vmexit due to
> guest send IPIs.
> 
> The workload is a SAP HANA's testsuite, we run it one round(about 40 minuates)
> and observe the (Suse11sp3)Guest's amounts of RES IPIs which triggering during
> the period:
> 
>         No-L3           With-L3(applied this patch)
> cpu0:	363890		44582
> cpu1:	373405		43109
> cpu2:	340783		43797
> cpu3:	333854		43409
> cpu4:	327170		40038
> cpu5:	325491		39922
> cpu6:	319129		42391
> cpu7:	306480		41035
> cpu8:	161139		32188
> cpu9:	164649		31024
> cpu10:	149823		30398
> cpu11:	149823		32455
> cpu12:	164830		35143
> cpu13:	172269		35805
> cpu14:	179979		33898
> cpu15:	194505		32754
> avg:	268963.6	40129.8
> 
> The VM's topology is "1*socket 8*cores 2*threads".
> After present virtual L3 cache info for VM, the amounts of RES IPIs in guest
> reduce 85%.
> 
> What's more, for KVM, vcpus send IPIs will cause vmexit which is expensive.
> We had tested the overall system performance if vcpus actually run on sparate
> physical socket. With L3 cache, the performance improves 7.2%~33.1%(avg:15.7%).
> 
> Signed-off-by: Longpeng(Mike) <longpeng@huawei2.com>

For PC bits:
Acked-by: Michael S. Tsirkin <mst@redhat.com>


> ---
> Changes since v2:
>   - add more useful commit mesage.
>   - rename "compat-cache" to "l3-cache-shared".
> 
> Changes since v1:
>   - fix the compat problem: set compat_props on PC_COMPAT_2_7.
>   - fix a "intentionally introducde bug": make intel's and amd's consistently.
>   - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
>   - test the performance if vcpus running on sparate sockets: with L3 cache,
>     the performance improves 7.2%~33.1%(avg: 15.7%).
> ---
>  include/hw/i386/pc.h |  8 ++++++++
>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>  target-i386/cpu.h    |  5 +++++
>  3 files changed, 57 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 74c175c..c92c54e 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>  int e820_get_num_entries(void);
>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>  
> +#define PC_COMPAT_2_7 \
> +    {\
> +        .driver   = TYPE_X86_CPU,\
> +        .property = "l3-cache-shared",\
> +        .value    = "off",\
> +    },
> +
>  #define PC_COMPAT_2_6 \
> +    PC_COMPAT_2_7 \
>      HW_COMPAT_2_6 \
>      {\
>          .driver   = "fw_cfg_io",\
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6a1afab..4f93922 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -57,6 +57,7 @@
>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>  
>  
>  /* CPUID Leaf 4 constants: */
> @@ -131,11 +132,18 @@
>  #define L2_LINES_PER_TAG       1
>  #define L2_SIZE_KB_AMD       512
>  
> -/* No L3 cache: */
> +/* Level 3 unified cache: */
>  #define L3_SIZE_KB             0 /* disabled */
>  #define L3_ASSOCIATIVITY       0 /* disabled */
>  #define L3_LINES_PER_TAG       0 /* disabled */
>  #define L3_LINE_SIZE           0 /* disabled */
> +#define L3_N_LINE_SIZE         64
> +#define L3_N_ASSOCIATIVITY     16
> +#define L3_N_SETS           16384
> +#define L3_N_PARTITIONS         1
> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
> +#define L3_N_LINES_PER_TAG      1
> +#define L3_N_SIZE_KB_AMD    16384
>  
>  /* TLB definitions: */
>  
> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>  {
>      X86CPU *cpu = x86_env_get_cpu(env);
>      CPUState *cs = CPU(cpu);
> +    uint32_t pkg_offset;
>  
>      /* test if maximum index reached */
>      if (index & 0x80000000) {
> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          }
>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>          *ebx = 0;
> -        *ecx = 0;
> +        if (!cpu->enable_l3_cache_shared) {
> +            *ecx = 0;
> +        } else {
> +            *ecx = L3_N_DESCRIPTOR;
> +        }
>          *edx = (L1D_DESCRIPTOR << 16) | \
>                 (L1I_DESCRIPTOR <<  8) | \
>                 (L2_DESCRIPTOR);
> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>                  *ecx = L2_SETS - 1;
>                  *edx = CPUID_4_NO_INVD_SHARING;
>                  break;
> +            case 3: /* L3 cache info */
> +                if (!cpu->enable_l3_cache_shared) {
> +                    *eax = 0;
> +                    *ebx = 0;
> +                    *ecx = 0;
> +                    *edx = 0;
> +                    break;
> +                }
> +                *eax |= CPUID_4_TYPE_UNIFIED | \
> +                        CPUID_4_LEVEL(3) | \
> +                        CPUID_4_SELF_INIT_LEVEL;
> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
> +                *eax |= ((1 << pkg_offset) - 1) << 14;
> +                *ebx = (L3_N_LINE_SIZE - 1) | \
> +                       ((L3_N_PARTITIONS - 1) << 12) | \
> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
> +                *ecx = L3_N_SETS - 1;
> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
> +                break;
>              default: /* end of info */
>                  *eax = 0;
>                  *ebx = 0;
> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
> -        *edx = ((L3_SIZE_KB/512) << 18) | \
> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        if (!cpu->enable_l3_cache_shared) {
> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
> +        } else {
> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
> +        }
>          break;
>      case 0x80000007:
>          *eax = 0;
> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
> +    DEFINE_PROP_BOOL("l3-cache-shared", X86CPU, enable_l3_cache_shared, true),
>      DEFINE_PROP_END_OF_LIST()
>  };
>  
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 65615c0..355bf47 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1202,6 +1202,11 @@ struct X86CPU {
>       */
>      bool enable_lmce;
>  
> +    /* Compatibility bits for old machine types.
> +     * If true present virtual l3 cache for VM.

"pretend that all CPUs share an l3 cache"?


> +     */
> +    bool enable_l3_cache_shared;
> +
>      /* Compatibility bits for old machine types: */
>      bool enable_cpuid_0xb;
>  
> -- 
> 1.8.3.1
>
Longpeng(Mike) Sept. 5, 2016, 1:16 a.m. UTC | #3
Hi Michael,

On 2016/9/3 6:52, Michael S. Tsirkin wrote:

> On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
>> From: "Longpeng(Mike)" <longpeng@huawei2.com>
>>
>> Some software algorithms are based on the hardware's cache info, for example,
>> for x86 linux kernel, when cpu1 want to wakeup a task on cpu2, cpu1 will trigger
>> a resched IPI and told cpu2 to do the wakeup if they don't share low level
>> cache. Oppositely, cpu1 will access cpu2's runqueue directly if they share llc.
>> The relevant linux-kernel code as bellow:
>>
>> 	static void ttwu_queue(struct task_struct *p, int cpu)
>> 	{
>> 		struct rq *rq = cpu_rq(cpu);
>> 		......
>> 		if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>> 			......
>> 			ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>> 			return;
>> 		}
>> 		......
>> 		ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>> 		......
>> 	}
>>
>> In real hardware, the cpus on the same socket share L3 cache, so one won't
>> trigger a resched IPIs when wakeup a task on others. But QEMU doesn't present a
>> virtual L3 cache info for VM, then the linux guest will trigger lots of RES IPIs
>> under some workloads even if the virtual cpus belongs to the same virtual socket.
>>
>> For KVM, this degrades performance, because there will be lots of vmexit due to
>> guest send IPIs.
>>
>> The workload is a SAP HANA's testsuite, we run it one round(about 40 minuates)
>> and observe the (Suse11sp3)Guest's amounts of RES IPIs which triggering during
>> the period:
>>
>>         No-L3           With-L3(applied this patch)
>> cpu0:	363890		44582
>> cpu1:	373405		43109
>> cpu2:	340783		43797
>> cpu3:	333854		43409
>> cpu4:	327170		40038
>> cpu5:	325491		39922
>> cpu6:	319129		42391
>> cpu7:	306480		41035
>> cpu8:	161139		32188
>> cpu9:	164649		31024
>> cpu10:	149823		30398
>> cpu11:	149823		32455
>> cpu12:	164830		35143
>> cpu13:	172269		35805
>> cpu14:	179979		33898
>> cpu15:	194505		32754
>> avg:	268963.6	40129.8
>>
>> The VM's topology is "1*socket 8*cores 2*threads".
>> After present virtual L3 cache info for VM, the amounts of RES IPIs in guest
>> reduce 85%.
>>
>> What's more, for KVM, vcpus send IPIs will cause vmexit which is expensive.
>> We had tested the overall system performance if vcpus actually run on sparate
>> physical socket. With L3 cache, the performance improves 7.2%~33.1%(avg:15.7%).
>>
>> Signed-off-by: Longpeng(Mike) <longpeng@huawei2.com>
> 
> For PC bits:
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

Thanks!

> 
> 
>> ---
>> Changes since v2:
>>   - add more useful commit mesage.
>>   - rename "compat-cache" to "l3-cache-shared".
>>
>> Changes since v1:
>>   - fix the compat problem: set compat_props on PC_COMPAT_2_7.
>>   - fix a "intentionally introducde bug": make intel's and amd's consistently.
>>   - fix the CPUID.(EAX=4, ECX=3):EAX[25:14].
>>   - test the performance if vcpus running on sparate sockets: with L3 cache,
>>     the performance improves 7.2%~33.1%(avg: 15.7%).
>> ---
>>  include/hw/i386/pc.h |  8 ++++++++
>>  target-i386/cpu.c    | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>>  target-i386/cpu.h    |  5 +++++
>>  3 files changed, 57 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 74c175c..c92c54e 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -367,7 +367,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
>>  int e820_get_num_entries(void);
>>  bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>>  
>> +#define PC_COMPAT_2_7 \
>> +    {\
>> +        .driver   = TYPE_X86_CPU,\
>> +        .property = "l3-cache-shared",\
>> +        .value    = "off",\
>> +    },
>> +
>>  #define PC_COMPAT_2_6 \
>> +    PC_COMPAT_2_7 \
>>      HW_COMPAT_2_6 \
>>      {\
>>          .driver   = "fw_cfg_io",\
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 6a1afab..4f93922 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -57,6 +57,7 @@
>>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
>> +#define CPUID_2_L3_16MB_16WAY_64B 0x4d
>>  
>>  
>>  /* CPUID Leaf 4 constants: */
>> @@ -131,11 +132,18 @@
>>  #define L2_LINES_PER_TAG       1
>>  #define L2_SIZE_KB_AMD       512
>>  
>> -/* No L3 cache: */
>> +/* Level 3 unified cache: */
>>  #define L3_SIZE_KB             0 /* disabled */
>>  #define L3_ASSOCIATIVITY       0 /* disabled */
>>  #define L3_LINES_PER_TAG       0 /* disabled */
>>  #define L3_LINE_SIZE           0 /* disabled */
>> +#define L3_N_LINE_SIZE         64
>> +#define L3_N_ASSOCIATIVITY     16
>> +#define L3_N_SETS           16384
>> +#define L3_N_PARTITIONS         1
>> +#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
>> +#define L3_N_LINES_PER_TAG      1
>> +#define L3_N_SIZE_KB_AMD    16384
>>  
>>  /* TLB definitions: */
>>  
>> @@ -2275,6 +2283,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>  {
>>      X86CPU *cpu = x86_env_get_cpu(env);
>>      CPUState *cs = CPU(cpu);
>> +    uint32_t pkg_offset;
>>  
>>      /* test if maximum index reached */
>>      if (index & 0x80000000) {
>> @@ -2328,7 +2337,11 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          }
>>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>>          *ebx = 0;
>> -        *ecx = 0;
>> +        if (!cpu->enable_l3_cache_shared) {
>> +            *ecx = 0;
>> +        } else {
>> +            *ecx = L3_N_DESCRIPTOR;
>> +        }
>>          *edx = (L1D_DESCRIPTOR << 16) | \
>>                 (L1I_DESCRIPTOR <<  8) | \
>>                 (L2_DESCRIPTOR);
>> @@ -2374,6 +2387,25 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>                  *ecx = L2_SETS - 1;
>>                  *edx = CPUID_4_NO_INVD_SHARING;
>>                  break;
>> +            case 3: /* L3 cache info */
>> +                if (!cpu->enable_l3_cache_shared) {
>> +                    *eax = 0;
>> +                    *ebx = 0;
>> +                    *ecx = 0;
>> +                    *edx = 0;
>> +                    break;
>> +                }
>> +                *eax |= CPUID_4_TYPE_UNIFIED | \
>> +                        CPUID_4_LEVEL(3) | \
>> +                        CPUID_4_SELF_INIT_LEVEL;
>> +                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
>> +                *eax |= ((1 << pkg_offset) - 1) << 14;
>> +                *ebx = (L3_N_LINE_SIZE - 1) | \
>> +                       ((L3_N_PARTITIONS - 1) << 12) | \
>> +                       ((L3_N_ASSOCIATIVITY - 1) << 22);
>> +                *ecx = L3_N_SETS - 1;
>> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
>> +                break;
>>              default: /* end of info */
>>                  *eax = 0;
>>                  *ebx = 0;
>> @@ -2585,9 +2617,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
>> -        *edx = ((L3_SIZE_KB/512) << 18) | \
>> -               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> -               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        if (!cpu->enable_l3_cache_shared) {
>> +            *edx = ((L3_SIZE_KB / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>> +                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>> +        } else {
>> +            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
>> +                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
>> +                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
>> +        }
>>          break;
>>      case 0x80000007:
>>          *eax = 0;
>> @@ -3364,6 +3402,7 @@ static Property x86_cpu_properties[] = {
>>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
>> +    DEFINE_PROP_BOOL("l3-cache-shared", X86CPU, enable_l3_cache_shared, true),
>>      DEFINE_PROP_END_OF_LIST()
>>  };
>>  
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 65615c0..355bf47 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -1202,6 +1202,11 @@ struct X86CPU {
>>       */
>>      bool enable_lmce;
>>  
>> +    /* Compatibility bits for old machine types.
>> +     * If true present virtual l3 cache for VM.
> 
> "pretend that all CPUs share an l3 cache"?
> 

The vcpus in the same virtual socket share a virtual L3 cache.
I will make this clearer later.

QEMU 2.7 has been released, so I will rework this patch for 2.8 later.

> 
>> +     */
>> +    bool enable_l3_cache_shared;
>> +
>>      /* Compatibility bits for old machine types: */
>>      bool enable_cpuid_0xb;
>>  
>> -- 
>> 1.8.3.1
>>
> 
> .
>
Eduardo Habkost Sept. 5, 2016, 6:53 p.m. UTC | #4
On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
[...]
> ---
> Changes since v2:
>   - add more useful commit mesage.
>   - rename "compat-cache" to "l3-cache-shared".

What exactly does "shared" mean here? All the property does is
enable/disable the L3 cache, as its own description says:

> 
> +    /* Compatibility bits for old machine types.
> +     * If true present virtual l3 cache for VM.
> +     */
> +    bool enable_l3_cache_shared;
> +

Why not just "l3-cache" or "l3-cache-enabled"?
Longpeng(Mike) Sept. 6, 2016, 12:31 a.m. UTC | #5
Hi Eduardo,

On 2016/9/6 2:53, Eduardo Habkost wrote:

> On Fri, Sep 02, 2016 at 10:22:55AM +0800, Longpeng(Mike) wrote:
> [...]
>> ---
>> Changes since v2:
>>   - add more useful commit mesage.
>>   - rename "compat-cache" to "l3-cache-shared".
> 
> What exactly "shared" means here? All the property does is to
> enable/disable the L3 cache, as its own description says:
> 
>>
>> +    /* Compatibility bits for old machine types.
>> +     * If true present virtual l3 cache for VM.
>> +     */
>> +    bool enable_l3_cache_shared;
>> +
> 
> Why not just "l3-cache" or "l3-cache-enabled"?
> 

I originally wanted to fix the l1/l2 inconsistency bugs together, so I named it
"compat-cache". But later I thought it would be too ugly to add many l1/l2
compat macros, so I gave up on that idea and renamed it to "l3-cache-shared"
instead.

Thanks for your good suggestion. I will choose "l3-cache-enabled" in the v5.

Patch

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 74c175c..c92c54e 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -367,7 +367,15 @@  int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_7 \
+    {\
+        .driver   = TYPE_X86_CPU,\
+        .property = "l3-cache-shared",\
+        .value    = "off",\
+    },
+
 #define PC_COMPAT_2_6 \
+    PC_COMPAT_2_7 \
     HW_COMPAT_2_6 \
     {\
         .driver   = "fw_cfg_io",\
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 6a1afab..4f93922 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -57,6 +57,7 @@ 
 #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
 #define CPUID_2_L1I_32KB_8WAY_64B 0x30
 #define CPUID_2_L2_2MB_8WAY_64B   0x7d
+#define CPUID_2_L3_16MB_16WAY_64B 0x4d
 
 
 /* CPUID Leaf 4 constants: */
@@ -131,11 +132,18 @@ 
 #define L2_LINES_PER_TAG       1
 #define L2_SIZE_KB_AMD       512
 
-/* No L3 cache: */
+/* Level 3 unified cache: */
 #define L3_SIZE_KB             0 /* disabled */
 #define L3_ASSOCIATIVITY       0 /* disabled */
 #define L3_LINES_PER_TAG       0 /* disabled */
 #define L3_LINE_SIZE           0 /* disabled */
+#define L3_N_LINE_SIZE         64
+#define L3_N_ASSOCIATIVITY     16
+#define L3_N_SETS           16384
+#define L3_N_PARTITIONS         1
+#define L3_N_DESCRIPTOR CPUID_2_L3_16MB_16WAY_64B
+#define L3_N_LINES_PER_TAG      1
+#define L3_N_SIZE_KB_AMD    16384
 
 /* TLB definitions: */
 
@@ -2275,6 +2283,7 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 {
     X86CPU *cpu = x86_env_get_cpu(env);
     CPUState *cs = CPU(cpu);
+    uint32_t pkg_offset;
 
     /* test if maximum index reached */
     if (index & 0x80000000) {
@@ -2328,7 +2337,11 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         }
         *eax = 1; /* Number of CPUID[EAX=2] calls required */
         *ebx = 0;
-        *ecx = 0;
+        if (!cpu->enable_l3_cache_shared) {
+            *ecx = 0;
+        } else {
+            *ecx = L3_N_DESCRIPTOR;
+        }
         *edx = (L1D_DESCRIPTOR << 16) | \
                (L1I_DESCRIPTOR <<  8) | \
                (L2_DESCRIPTOR);
@@ -2374,6 +2387,25 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
                 *ecx = L2_SETS - 1;
                 *edx = CPUID_4_NO_INVD_SHARING;
                 break;
+            case 3: /* L3 cache info */
+                if (!cpu->enable_l3_cache_shared) {
+                    *eax = 0;
+                    *ebx = 0;
+                    *ecx = 0;
+                    *edx = 0;
+                    break;
+                }
+                *eax |= CPUID_4_TYPE_UNIFIED | \
+                        CPUID_4_LEVEL(3) | \
+                        CPUID_4_SELF_INIT_LEVEL;
+                pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
+                *eax |= ((1 << pkg_offset) - 1) << 14;
+                *ebx = (L3_N_LINE_SIZE - 1) | \
+                       ((L3_N_PARTITIONS - 1) << 12) | \
+                       ((L3_N_ASSOCIATIVITY - 1) << 22);
+                *ecx = L3_N_SETS - 1;
+                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
+                break;
             default: /* end of info */
                 *eax = 0;
                 *ebx = 0;
@@ -2585,9 +2617,15 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         *ecx = (L2_SIZE_KB_AMD << 16) | \
                (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
                (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
-        *edx = ((L3_SIZE_KB/512) << 18) | \
-               (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
-               (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
+        if (!cpu->enable_l3_cache_shared) {
+            *edx = ((L3_SIZE_KB / 512) << 18) | \
+                   (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
+                   (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
+        } else {
+            *edx = ((L3_N_SIZE_KB_AMD / 512) << 18) | \
+                   (AMD_ENC_ASSOC(L3_N_ASSOCIATIVITY) << 12) | \
+                   (L3_N_LINES_PER_TAG << 8) | (L3_N_LINE_SIZE);
+        }
         break;
     case 0x80000007:
         *eax = 0;
@@ -3364,6 +3402,7 @@  static Property x86_cpu_properties[] = {
     DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
     DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
     DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
+    DEFINE_PROP_BOOL("l3-cache-shared", X86CPU, enable_l3_cache_shared, true),
     DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 65615c0..355bf47 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1202,6 +1202,11 @@  struct X86CPU {
      */
     bool enable_lmce;
 
+    /* Compatibility bits for old machine types.
+     * If true present virtual l3 cache for VM.
+     */
+    bool enable_l3_cache_shared;
+
     /* Compatibility bits for old machine types: */
     bool enable_cpuid_0xb;
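
As a sanity check on the constants added above, the cache size implied by the
CPUID leaf 4 fields (ways * partitions * line size * sets) should equal the
16 MB named by the leaf 2 descriptor and by `L3_N_SIZE_KB_AMD`. The sketch
below recomputes it, along with the 0x80000006 EDX word. The function names
are hypothetical, and the associativity code 0x8 for 16-way is an assumption
taken from AMD's L2/L3 associativity encoding rather than from this patch.

```c
#include <stdint.h>

/* Cache size in bytes from CPUID leaf 4 style geometry. */
static uint64_t cache_size_bytes(uint32_t ways, uint32_t partitions,
                                 uint32_t line_size, uint32_t sets)
{
    return (uint64_t)ways * partitions * line_size * sets;
}

/*
 * CPUID 0x80000006 EDX layout as used in the patch:
 *   [31:18] size in 512 KB units, [15:12] associativity code,
 *   [11:8]  lines per tag,        [7:0]   line size in bytes.
 */
static uint32_t amd_l3_edx(uint32_t size_kb, uint32_t assoc_code,
                           uint32_t lines_per_tag, uint32_t line_size)
{
    return ((size_kb / 512) << 18) | (assoc_code << 12) |
           (lines_per_tag << 8) | line_size;
}
```

`cache_size_bytes(16, 1, 64, 16384)` gives 16777216 bytes (16 MB), and
`amd_l3_edx(16384, 0x8, 1, 64)` gives 0x808140.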