libxc/x86: avoid overflow in CPUID APIC ID adjustments

Message ID b080fa0f-08d2-34d0-3f54-549e1303eeb4@suse.com (mailing list archive)
State Superseded
Series libxc/x86: avoid overflow in CPUID APIC ID adjustments

Commit Message

Jan Beulich Sept. 19, 2019, 11:48 a.m. UTC
Recent AMD processors may report up to 128 logical processors in CPUID
leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
as the respective field is only 8 bits wide. Suppress doubling the value
(and its leaf 0x80000008 counterpart) in such a case.
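
For illustration, here is the pre-patch leaf 1 adjustment (the code
removed by the last hunk of the patch below) applied to such a host
value - the 0x007f0000 mask drops bit 23 ahead of the shift, so the
field ends up all zero:

    uint32_t ebx = 0x00800000u;  /* EBX[23:16] = 0x80: 128 logical CPUs */

    /* Pre-patch doubling of EBX[23:16]. */
    ebx = (ebx & 0x0000ffffu) | ((ebx & 0x007f0000u) << 1);

    /* EBX[23:16] is now 0. */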

Additionally don't even do any adjustment when the host topology already
includes room for multiple threads per core.

Furthermore don't double the Maximum Cores Per Package at all - by us
introducing a fake HTT effect, the core count doesn't need to change.
Instead adjust the Maximum Logical Processors Sharing Cache field, which
so far was zapped altogether.
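
Concretely, with the replacement logic from the leaf 4 hunk below (the
sharing-cache value here is made up):

    uint32_t eax = 1u << 14;  /* EAX[25:14] = 1: 2 LPs share the cache */

    /* Set bit 13 as the incoming low bit, shift, and leave EAX[31:26]
     * (Maximum Cores) as well as EAX[9:0] untouched. */
    eax = (((eax | 0x00002000u) << 1) & 0x03ffc000u) | (eax & 0xfc0003ffu);

    /* EAX[25:14] is now 3: 4 logical processors sharing the cache. */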

Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Using xc_physinfo() output here needs a better solution. The
     threads_per_core value returned is the count of active siblings of
     CPU 0, rather than a value applicable system-wide (and constant
     over the entire session). Using CPUID output (leaves 4 and
     8000001e) doesn't look viable either, since on PVH these wouldn't
     really be the host values. Judging from the host feature set's HTT
     flag also wouldn't tell us whether there actually are multiple
     threads per core.
TBD: The adjustment of Maximum Logical Processors Sharing Cache should
     perhaps occur only if an adjustment to leaf 1 has occurred.

Comments

Andrew Cooper Sept. 20, 2019, 10:05 a.m. UTC | #1
On 19/09/2019 12:48, Jan Beulich wrote:
> Recent AMD processors may report up to 128 logical processors in CPUID
> leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
> as the respective field is only 8 bits wide. Suppress doubling the value
> (and its leaf 0x80000008 counterpart) in such a case.
>
> Additionally don't even do any adjustment when the host topology already
> includes room for multiple threads per core.
>
> Furthermore don't double the Maximum Cores Per Package at all - by us
> introducing a fake HTT effect, the core count doesn't need to change.
> Instead adjust the Maximum Logical Processors Sharing Cache field, which
> so far was zapped altogether.
>
> Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Using xc_physinfo() output here needs a better solution. The
>      threads_per_core value returned is the count of active siblings of
>      CPU 0, rather than a value applicable system-wide (and constant
>      over the entire session). Using CPUID output (leaves 4 and
>      8000001e) doesn't look viable either, since on PVH these wouldn't
>      really be the host values. Judging from the host feature set's HTT
>      flag also wouldn't tell us whether there actually are multiple
>      threads per core.

The key thing is that HTT != "more than one thread per core".  HTT is
strictly a bit indicating that topology information is available in a
new form in the CPUID leaves (or in AMD's case, that the same
information should be interpreted in a new way).  Just because HTT is
set (and it does get set on non-HT capable systems) doesn't mean there
is space for more than one thread per core in the topology information.
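
For illustration, a minimal native check of the two fields involved (a
sketch using GCC's <cpuid.h>; on PV/PVH the values seen this way aren't
the raw host ones, as per the TBD remark):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if ( !__get_cpuid(1, &eax, &ebx, &ecx, &edx) )
            return 1;

        /* CPUID.1:EDX[28] - HTT: only says EBX[23:16] is valid. */
        unsigned int htt = (edx >> 28) & 1;
        /* CPUID.1:EBX[23:16] - addressable LPs per package. */
        unsigned int lps = (ebx >> 16) & 0xff;

        /* HTT set together with lps equal to the core count is what
         * non-HT capable systems commonly report. */
        printf("HTT=%u LPs/package=%u\n", htt, lps);
        return 0;
    }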

For PV guests, my adjustment in the CPUID series shows (what I believe
to be) the only correct way of propagating the host HTT/CMP_LEGACY
settings through.

For HVM guests, it really shouldn't have anything to do with the host
setting.  We should be choosing how many threads/core to give to the
guest, then constructing the topology information from first principles.
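
Schematically (a sketch only; all names made up):

    /* Toolstack-chosen guest topology, not read from the host. */
    unsigned int vthreads_per_core = 2, vcores_per_pkg = 4;
    unsigned int vlps = vthreads_per_core * vcores_per_pkg;

    /* Leaf 1 EBX[23:16] then follows from that choice. */
    regs[1] = (regs[1] & 0xff00ffffu) | ((vlps & 0xffu) << 16);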

Ignore the PVH case.  It is totally broken for several other reasons as
well, and PVH Dom0 isn't a production-ready thing yet.

This gets us back to the PV case where the host information is actually
in view, and (for backport purposes) can be trusted.

~Andrew
Jan Beulich Sept. 20, 2019, 10:20 a.m. UTC | #2
On 20.09.2019 12:05, Andrew Cooper wrote:
> On 19/09/2019 12:48, Jan Beulich wrote:
>> Recent AMD processors may report up to 128 logical processors in CPUID
>> leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
>> as the respective field is only 8 bits wide. Suppress doubling the value
>> (and its leaf 0x80000008 counterpart) in such a case.
>>
>> Additionally don't even do any adjustment when the host topology already
>> includes room for multiple threads per core.
>>
>> Furthermore don't double the Maximum Cores Per Package at all - by us
>> introducing a fake HTT effect, the core count doesn't need to change.
>> Instead adjust the Maximum Logical Processors Sharing Cache field, which
>> so far was zapped altogether.
>>
>> Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> TBD: Using xc_physinfo() output here needs a better solution. The
>>      threads_per_core value returned is the count of active siblings of
>>      CPU 0, rather than a value applicable system-wide (and constant
>>      over the entire session). Using CPUID output (leaves 4 and
>>      8000001e) doesn't look viable either, since on PVH these wouldn't
>>      really be the host values. Judging from the host feature set's HTT
>>      flag also wouldn't tell us whether there actually are multiple
>>      threads per core.
> 
> The key thing is that HTT != "more than one thread per core".  HTT is
> strictly a bit indicating that topology information is available in a
> new form in the CPUID leaves (or in AMD's case, that the same
> information should be interpreted in a new way).  Just because HTT is
> set (and it does get set on non-HT capable systems) doesn't mean there
> is space for more than one thread per core in the topology information.
> 
> For PV guests, my adjustment in the CPUID series shows (what I believe
> to be) the only correct way of propagating the host HTT/CMP_LEGACY
> settings through.
> 
> For HVM guests, it really shouldn't have anything to do with the host
> setting.  We should be choosing how many threads/core to give to the
> guest, then constructing the topology information from first principles.
> 
> Ignore the PVH case.  It is totally broken for several other reasons as
> well, and PVH Dom0 isn't a production-ready thing yet.
> 
> This gets us back to the PV case where the host information is actually
> in view, and (for backport purposes) can be trusted.

Okay, this means I'll revive and finish the half-done cpuid() based
attempt I had made initially. A fundamental question from your reply
remains open though: Do you agree with the idea of avoiding the
multiplication by 2 if the host topology already provides at least one
bit of thread ID within the APIC ID? Related to that, do you also agree
with my approach of ditching the adjustment to Maximum Cores Per
Package? (I'm asking because I'd like to avoid several more round trips
of the patch itself.)
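
For the AMD side, such a check could look like the sketch below (leaf
0x8000001e, Zen layout assumed, and subject to the PVH caveat from the
TBD remark):

    unsigned int eax, ebx, ecx, edx;

    /* Fn8000_001E EBX[15:8] is ThreadsPerCore - 1 on Zen; non-zero
     * means the APIC ID already carries a thread-ID bit. */
    if ( __get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx) &&
         ((ebx >> 8) & 0xffu) )
        /* the multiplication by 2 could then be skipped */;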

Jan
Andrew Cooper Sept. 20, 2019, 12:40 p.m. UTC | #3
On 20/09/2019 11:20, Jan Beulich wrote:
> On 20.09.2019 12:05, Andrew Cooper wrote:
>> On 19/09/2019 12:48, Jan Beulich wrote:
>>> Recent AMD processors may report up to 128 logical processors in CPUID
>>> leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
>>> as the respective field is only 8 bits wide. Suppress doubling the value
>>> (and its leaf 0x80000008 counterpart) in such a case.
>>>
>>> Additionally don't even do any adjustment when the host topology already
>>> includes room for multiple threads per core.
>>>
>>> Furthermore don't double the Maximum Cores Per Package at all - by us
>>> introducing a fake HTT effect, the core count doesn't need to change.
>>> Instead adjust the Maximum Logical Processors Sharing Cache field, which
>>> so far was zapped altogether.
>>>
>>> Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> TBD: Using xc_physinfo() output here needs a better solution. The
>>>      threads_per_core value returned is the count of active siblings of
>>>      CPU 0, rather than a value applicable system-wide (and constant
>>>      over the entire session). Using CPUID output (leaves 4 and
>>>      8000001e) doesn't look viable either, since on PVH these wouldn't
>>>      really be the host values. Judging from the host feature set's HTT
>>>      flag also wouldn't tell us whether there actually are multiple
>>>      threads per core.
>> The key thing is that HTT != "more than one thread per core".  HTT is
>> strictly a bit indicating that topology information is available in a
>> new form in the CPUID leaves (or in AMD's case, that the same
>> information should be interpreted in a new way).  Just because HTT is
>> set (and it does get set on non-HT capable systems) doesn't mean there
>> is space for more than one thread per core in the topology information.
>>
>> For PV guests, my adjustment in the CPUID series shows (what I believe
>> to be) the only correct way of propagating the host HTT/CMP_LEGACY
>> settings through.
>>
>> For HVM guests, it really shouldn't have anything to do with the host
>> setting.  We should be choosing how many threads/core to give to the
>> guest, then constructing the topology information from first principles.
>>
>> Ignore the PVH case.  It is totally broken for several other reasons as
>> well, and PVH Dom0 isn't a production-ready thing yet.
>>
>> This gets us back to the PV case where the host information is actually
>> in view, and (for backport purposes) can be trusted.
> Okay, this means I'll revive and finish the half-done cpuid() based
> attempt I had made initially. A fundamental question from your reply
> remains open though: Do you agree with the idea of avoiding the
> multiplication by 2 if the host topology already provides at least one
> bit of thread ID within the APIC ID?

In theory, yes.  In practice, I'd err on the side of not.

A further problem with CPUID handling is that it is recalculated from
scratch even after migrate.  Therefore, any changes to the algorithm
will cause inconsistencies to be seen in the guest across
migrate/upgrade.  This problem becomes substantially worse if the patch
is backported to stable trees.

Now that get_cpu_policy has existed for a little while, and
set_cpu_policy is imminent, fixing the "CPUID changes across migrate"
problem is almost doable, and is on the plan for toolstack work.

That said, ultimately, anything "pre 4.14" => "4.14" is going to hit a
discontinuity, because there is information discarded on the source side
which can't be reconstructed on the destination.

Overall, I would suggest doing the absolute minimum change required to
unbreak Rome CPUs.  Everything further is going to cause differences
across migrate.

In 4.14, I think we can reasonably fix all of:
1) CPUID data discarded for migrate
2) domain builder uses native CPUID
3) topology handling isn't consistent with SDM/APM

all of which is libxc/libxl work, once set_cpu_policy() is in place.

~Andrew
Jan Beulich Sept. 20, 2019, 1:13 p.m. UTC | #4
On 20.09.2019 14:40, Andrew Cooper wrote:
> Overall, I would suggest doing the absolute minimum change required to
> unbreak Rome CPUs.

Well, okay, I'm zapping everything else, but it feels wrong to
leave in place the similar overflow in intel_xc_cpuid_policy().
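
For reference, that overflow (the pre-patch leaf 4 handling, still
visible as the removed lines in the hunk below; the host value is made
up) - EAX bit 31 is masked off ahead of the shift:

    uint32_t eax = 0x80000000u | 0x3ffu;  /* Maximum Cores field = 0x20 */

    /* Pre-patch doubling of EAX[31:26]. */
    eax = ((eax & 0x7c000000u) << 1) | 0x04000000u | (eax & 0x3ffu);

    /* EAX[31:26] is now 1: 2 cores instead of the doubled 66. */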

Jan

Patch

--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -251,6 +251,8 @@  struct cpuid_domain_info
     uint32_t *featureset;
     unsigned int nr_features;
 
+    bool host_htt;
+
     /* PV-only information. */
     bool pv64;
 
@@ -290,6 +292,7 @@  static int get_cpuid_domain_info(xc_inte
     xc_dominfo_t di;
     unsigned int in[2] = { 0, ~0U }, regs[4];
     unsigned int i, host_nr_features = xc_get_cpu_featureset_size();
+    xc_physinfo_t physinfo;
     int rc;
 
     cpuid(in, regs);
@@ -343,6 +346,10 @@  static int get_cpuid_domain_info(xc_inte
 
     info->xfeature_mask = domctl.u.vcpuextstate.xfeature_mask;
 
+    rc = xc_physinfo(xch, &physinfo);
+    if ( !rc && physinfo.threads_per_core > 1 )
+        info->host_htt = true;
+
     if ( di.hvm )
     {
         uint64_t val;
@@ -385,7 +392,7 @@  static void amd_xc_cpuid_policy(const st
     {
     case 0x00000002:
     case 0x00000004:
-        regs[0] = regs[1] = regs[2] = 0;
+        regs[0] = regs[1] = regs[2] = regs[3] = 0;
         break;
 
     case 0x80000000:
@@ -395,11 +402,25 @@  static void amd_xc_cpuid_policy(const st
 
     case 0x80000008:
         /*
-         * ECX[15:12] is ApicIdCoreSize: ECX[7:0] is NumberOfCores (minus one).
-         * Update to reflect vLAPIC_ID = vCPU_ID * 2.
+         * ECX[15:12] is ApicIdCoreSize.
+         * ECX[7:0] is NumberOfCores (minus one).
+         */
+        if ( info->host_htt )
+        {
+            regs[2] &= 0xf0ffu;
+            break;
+        }
+        /*
+         * Update to reflect vLAPIC_ID = vCPU_ID * 2.  But make sure to avoid
+         * - overflow,
+         * - going out of sync with leaf 1 EBX[23:16],
+         * - incrementing ApicIdCoreSize when it's zero (which changes the
+         *   meaning of bits 7:0).
          */
-        regs[2] = ((regs[2] + (1u << 12)) & 0xf000u) |
-                  ((regs[2] & 0xffu) << 1) | 1u;
+        if ( (regs[2] & 0xf000u) && (regs[2] & 0xf000u) != 0xf000u )
+            regs[2] = ((regs[2] + (1u << 12)) & 0xf000u) | (regs[2] & 0xffu);
+        if ( (regs[2] & 0x7fu) < 0x7fu )
+            regs[2] = (regs[2] & 0xf000u) | ((regs[2] & 0x7fu) << 1) | 1u;
         break;
 
     case 0x8000000a: {
@@ -442,10 +463,19 @@  static void intel_xc_cpuid_policy(const
     case 0x00000004:
         /*
          * EAX[31:26] is Maximum Cores Per Package (minus one).
-         * Update to reflect vLAPIC_ID = vCPU_ID * 2.
+         * EAX[25:14] is Maximum Logical Processors Sharing Cache (minus one).
          */
-        regs[0] = (((regs[0] & 0x7c000000u) << 1) | 0x04000000u |
-                   (regs[0] & 0x3ffu));
+        if ( info->host_htt )
+            regs[0] &= 0xffffc3ffu;
+        else
+        {
+            /*
+             * Update to reflect vLAPIC_ID = vCPU_ID * 2.  Note that overflow
+             * is sufficiently benign here.
+             */
+            regs[0] = (((regs[0] | 0x00002000u) << 1) & 0x03ffc000u) |
+                      (regs[0] & 0xfc0003ffu);
+        }
         regs[3] &= 0x3ffu;
         break;
 
@@ -478,9 +508,13 @@  static void xc_cpuid_hvm_policy(const st
     case 0x00000001:
         /*
          * EBX[23:16] is Maximum Logical Processors Per Package.
-         * Update to reflect vLAPIC_ID = vCPU_ID * 2.
+         * Update to reflect vLAPIC_ID = vCPU_ID * 2, but make sure to avoid
+         * overflow.
          */
-        regs[1] = (regs[1] & 0x0000ffffu) | ((regs[1] & 0x007f0000u) << 1);
+        if ( !info->host_htt && !(regs[1] & 0x00800000u) )
+            regs[1] = (regs[1] & 0x0000ffffu) | ((regs[1] & 0x007f0000u) << 1);
+        else
+            regs[1] &= 0x00ffffffu;
 
         regs[2] = info->featureset[featureword_of(X86_FEATURE_SSE3)];
         regs[3] = (info->featureset[featureword_of(X86_FEATURE_FPU)] |
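
As a worked check of the new 0x80000008 logic above, on a saturated
part (illustrative values: ApicIdCoreSize = 7, NumberOfCores - 1 =
0x7f):

    uint32_t ecx = (7u << 12) | 0x7fu;  /* 0x707f */

    /* ApicIdCoreSize is neither 0 nor saturated: widen it by one bit. */
    if ( (ecx & 0xf000u) && (ecx & 0xf000u) != 0xf000u )
        ecx = ((ecx + (1u << 12)) & 0xf000u) | (ecx & 0xffu);

    /* NumberOfCores - 1 is already 0x7f, so the doubling is suppressed
     * and the 8-bit field can't go out of sync with leaf 1. */
    if ( (ecx & 0x7fu) < 0x7fu )
        ecx = (ecx & 0xf000u) | ((ecx & 0x7fu) << 1) | 1u;

    /* ecx == 0x807f: ApicIdCoreSize 8, core count left alone. */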