diff mbox series

x86/AMD: correct certain Fam17 checks

Message ID 5C90B36D02000078002202AE@prv1-mh.provo.novell.com (mailing list archive)
State Superseded
Headers show
Series x86/AMD: correct certain Fam17 checks | expand

Commit Message

Jan Beulich March 19, 2019, 9:16 a.m. UTC
Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
families") converted certain checks for Fam11 to include families all
the way up to Fam17. The commit having no description, it is hard to
tell whether this was a mechanical dec->hex conversion mistake, or
indeed intended. In any event the NB_CFG handling needs to be restricted
to Fam16 and below: Fam17 doesn't have such an MSR anymore.

A non-MMCFG extended config space access mechanism still appears to
exist, but code to deal with it will need to be written down the road,
when it can actually be tested.

Reported-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm also not sure whether e.g. init_amd()'s C1E disabling is still
applicable to Fam17.

Comments

Andrew Cooper March 22, 2019, 7:56 p.m. UTC | #1
On 19/03/2019 09:16, Jan Beulich wrote:
> Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
> families") converted certain checks for Fam11 to include families all
> the way up to Fam17. The commit having no description, it is hard to
> tell whether this was a mechanical dec->hex conversion mistake, or
> indeed intended. In any event the NB_CFG handling needs to be restricted
> to Fam16 and below: Fam17 doesn't have such an MSR anymore.
>
> A non-MMCFG extended config space access mechanism still appears to
> exist, but code to deal with it will need to be written down the road,
> when it can actually be tested.
>
> Reported-by: Pu Wen <puwen@hygon.cn>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Having looked through various spec documents, this is a chronic mess.

First, to NB_CTL MSR itself.  It certainly is documented in Fam15, and
is absence in the documentation of Fam17.

In Fam15, it is explicitly documented as an alias of
00:18.3[NB_CFG_LOW/HIGH] which are registers at offset 0x88 and 0x8c in
config space.

Fam17 documents that the extended cfc/cf8 mechanism does still exist,
and the new controls for that found in 00:18.4[CoreMasterAccessCtrl]
with a different bit layout.

Experimentation on a Rome system indicates that NB_CTL is fully
read0/write discard, so this patch is probably an improvement.

Therefore, in principle, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

However, the actual code touched is completely insane.

HVM guests have it automatically read0/write discard, even for Intel
(citing cross vendor migration).

PV guest handling is complicated.  For CPUs without the MSR, it is read
#GP, write discard.  For CPUs which do have the MSR, it is read0/write
discard for domU or nonpinned dom0, which is 100% of usecases.  The PV
vs HVM differences cause an asymetry for the hardware domain.

A pinned PV dom0 is permitted to change just the IO_ECS bit, eliciting a
warning but no #GP for modifying other bits.  As NB_CTL is a per-node
control (not a per-core control), unless dom0 vcpus == host pcpus, this
creates an asymmetry across the system as to whether IO_ECS is enabled
or not.

The HVM IOREQ path, and PV cfg_ok() path, when seeing an IO_ECS access
from the guest, reads the MSR every time, which is results in behaviour
which doesn't match the settings a guest can see, and comes with a
massive perf hit.  It also means the behaviour of the guest IO depends
on which node it is currently scheduled on.


Moving on to Linux...

Linux tries to enable IO_ECS on all AMD and HYGON CPUs >= Fam10.

One path tries to enable things using PCI config space, and the PCI
device list includes Fam17h and HYGON northbridges, but the code only
operates on the Fam15h information, meaning that it clobbers what
appears to be a reserved register on Fam17/Hygon hardware.

Linux also uses NB_CTL in all circumstances, but given hardware's
read0/write discard behaviour, Linux fails to notice that it doesn't
enable IO_ECS at all.

Nothing ever reads back settings to check whether IO_ECS has been
correctly enabled, which means that it really isn't enabled on
Fam17h/Hygon, and really isn't when running under Xen, but Linux is
under the expectation that it is enabled.

In practice, this means that the use of raw_pci_ext_ops between being
set up in pci_direct_init() and possibly overridden in pci_mmcfg_init().


Back to Xen...

The IO_ECS setting should be chosen once at boot, made consistent across
the entire system, and never touched at runtime.

In all cases for guests, we can offer MMCFG even on a system which
doesn't have IO_ECS, and they will prefer that.  The behaviour of the
extra 4 bits is reserved, so we could have IO_ECS working in practice
with no signal.  However, we could equally drop IO_ECS entirely.  Guests
can't find its setting to begin with, so can't reliably use it, and also
wouldn't in the presence of MMCFG anyway.

~Andrew
Jan Beulich March 25, 2019, 8:41 a.m. UTC | #2
>>> On 22.03.19 at 20:56, <andrew.cooper3@citrix.com> wrote:
> On 19/03/2019 09:16, Jan Beulich wrote:
>> Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
>> families") converted certain checks for Fam11 to include families all
>> the way up to Fam17. The commit having no description, it is hard to
>> tell whether this was a mechanical dec->hex conversion mistake, or
>> indeed intended. In any event the NB_CFG handling needs to be restricted
>> to Fam16 and below: Fam17 doesn't have such an MSR anymore.
>>
>> A non-MMCFG extended config space access mechanism still appears to
>> exist, but code to deal with it will need to be written down the road,
>> when it can actually be tested.
>>
>> Reported-by: Pu Wen <puwen@hygon.cn>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Having looked through various spec documents, this is a chronic mess.
> 
> First, to NB_CTL MSR itself.  It certainly is documented in Fam15, and
> is absence in the documentation of Fam17.
> 
> In Fam15, it is explicitly documented as an alias of
> 00:18.3[NB_CFG_LOW/HIGH] which are registers at offset 0x88 and 0x8c in
> config space.
> 
> Fam17 documents that the extended cfc/cf8 mechanism does still exist,
> and the new controls for that found in 00:18.4[CoreMasterAccessCtrl]
> with a different bit layout.
> 
> Experimentation on a Rome system indicates that NB_CTL is fully
> read0/write discard, so this patch is probably an improvement.

Hmm, if it's read-zero / write-discard, then I guess we would better
emulate that behavior. I simply didn't expect it to be that way, as
I would generally assume undocumented _and_ unused MSRs to
raise #GP. In which case the simplest thing to do would be to drop
the one respective hunk changing emul-priv-op.c:write_msr().

> Therefore, in principle, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Well, even without the point above I wouldn't be sure whether
using an "in principle" ack as justification for committing is
appropriate. So perhaps (see above and below) we'd better
settle on something that you could agree to with less
reservations (by directly adjusting this patch, or by adding a
2nd one on top).

> However, the actual code touched is completely insane.
> 
> HVM guests have it automatically read0/write discard, even for Intel
> (citing cross vendor migration).
> 
> PV guest handling is complicated.  For CPUs without the MSR, it is read
> #GP, write discard.  For CPUs which do have the MSR, it is read0/write
> discard for domU or nonpinned dom0, which is 100% of usecases.  The PV
> vs HVM differences cause an asymetry for the hardware domain.

So how about we make it uniformly read-zero / write-discard, as long
as the emulated CPUID has vendor AMD and family 0x10 and higher?
Or even independent of vendor and family, to fully match current
HVM behavior?

> A pinned PV dom0 is permitted to change just the IO_ECS bit, eliciting a
> warning but no #GP for modifying other bits.  As NB_CTL is a per-node
> control (not a per-core control), unless dom0 vcpus == host pcpus, this
> creates an asymmetry across the system as to whether IO_ECS is enabled
> or not.

Correct, but there's no inconsistency from Dom0's POV. Furthermore,
as you explain further down, the PCI config space method of changing
the bit was specifically added to Linux to avoid this inconsistency.

> The HVM IOREQ path, and PV cfg_ok() path, when seeing an IO_ECS access
> from the guest, reads the MSR every time, which is results in behaviour
> which doesn't match the settings a guest can see, and comes with a
> massive perf hit.

Indeed it should be the guest view of the MSR which ought to be used
there.

>  It also means the behaviour of the guest IO depends
> on which node it is currently scheduled on.

If Dom0 enabled things in an asymmetric way.

> The IO_ECS setting should be chosen once at boot, made consistent across
> the entire system, and never touched at runtime.

I don't fully agree (but it somewhat depends on what "at boot" is
supposed to mean in your reply): We should leave the choice to Dom0,
but it would probably not hurt to enforce a consistent setting. Doing
this when Dom0 uses the MSR method would be relatively easy, but
then again Dom0 shouldnt use that approach anyway. Doing this
when Dom0 uses PCI config space writes would be less straightforward,
as we'd have to write protect the north bridge devices' config spaces.
Which wouldn't be very useful imo, as Dom0 - if it already uses the
PCI config space approach - would update all north bridges anyway.

> In all cases for guests, we can offer MMCFG even on a system which
> doesn't have IO_ECS, and they will prefer that.  The behaviour of the
> extra 4 bits is reserved, so we could have IO_ECS working in practice
> with no signal.  However, we could equally drop IO_ECS entirely.  Guests
> can't find its setting to begin with, so can't reliably use it, and also
> wouldn't in the presence of MMCFG anyway.

All this assuming that platforms correctly surface MMCFG. This not
having been the case in at least the Fam10 days was one of the
reasons why Linux (and the Xen) gained all this code. Furthermore
to date I didn't think we would surface MMCFG to any of our guests.

Jan
Jan Beulich May 27, 2019, 9:32 a.m. UTC | #3
>>> On 25.03.19 at 09:41,  wrote:
>>>> On 22.03.19 at 20:56, <andrew.cooper3@citrix.com> wrote:
> > On 19/03/2019 09:16, Jan Beulich wrote:
> >> Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
> >> families") converted certain checks for Fam11 to include families all
> >> the way up to Fam17. The commit having no description, it is hard to
> >> tell whether this was a mechanical dec->hex conversion mistake, or
> >> indeed intended. In any event the NB_CFG handling needs to be restricted
> >> to Fam16 and below: Fam17 doesn't have such an MSR anymore.
> >>
> >> A non-MMCFG extended config space access mechanism still appears to
> >> exist, but code to deal with it will need to be written down the road,
> >> when it can actually be tested.
> >>
> >> Reported-by: Pu Wen <puwen@hygon.cn>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Having looked through various spec documents, this is a chronic mess.
> > 
> > First, to NB_CTL MSR itself.  It certainly is documented in Fam15, and
> > is absence in the documentation of Fam17.
> > 
> > In Fam15, it is explicitly documented as an alias of
> > 00:18.3[NB_CFG_LOW/HIGH] which are registers at offset 0x88 and 0x8c in
> > config space.
> > 
> > Fam17 documents that the extended cfc/cf8 mechanism does still exist,
> > and the new controls for that found in 00:18.4[CoreMasterAccessCtrl]
> > with a different bit layout.
> > 
> > Experimentation on a Rome system indicates that NB_CTL is fully
> > read0/write discard, so this patch is probably an improvement.
> 
> Hmm, if it's read-zero / write-discard, then I guess we would better
> emulate that behavior. I simply didn't expect it to be that way, as
> I would generally assume undocumented _and_ unused MSRs to
> raise #GP. In which case the simplest thing to do would be to drop
> the one respective hunk changing emul-priv-op.c:write_msr().
> 
> > Therefore, in principle, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> Well, even without the point above I wouldn't be sure whether
> using an "in principle" ack as justification for committing is
> appropriate. So perhaps (see above and below) we'd better
> settle on something that you could agree to with less
> reservations (by directly adjusting this patch, or by adding a
> 2nd one on top).

I continue to be hesitant to commit with the above kind of
"restricted" ack. Could you clarify if we want to work out a better
solution, or whether I indeed should put in what's there right now?

Thanks, Jan

> > However, the actual code touched is completely insane.
> > 
> > HVM guests have it automatically read0/write discard, even for Intel
> > (citing cross vendor migration).
> > 
> > PV guest handling is complicated.  For CPUs without the MSR, it is read
> > #GP, write discard.  For CPUs which do have the MSR, it is read0/write
> > discard for domU or nonpinned dom0, which is 100% of usecases.  The PV
> > vs HVM differences cause an asymetry for the hardware domain.
> 
> So how about we make it uniformly read-zero / write-discard, as long
> as the emulated CPUID has vendor AMD and family 0x10 and higher?
> Or even independent of vendor and family, to fully match current
> HVM behavior?
> 
> > A pinned PV dom0 is permitted to change just the IO_ECS bit, eliciting a
> > warning but no #GP for modifying other bits.  As NB_CTL is a per-node
> > control (not a per-core control), unless dom0 vcpus == host pcpus, this
> > creates an asymmetry across the system as to whether IO_ECS is enabled
> > or not.
> 
> Correct, but there's no inconsistency from Dom0's POV. Furthermore,
> as you explain further down, the PCI config space method of changing
> the bit was specifically added to Linux to avoid this inconsistency.
> 
> > The HVM IOREQ path, and PV cfg_ok() path, when seeing an IO_ECS access
> > from the guest, reads the MSR every time, which is results in behaviour
> > which doesn't match the settings a guest can see, and comes with a
> > massive perf hit.
> 
> Indeed it should be the guest view of the MSR which ought to be used
> there.
> 
> >  It also means the behaviour of the guest IO depends
> > on which node it is currently scheduled on.
> 
> If Dom0 enabled things in an asymmetric way.
> 
> > The IO_ECS setting should be chosen once at boot, made consistent across
> > the entire system, and never touched at runtime.
> 
> I don't fully agree (but it somewhat depends on what "at boot" is
> supposed to mean in your reply): We should leave the choice to Dom0,
> but it would probably not hurt to enforce a consistent setting. Doing
> this when Dom0 uses the MSR method would be relatively easy, but
> then again Dom0 shouldnt use that approach anyway. Doing this
> when Dom0 uses PCI config space writes would be less straightforward,
> as we'd have to write protect the north bridge devices' config spaces.
> Which wouldn't be very useful imo, as Dom0 - if it already uses the
> PCI config space approach - would update all north bridges anyway.
> 
> > In all cases for guests, we can offer MMCFG even on a system which
> > doesn't have IO_ECS, and they will prefer that.  The behaviour of the
> > extra 4 bits is reserved, so we could have IO_ECS working in practice
> > with no signal.  However, we could equally drop IO_ECS entirely.  Guests
> > can't find its setting to begin with, so can't reliably use it, and also
> > wouldn't in the presence of MMCFG anyway.
> 
> All this assuming that platforms correctly surface MMCFG. This not
> having been the case in at least the Fam10 days was one of the
> reasons why Linux (and the Xen) gained all this code. Furthermore
> to date I didn't think we would surface MMCFG to any of our guests.
> 
> Jan
> 
>
diff mbox series

Patch

--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -1288,7 +1288,7 @@  struct hvm_ioreq_server *hvm_select_iore
              d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
              (x86_fam = get_cpu_family(
                  d->arch.cpuid->basic.raw_fms, NULL, NULL)) > 0x10 &&
-             x86_fam <= 0x17 )
+             x86_fam < 0x17 )
         {
             uint64_t msr_val;
 
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -195,7 +195,7 @@  static bool pci_cfg_ok(struct domain *cu
     /* AMD extended configuration space access? */
     if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
          boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 <= 0x17 )
+         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
     {
         uint64_t msr_val;
 
@@ -1015,7 +1015,7 @@  static int write_msr(unsigned int reg, u
 
     case MSR_AMD64_NB_CFG:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
-             boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 > 0x17 )
+             boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 >= 0x17 )
             break;
         if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) )
             return X86EMUL_OKAY;
@@ -1028,7 +1028,7 @@  static int write_msr(unsigned int reg, u
 
     case MSR_FAM10H_MMIO_CONF_BASE:
         if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
-             boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 > 0x17 )
+             boot_cpu_data.x86 < 0x10 || boot_cpu_data.x86 >= 0x17 )
             break;
         if ( !is_hardware_domain(currd) || !is_pinned_vcpu(curr) )
             return X86EMUL_OKAY;