diff mbox series

x86/pci: Correct ECS handling with CF8/CFC emulation

Message ID 20230331175719.500285-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86/pci: Correct ECS handling with CF8/CFC emulation | expand

Commit Message

Andrew Cooper March 31, 2023, 5:57 p.m. UTC
This started by noticing the dubious Fam17h check in
arch_ioreq_server_get_type_addr(), and then realising that checking the host
CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path.

What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but a)
there isn't one, and b) the vestigial remnants of the cross-vendor migration
logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the
CF8_EXT path unused by any suitably-written OS in the first place.

MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a read-zero,
write-discard stub), and the ECS extension is unconditionally active, meaning
it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs.

It turns out that Xen did even have this behaviour in 4.5 and earlier, with
this problematic CF8_EXT checking being added in 4.6.  Therefore, revert back
to Xen's older behaviour - it is objectively less wrong than the current
logic.

While fixing this, get rid of hvm_pci_decode_addr() - it is more complicated
to follow (and to call) than using the CF8* macros in the calling context.
Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a
comment explaining all about CF8/CFC accesses.

There's one rare case when CF8_EXT is visible to guests, and that is for a
pinned hwdom.  Right now, we permit such a dom0 to modify the CF8_EXT bit, but
this seems like a very unwise idea.  Leave a TODO for people to consider.

Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks")
Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

Whoever reviewed those two patches originally was clearly a fool...
---
 xen/arch/x86/hvm/io.c             | 24 ++++++------------------
 xen/arch/x86/hvm/ioreq.c          | 19 ++-----------------
 xen/arch/x86/include/asm/hvm/io.h |  4 ----
 xen/arch/x86/include/asm/pci.h    | 26 ++++++++++++++++++++++++--
 xen/arch/x86/pv/emul-priv-op.c    | 19 ++++++-------------
 5 files changed, 38 insertions(+), 54 deletions(-)

Comments

Roger Pau Monne April 3, 2023, 8:57 a.m. UTC | #1
On Fri, Mar 31, 2023 at 06:57:19PM +0100, Andrew Cooper wrote:
> This started by noticing the dubious Fam17h check in
> arch_ioreq_server_get_type_addr(), and then realising that checking the host
> CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path.
> 
> What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but a)
> there isn't one, and b) the vestigial remnants of the cross-vendor migration
> logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the
> CF8_EXT path unused by any suitably-written OS in the first place.
> 
> MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a read-zero,
> write-discard stub), and the ECS extension is unconditionally active, meaning
> it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs.
> 
> It turns out that Xen did even have this behaviour in 4.5 and earlier, with
> this problematic CF8_EXT checking being added in 4.6.  Therefore, revert back
> to Xen's older behaviour - it is objectively less wrong than the current
> logic.
> 
> While fixing this, get rid of hvm_pci_decode_addr() - it is more complicated
> to follow (and to call) than using the CF8* macros in the calling context.
> Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a
> comment explaining all about CF8/CFC accesses.
> 
> There's one rare case when CF8_EXT is visible to guests, and that is for a
> pinned hwdom.  Right now, we permit such a dom0 to modify the CF8_EXT bit, but
> this seems like a very unwise idea.  Leave a TODO for people to consider.

One weirdness I've noticed is that for vPCI we decode the accesses
taking the extended CF8 bit after this change, but then if the access
is relayed to the hardware using vpci_{read,write}_hw it will be
forwarded to the hardware using pci_conf_{read,write}<size> which
doesn't have support for CF8_EXT.  So if the underlying hardware
doesn't have MMCFG support and the reg is > 255 it will be dropped.

> Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks")
> Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> 
> Whoever reviewed those two patches originally was clearly a fool...
> ---
>  xen/arch/x86/hvm/io.c             | 24 ++++++------------------
>  xen/arch/x86/hvm/ioreq.c          | 19 ++-----------------
>  xen/arch/x86/include/asm/hvm/io.h |  4 ----
>  xen/arch/x86/include/asm/pci.h    | 26 ++++++++++++++++++++++++--
>  xen/arch/x86/pv/emul-priv-op.c    | 19 ++++++-------------
>  5 files changed, 38 insertions(+), 54 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 5ae209d3b6b3..b0d3c236e985 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -248,20 +248,6 @@ void register_g2m_portio_handler(struct domain *d)
>      handler->ops = &g2m_portio_ops;
>  }
>  
> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> -                                 pci_sbdf_t *sbdf)
> -{
> -    ASSERT(CF8_ENABLED(cf8));
> -
> -    sbdf->bdf = CF8_BDF(cf8);
> -    sbdf->seg = 0;
> -    /*
> -     * NB: the lower 2 bits of the register address are fetched from the
> -     * offset into the 0xcfc register when reading/writing to it.
> -     */
> -    return CF8_ADDR_LO(cf8) | (addr & 3);
> -}
> -
>  /* vPCI config space IO ports handlers (0xcf8/0xcfc). */
>  static bool cf_check vpci_portio_accept(
>      const struct hvm_io_handler *handler, const ioreq_t *p)
> @@ -275,7 +261,7 @@ static int cf_check vpci_portio_read(
>  {
>      const struct domain *d = current->domain;
>      unsigned int reg;
> -    pci_sbdf_t sbdf;
> +    pci_sbdf_t sbdf = {};
>      uint32_t cf8;
>  
>      *data = ~(uint64_t)0;
> @@ -292,7 +278,8 @@ static int cf_check vpci_portio_read(
>      if ( !CF8_ENABLED(cf8) )
>          return X86EMUL_UNHANDLEABLE;
>  
> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
> +    sbdf.bdf = CF8_BDF(cf8);
> +    reg = CF8_REG(cf8) | (addr & 3);
>  
>      if ( !vpci_access_allowed(reg, size) )
>          return X86EMUL_OKAY;
> @@ -308,7 +295,7 @@ static int cf_check vpci_portio_write(
>  {
>      struct domain *d = current->domain;
>      unsigned int reg;
> -    pci_sbdf_t sbdf;
> +    pci_sbdf_t sbdf = {};
>      uint32_t cf8;
>  
>      if ( addr == 0xcf8 )
> @@ -323,7 +310,8 @@ static int cf_check vpci_portio_write(
>      if ( !CF8_ENABLED(cf8) )
>          return X86EMUL_UNHANDLEABLE;
>  
> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
> +    sbdf.bdf = CF8_BDF(cf8);
> +    reg = CF8_REG(cf8) | (addr & 3);
>  
>      if ( !vpci_access_allowed(reg, size) )
>          return X86EMUL_OKAY;
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index 0bdcca1e1a5f..325a9d118e52 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -285,27 +285,12 @@ bool arch_ioreq_server_get_type_addr(const struct domain *d,
>           (p->addr & ~3) == 0xcfc &&
>           CF8_ENABLED(cf8) )
>      {
> -        unsigned int x86_fam, reg;
> -        pci_sbdf_t sbdf;
> -
> -        reg = hvm_pci_decode_addr(cf8, p->addr, &sbdf);
> +        pci_sbdf_t sbdf = { .bdf = CF8_BDF(cf8) };
> +        unsigned int reg = CF8_REG(cf8) | (p->addr & 3);
>  
>          /* PCI config data cycle */
>          *type = XEN_DMOP_IO_RANGE_PCI;
>          *addr = ((uint64_t)sbdf.sbdf << 32) | reg;
> -        /* AMD extended configuration space access? */
> -        if ( CF8_ADDR_HI(cf8) &&
> -             d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
> -             (x86_fam = get_cpu_family(
> -                 d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 &&
> -             x86_fam < 0x17 )
> -        {
> -            uint64_t msr_val;
> -
> -            if ( !rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) &&
> -                 (msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
> -                *addr |= CF8_ADDR_HI(cf8);
> -        }
>      }
>      else
>      {
> diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
> index 54e0161b492c..3f3fb6403ccb 100644
> --- a/xen/arch/x86/include/asm/hvm/io.h
> +++ b/xen/arch/x86/include/asm/hvm/io.h
> @@ -144,10 +144,6 @@ void stdvga_deinit(struct domain *d);
>  
>  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
>  
> -/* Decode a PCI port IO access into a bus/slot/func/reg. */
> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> -                                 pci_sbdf_t *sbdf);
> -
>  /*
>   * HVM port IO handler that performs forwarding of guest IO ports into machine
>   * IO ports.
> diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
> index f4a58c8acf13..3b814f4ebacf 100644
> --- a/xen/arch/x86/include/asm/pci.h
> +++ b/xen/arch/x86/include/asm/pci.h
> @@ -3,10 +3,32 @@
>  
>  #include <xen/mm.h>
>  
> +/*
> + * PCI config space accesses with CF8/CFC:
> + *
> + * 1) Write {Enable | BDF | Reg} to CF8 to set an address
> + * 2) Read or write CF{C..F} to access the register
> + *
> + * For sub-dword register accesses, the bottom two register address bits come
> + * from the CF{C..F} address, not from CF8.
> + *
> + * AMD have extention to this protocol to access PCIe Extend Config Space by
> + * storing the 4 extra register address bits in the penultimate nibble of CF8.
> + * This extention:
> + *  - Is unconditionally active on Fam17h and later
> + *  - Has model specific enablement on Fam10h thru Fam16h
> + *  - Has reserved behaviour in all other cases, including other vendors
> + *
> + * For simplicity and because we are permitted to, given "reserved", Xen
> + * always treats ECS as active when emulating guest PCI config space accesses.
> + */
>  #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
> -#define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
> -#define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
>  #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
> +#define CF8_REG(cf8)                                    \
> +    ({                                                  \
> +        unsigned int _c = cf8;                          \
> +        ((_c & 0x0f000000) >> 16) | (_c & 0xfc);        \
> +    })

What happens on Intel when the bit is set, is it just ignored?

We only allow such accesses for dom0 anyway.

>  
>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
>                          || id == 0x01268086 || id == 0x01028086 \
> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> index 5da00e24e4ff..008367195c78 100644
> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned int start,
>          if ( ro_map && test_bit(machine_bdf, ro_map) )
>              return false;
>      }
> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
> -    /* AMD extended configuration space access? */
> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
> -    {
> -        uint64_t msr_val;
> -
> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
> -            return false;
> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
> -    }
> +    start |= CF8_REG(currd->arch.pci_cf8);
>  
>      return !write ?
>             xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
>          if ( !is_hwdom_pinned_vcpu(curr) )
>              return X86EMUL_OKAY;
>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
> +             /*
> +              * TODO: this is broken.  What happens when dom0 is pinned but
> +              * can't see the full system?  CF8_EXT probably ought to be a
> +              * Xen-owned setting, and made symmetric across the system.
> +              */

I would assume CF8_EXT would be symmetric across the system, specially
given that it seems to be phased out and only used in older AMD
families that where all symmetric?

Thanks, Roger.
Andrew Cooper April 3, 2023, 10:16 a.m. UTC | #2
On 03/04/2023 9:57 am, Roger Pau Monné wrote:
> On Fri, Mar 31, 2023 at 06:57:19PM +0100, Andrew Cooper wrote:
>> This started by noticing the dubious Fam17h check in
>> arch_ioreq_server_get_type_addr(), and then realising that checking the host
>> CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path.
>>
>> What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but a)
>> there isn't one, and b) the vestigial remnants of the cross-vendor migration
>> logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the
>> CF8_EXT path unused by any suitably-written OS in the first place.
>>
>> MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a read-zero,
>> write-discard stub), and the ECS extension is unconditionally active, meaning
>> it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs.
>>
>> It turns out that Xen did even have this behaviour in 4.5 and earlier, with
>> this problematic CF8_EXT checking being added in 4.6.  Therefore, revert back
>> to Xen's older behaviour - it is objectively less wrong than the current
>> logic.
>>
>> While fixing this, get rid of hvm_pci_decode_addr() - it is more complicated
>> to follow (and to call) than using the CF8* macros in the calling context.
>> Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a
>> comment explaining all about CF8/CFC accesses.
>>
>> There's one rare case when CF8_EXT is visible to guests, and that is for a
>> pinned hwdom.  Right now, we permit such a dom0 to modify the CF8_EXT bit, but
>> this seems like a very unwise idea.  Leave a TODO for people to consider.
> One weirdness I've noticed is that for vPCI we decode the accesses
> taking the extended CF8 bit after this change, but then if the access
> is relayed to the hardware using vpci_{read,write}_hw it will be
> forwarded to the hardware using pci_conf_{read,write}<size> which
> doesn't have support for CF8_EXT.  So if the underlying hardware
> doesn't have MMCFG support and the reg is > 255 it will be dropped.

It is important to stress that this change does not influence whether
the guest issues ECS accesses or not.  All it does is change Xen's
handling of such accesses.

Previously vPCI blindly ignored ECS accesses, so the vPCI layer
effectively truncated them to BCS accesses.

Now, from your analysis, when MMCFG isn't active, Xen's PCI layer will
effectively terminate ECS accesses with default behaviour, even on
systems where IO ECS is available.

So we've changed one valid behaviour for a different valid behaviour.


(Quick tangent...  Our PCI handling is currently very dumb. 
pci_mmcfg_read() returns its value by pointer but the callers never
check.  Swapping it to return by value would improve code gen quite a
lot.  Also, when MMCFG is active we still pass BCS accesses to IO ports.)

So I think we do want to improve Xen's general behaviour too, but this
difference here doesn't concern me.

>
>> Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks")
>> Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding")
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>>
>> Whoever reviewed those two patches originally was clearly a fool...
>> ---
>>  xen/arch/x86/hvm/io.c             | 24 ++++++------------------
>>  xen/arch/x86/hvm/ioreq.c          | 19 ++-----------------
>>  xen/arch/x86/include/asm/hvm/io.h |  4 ----
>>  xen/arch/x86/include/asm/pci.h    | 26 ++++++++++++++++++++++++--
>>  xen/arch/x86/pv/emul-priv-op.c    | 19 ++++++-------------
>>  5 files changed, 38 insertions(+), 54 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
>> index 5ae209d3b6b3..b0d3c236e985 100644
>> --- a/xen/arch/x86/hvm/io.c
>> +++ b/xen/arch/x86/hvm/io.c
>> @@ -248,20 +248,6 @@ void register_g2m_portio_handler(struct domain *d)
>>      handler->ops = &g2m_portio_ops;
>>  }
>>  
>> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
>> -                                 pci_sbdf_t *sbdf)
>> -{
>> -    ASSERT(CF8_ENABLED(cf8));
>> -
>> -    sbdf->bdf = CF8_BDF(cf8);
>> -    sbdf->seg = 0;
>> -    /*
>> -     * NB: the lower 2 bits of the register address are fetched from the
>> -     * offset into the 0xcfc register when reading/writing to it.
>> -     */
>> -    return CF8_ADDR_LO(cf8) | (addr & 3);
>> -}
>> -
>>  /* vPCI config space IO ports handlers (0xcf8/0xcfc). */
>>  static bool cf_check vpci_portio_accept(
>>      const struct hvm_io_handler *handler, const ioreq_t *p)
>> @@ -275,7 +261,7 @@ static int cf_check vpci_portio_read(
>>  {
>>      const struct domain *d = current->domain;
>>      unsigned int reg;
>> -    pci_sbdf_t sbdf;
>> +    pci_sbdf_t sbdf = {};
>>      uint32_t cf8;
>>  
>>      *data = ~(uint64_t)0;
>> @@ -292,7 +278,8 @@ static int cf_check vpci_portio_read(
>>      if ( !CF8_ENABLED(cf8) )
>>          return X86EMUL_UNHANDLEABLE;
>>  
>> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
>> +    sbdf.bdf = CF8_BDF(cf8);
>> +    reg = CF8_REG(cf8) | (addr & 3);
>>  
>>      if ( !vpci_access_allowed(reg, size) )
>>          return X86EMUL_OKAY;
>> @@ -308,7 +295,7 @@ static int cf_check vpci_portio_write(
>>  {
>>      struct domain *d = current->domain;
>>      unsigned int reg;
>> -    pci_sbdf_t sbdf;
>> +    pci_sbdf_t sbdf = {};
>>      uint32_t cf8;
>>  
>>      if ( addr == 0xcf8 )
>> @@ -323,7 +310,8 @@ static int cf_check vpci_portio_write(
>>      if ( !CF8_ENABLED(cf8) )
>>          return X86EMUL_UNHANDLEABLE;
>>  
>> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
>> +    sbdf.bdf = CF8_BDF(cf8);
>> +    reg = CF8_REG(cf8) | (addr & 3);
>>  
>>      if ( !vpci_access_allowed(reg, size) )
>>          return X86EMUL_OKAY;
>> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
>> index 0bdcca1e1a5f..325a9d118e52 100644
>> --- a/xen/arch/x86/hvm/ioreq.c
>> +++ b/xen/arch/x86/hvm/ioreq.c
>> @@ -285,27 +285,12 @@ bool arch_ioreq_server_get_type_addr(const struct domain *d,
>>           (p->addr & ~3) == 0xcfc &&
>>           CF8_ENABLED(cf8) )
>>      {
>> -        unsigned int x86_fam, reg;
>> -        pci_sbdf_t sbdf;
>> -
>> -        reg = hvm_pci_decode_addr(cf8, p->addr, &sbdf);
>> +        pci_sbdf_t sbdf = { .bdf = CF8_BDF(cf8) };
>> +        unsigned int reg = CF8_REG(cf8) | (p->addr & 3);
>>  
>>          /* PCI config data cycle */
>>          *type = XEN_DMOP_IO_RANGE_PCI;
>>          *addr = ((uint64_t)sbdf.sbdf << 32) | reg;
>> -        /* AMD extended configuration space access? */
>> -        if ( CF8_ADDR_HI(cf8) &&
>> -             d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
>> -             (x86_fam = get_cpu_family(
>> -                 d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 &&
>> -             x86_fam < 0x17 )
>> -        {
>> -            uint64_t msr_val;
>> -
>> -            if ( !rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) &&
>> -                 (msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
>> -                *addr |= CF8_ADDR_HI(cf8);
>> -        }
>>      }
>>      else
>>      {
>> diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
>> index 54e0161b492c..3f3fb6403ccb 100644
>> --- a/xen/arch/x86/include/asm/hvm/io.h
>> +++ b/xen/arch/x86/include/asm/hvm/io.h
>> @@ -144,10 +144,6 @@ void stdvga_deinit(struct domain *d);
>>  
>>  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
>>  
>> -/* Decode a PCI port IO access into a bus/slot/func/reg. */
>> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
>> -                                 pci_sbdf_t *sbdf);
>> -
>>  /*
>>   * HVM port IO handler that performs forwarding of guest IO ports into machine
>>   * IO ports.
>> diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
>> index f4a58c8acf13..3b814f4ebacf 100644
>> --- a/xen/arch/x86/include/asm/pci.h
>> +++ b/xen/arch/x86/include/asm/pci.h
>> @@ -3,10 +3,32 @@
>>  
>>  #include <xen/mm.h>
>>  
>> +/*
>> + * PCI config space accesses with CF8/CFC:
>> + *
>> + * 1) Write {Enable | BDF | Reg} to CF8 to set an address
>> + * 2) Read or write CF{C..F} to access the register
>> + *
>> + * For sub-dword register accesses, the bottom two register address bits come
>> + * from the CF{C..F} address, not from CF8.
>> + *
>> + * AMD have extention to this protocol to access PCIe Extend Config Space by
>> + * storing the 4 extra register address bits in the penultimate nibble of CF8.
>> + * This extention:
>> + *  - Is unconditionally active on Fam17h and later
>> + *  - Has model specific enablement on Fam10h thru Fam16h
>> + *  - Has reserved behaviour in all other cases, including other vendors
>> + *
>> + * For simplicity and because we are permitted to, given "reserved", Xen
>> + * always treats ECS as active when emulating guest PCI config space accesses.
>> + */
>>  #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
>> -#define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
>> -#define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
>>  #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
>> +#define CF8_REG(cf8)                                    \
>> +    ({                                                  \
>> +        unsigned int _c = cf8;                          \
>> +        ((_c & 0x0f000000) >> 16) | (_c & 0xfc);        \
>> +    })
> What happens on Intel when the bit is set, is it just ignored?

"the bit" => the ECS nibble, or the CF8_EXT bit?

The ECS nibble is ignored on Intel AFAICT, while the CF8_EXT bit is in a
very AMD-only MSR, so won't exist on Intel.

> We only allow such accesses for dom0 anyway.

And guests running on AMD hardware where CF8_EXT is active on the
northbridge of the core we are instantaneously scheduled on.

>>  
>>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
>>                          || id == 0x01268086 || id == 0x01028086 \
>> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
>> index 5da00e24e4ff..008367195c78 100644
>> --- a/xen/arch/x86/pv/emul-priv-op.c
>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned int start,
>>          if ( ro_map && test_bit(machine_bdf, ro_map) )
>>              return false;
>>      }
>> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
>> -    /* AMD extended configuration space access? */
>> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
>> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
>> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
>> -    {
>> -        uint64_t msr_val;
>> -
>> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
>> -            return false;
>> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
>> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
>> -    }
>> +    start |= CF8_REG(currd->arch.pci_cf8);
>>  
>>      return !write ?
>>             xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
>> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
>>          if ( !is_hwdom_pinned_vcpu(curr) )
>>              return X86EMUL_OKAY;
>>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
>> +             /*
>> +              * TODO: this is broken.  What happens when dom0 is pinned but
>> +              * can't see the full system?  CF8_EXT probably ought to be a
>> +              * Xen-owned setting, and made symmetric across the system.
>> +              */
> I would assume CF8_EXT would be symmetric across the system, specially
> given that it seems to be phased out and only used in older AMD
> families that where all symmetric?

The CF8_EXT bit has been phased out.  The IO ECS functionality still exists.

But yes, the more I think about letting dom0 play with this, the more I
think it is a fundamentally broken idea...  I bet it was from the very
early AMD Fam10h days where dom0 knew how to turn it on, and Xen was
trying to pretend it didn't have to touch any PCI devices.

~Andrew
Jan Beulich April 3, 2023, 1:24 p.m. UTC | #3
On 31.03.2023 19:57, Andrew Cooper wrote:
> This started by noticing the dubious Fam17h check in
> arch_ioreq_server_get_type_addr(), and then realising that checking the host
> CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path.
> 
> What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but a)
> there isn't one, and b) the vestigial remnants of the cross-vendor migration
> logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the
> CF8_EXT path unused by any suitably-written OS in the first place.
> 
> MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a read-zero,
> write-discard stub), and the ECS extension is unconditionally active, meaning
> it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs.

I don't buy "unconditionally active". Whether to enable this by default is
up to firmware; I view it as not unlikely that some firmware actually have
a setup control for it. Fam17 controls it through D1[8-F]F4x044:0, while
Fam19 has D1[8-F]F0xC00:0 for this purpose. (I've had a work item for quite
some time to actually make use of this bit to enable ECS.)

> It turns out that Xen did even have this behaviour in 4.5 and earlier, with
> this problematic CF8_EXT checking being added in 4.6.  Therefore, revert back
> to Xen's older behaviour - it is objectively less wrong than the current
> logic.
> 
> While fixing this, get rid of hvm_pci_decode_addr() - it is more complicated
> to follow (and to call) than using the CF8* macros in the calling context.
> Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a
> comment explaining all about CF8/CFC accesses.
> 
> There's one rare case when CF8_EXT is visible to guests, and that is for a
> pinned hwdom.  Right now, we permit such a dom0 to modify the CF8_EXT bit, but
> this seems like a very unwise idea.  Leave a TODO for people to consider.
> 
> Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks")

Therefore I'm not convinced this Fixes: tag is warranted.

> Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding")

I'm also curious which particular aspect of that change you are considering
to be fixed here.

Your claim about behavior being reserved is perhaps okay as far as hardware
is concerned, but by removing the checks you e.g. misguide
xsm_pci_config_permission() in case it handles certain register ranges
specially. Dealing with that was, according to the description, one of the
purposes of the commit above (it's been long enough, so I can only go from
the description in git).

> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -248,20 +248,6 @@ void register_g2m_portio_handler(struct domain *d)
>      handler->ops = &g2m_portio_ops;
>  }
>  
> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> -                                 pci_sbdf_t *sbdf)
> -{
> -    ASSERT(CF8_ENABLED(cf8));
> -
> -    sbdf->bdf = CF8_BDF(cf8);
> -    sbdf->seg = 0;
> -    /*
> -     * NB: the lower 2 bits of the register address are fetched from the
> -     * offset into the 0xcfc register when reading/writing to it.
> -     */
> -    return CF8_ADDR_LO(cf8) | (addr & 3);
> -}

I have to admit that I'm surprised that you fold replacing of this
function (purely mechanical afaict) into a change with a significant
functional change. Wouldn't you agree that this may better be split off?

> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
>          if ( !is_hwdom_pinned_vcpu(curr) )
>              return X86EMUL_OKAY;
>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
> +             /*
> +              * TODO: this is broken.  What happens when dom0 is pinned but
> +              * can't see the full system?  CF8_EXT probably ought to be a
> +              * Xen-owned setting, and made symmetric across the system.
> +              */
>               ((val ^ temp) & ~(1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
>              goto invalid;
>          if ( wrmsr_safe(MSR_AMD64_NB_CFG, val) == 0 )

What do you mean by "can't see the full system"? Originally Linux used
only the MSR view of the bit to enable ECS, but this was specifically
extended to prefer the PCI config space view instead (24d9b70b8c67
"x86: Use PCI method for enabling AMD extended config space before MSR
method").

Since here we're dealing with the MSR flavor, and since Linux shouldn't
be using this anymore (due to checking whether the bit is clear before
trying to set it), we may be okay with simply purging this code if we
don't care about very old Linux Dom0 anymore (or if we use your
argument of it not reliably affecting the entire system).

As to the setting becoming Xen-owned: Dom0 being responsible for
(almost) all of PCI, it would be somewhat odd to take away from it this
level of control.

Jan
Roger Pau Monne April 3, 2023, 1:26 p.m. UTC | #4
On Mon, Apr 03, 2023 at 11:16:52AM +0100, Andrew Cooper wrote:
> On 03/04/2023 9:57 am, Roger Pau Monné wrote:
> > On Fri, Mar 31, 2023 at 06:57:19PM +0100, Andrew Cooper wrote:
> >> This started by noticing the dubious Fam17h check in
> >> arch_ioreq_server_get_type_addr(), and then realising that checking the host
> >> CF8_EXT setting is utterly bogus for the guest <-> qemu emulation path.
> >>
> >> What should be consulted here is the guest's MSR_AMD64_NB_CFG setting, but a)
> >> there isn't one, and b) the vestigial remnants of the cross-vendor migration
> >> logic cause MSR_AMD64_NB_CFG to be unconditionally read-as-zero, making the
> >> CF8_EXT path unused by any suitably-written OS in the first place.
> >>
> >> MSR_AMD64_NB_CFG really has been removed on Fam17h (it's now just a read-zero,
> >> write-discard stub), and the ECS extension is unconditionally active, meaning
> >> it is not correct for Xen to ignore the ExtRegNo field on newer AMD CPUs.
> >>
> >> It turns out that Xen did even have this behaviour in 4.5 and earlier, with
> >> this problematic CF8_EXT checking being added in 4.6.  Therefore, revert back
> >> to Xen's older behaviour - it is objectively less wrong than the current
> >> logic.
> >>
> >> While fixing this, get rid of hvm_pci_decode_addr() - it is more complicated
> >> to follow (and to call) than using the CF8* macros in the calling context.
> >> Rename CF8_ADDR() to CF8_REG() to better describe what it does, and write a
> >> comment explaining all about CF8/CFC accesses.
> >>
> >> There's one rare case when CF8_EXT is visible to guests, and that is for a
> >> pinned hwdom.  Right now, we permit such a dom0 to modify the CF8_EXT bit, but
> >> this seems like a very unwise idea.  Leave a TODO for people to consider.
> > One weirdness I've noticed is that for vPCI we decode the accesses
> > taking the extended CF8 bit after this change, but then if the access
> > is relayed to the hardware using vpci_{read,write}_hw it will be
> > forwarded to the hardware using pci_conf_{read,write}<size> which
> > doesn't have support for CF8_EXT.  So if the underlying hardware
> > doesn't have MMCFG support and the reg is > 255 it will be dropped.
> 
> It is important to stress that this change does not influence whether
> the guest issues ECS accesses or not.  All it does is change Xen's
> handling of such accesses.
> 
> Previously vPCI blindly ignored ECS accesses, so the vPCI layer
> effectively truncated them to BCS accesses.
> 
> Now, from your analysis, when MMCFG isn't active, Xen's PCI layer will
> effectively terminate ECS accesses with default behaviour, even on
> systems where IO ECS is available.
> 
> So we've changed one valid behaviour for a different valid behaviour.

Given vPCI is currently limited to PVH dom0 I doubt there are many
systems capable of running PVH dom0 but not having MMCFG support.

Best way to fix would be to implement ECS accesses in
pci_conf_{read,write}<size>() if the register is > 255 and there's no
MMCFG.

> 
> (Quick tangent...  Our PCI handling is currently very dumb. 
> pci_mmcfg_read() returns its value by pointer but the callers never
> check.  Swapping it to return by value would improve code gen quite a
> lot.  Also, when MMCFG is active we still pass BCS accesses to IO ports.)

I wonder if it's really preferred to access registers below 255 using
the IO ports, as Linux seems to do the same (prefer IO port access if
possible).

> So I think we do want to improve Xen's general behaviour too, but this
> difference here doesn't concern me.
> 
> >
> >> Fixes: e0fbf3bf9871 ("x86/AMD: correct certain Fam17 checks")
> >> Fixes: 2d67a7a4d37a ("x86: synchronize PCI config space access decoding")
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> ---
> >> CC: Jan Beulich <JBeulich@suse.com>
> >> CC: Roger Pau Monné <roger.pau@citrix.com>
> >> CC: Wei Liu <wl@xen.org>
> >>
> >> Whoever reviewed those two patches originally was clearly a fool...
> >> ---
> >>  xen/arch/x86/hvm/io.c             | 24 ++++++------------------
> >>  xen/arch/x86/hvm/ioreq.c          | 19 ++-----------------
> >>  xen/arch/x86/include/asm/hvm/io.h |  4 ----
> >>  xen/arch/x86/include/asm/pci.h    | 26 ++++++++++++++++++++++++--
> >>  xen/arch/x86/pv/emul-priv-op.c    | 19 ++++++-------------
> >>  5 files changed, 38 insertions(+), 54 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> >> index 5ae209d3b6b3..b0d3c236e985 100644
> >> --- a/xen/arch/x86/hvm/io.c
> >> +++ b/xen/arch/x86/hvm/io.c
> >> @@ -248,20 +248,6 @@ void register_g2m_portio_handler(struct domain *d)
> >>      handler->ops = &g2m_portio_ops;
> >>  }
> >>  
> >> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> >> -                                 pci_sbdf_t *sbdf)
> >> -{
> >> -    ASSERT(CF8_ENABLED(cf8));
> >> -
> >> -    sbdf->bdf = CF8_BDF(cf8);
> >> -    sbdf->seg = 0;
> >> -    /*
> >> -     * NB: the lower 2 bits of the register address are fetched from the
> >> -     * offset into the 0xcfc register when reading/writing to it.
> >> -     */
> >> -    return CF8_ADDR_LO(cf8) | (addr & 3);
> >> -}
> >> -
> >>  /* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> >>  static bool cf_check vpci_portio_accept(
> >>      const struct hvm_io_handler *handler, const ioreq_t *p)
> >> @@ -275,7 +261,7 @@ static int cf_check vpci_portio_read(
> >>  {
> >>      const struct domain *d = current->domain;
> >>      unsigned int reg;
> >> -    pci_sbdf_t sbdf;
> >> +    pci_sbdf_t sbdf = {};
> >>      uint32_t cf8;
> >>  
> >>      *data = ~(uint64_t)0;
> >> @@ -292,7 +278,8 @@ static int cf_check vpci_portio_read(
> >>      if ( !CF8_ENABLED(cf8) )
> >>          return X86EMUL_UNHANDLEABLE;
> >>  
> >> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
> >> +    sbdf.bdf = CF8_BDF(cf8);
> >> +    reg = CF8_REG(cf8) | (addr & 3);
> >>  
> >>      if ( !vpci_access_allowed(reg, size) )
> >>          return X86EMUL_OKAY;
> >> @@ -308,7 +295,7 @@ static int cf_check vpci_portio_write(
> >>  {
> >>      struct domain *d = current->domain;
> >>      unsigned int reg;
> >> -    pci_sbdf_t sbdf;
> >> +    pci_sbdf_t sbdf = {};
> >>      uint32_t cf8;
> >>  
> >>      if ( addr == 0xcf8 )
> >> @@ -323,7 +310,8 @@ static int cf_check vpci_portio_write(
> >>      if ( !CF8_ENABLED(cf8) )
> >>          return X86EMUL_UNHANDLEABLE;
> >>  
> >> -    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
> >> +    sbdf.bdf = CF8_BDF(cf8);
> >> +    reg = CF8_REG(cf8) | (addr & 3);
> >>  
> >>      if ( !vpci_access_allowed(reg, size) )
> >>          return X86EMUL_OKAY;
> >> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> >> index 0bdcca1e1a5f..325a9d118e52 100644
> >> --- a/xen/arch/x86/hvm/ioreq.c
> >> +++ b/xen/arch/x86/hvm/ioreq.c
> >> @@ -285,27 +285,12 @@ bool arch_ioreq_server_get_type_addr(const struct domain *d,
> >>           (p->addr & ~3) == 0xcfc &&
> >>           CF8_ENABLED(cf8) )
> >>      {
> >> -        unsigned int x86_fam, reg;
> >> -        pci_sbdf_t sbdf;
> >> -
> >> -        reg = hvm_pci_decode_addr(cf8, p->addr, &sbdf);
> >> +        pci_sbdf_t sbdf = { .bdf = CF8_BDF(cf8) };
> >> +        unsigned int reg = CF8_REG(cf8) | (p->addr & 3);
> >>  
> >>          /* PCI config data cycle */
> >>          *type = XEN_DMOP_IO_RANGE_PCI;
> >>          *addr = ((uint64_t)sbdf.sbdf << 32) | reg;
> >> -        /* AMD extended configuration space access? */
> >> -        if ( CF8_ADDR_HI(cf8) &&
> >> -             d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
> >> -             (x86_fam = get_cpu_family(
> >> -                 d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 &&
> >> -             x86_fam < 0x17 )
> >> -        {
> >> -            uint64_t msr_val;
> >> -
> >> -            if ( !rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) &&
> >> -                 (msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
> >> -                *addr |= CF8_ADDR_HI(cf8);
> >> -        }
> >>      }
> >>      else
> >>      {
> >> diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
> >> index 54e0161b492c..3f3fb6403ccb 100644
> >> --- a/xen/arch/x86/include/asm/hvm/io.h
> >> +++ b/xen/arch/x86/include/asm/hvm/io.h
> >> @@ -144,10 +144,6 @@ void stdvga_deinit(struct domain *d);
> >>  
> >>  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
> >>  
> >> -/* Decode a PCI port IO access into a bus/slot/func/reg. */
> >> -unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> >> -                                 pci_sbdf_t *sbdf);
> >> -
> >>  /*
> >>   * HVM port IO handler that performs forwarding of guest IO ports into machine
> >>   * IO ports.
> >> diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
> >> index f4a58c8acf13..3b814f4ebacf 100644
> >> --- a/xen/arch/x86/include/asm/pci.h
> >> +++ b/xen/arch/x86/include/asm/pci.h
> >> @@ -3,10 +3,32 @@
> >>  
> >>  #include <xen/mm.h>
> >>  
> >> +/*
> >> + * PCI config space accesses with CF8/CFC:
> >> + *
> >> + * 1) Write {Enable | BDF | Reg} to CF8 to set an address
> >> + * 2) Read or write CF{C..F} to access the register
> >> + *
> >> + * For sub-dword register accesses, the bottom two register address bits come
> >> + * from the CF{C..F} address, not from CF8.
> >> + *
> >> + * AMD have extention to this protocol to access PCIe Extend Config Space by
> >> + * storing the 4 extra register address bits in the penultimate nibble of CF8.
> >> + * This extention:
> >> + *  - Is unconditionally active on Fam17h and later
> >> + *  - Has model specific enablement on Fam10h thru Fam16h
> >> + *  - Has reserved behaviour in all other cases, including other vendors
> >> + *
> >> + * For simplicity and because we are permitted to, given "reserved", Xen
> >> + * always treats ECS as active when emulating guest PCI config space accesses.
> >> + */
> >>  #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
> >> -#define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
> >> -#define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
> >>  #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
> >> +#define CF8_REG(cf8)                                    \
> >> +    ({                                                  \
> >> +        unsigned int _c = cf8;                          \
> >> +        ((_c & 0x0f000000) >> 16) | (_c & 0xfc);        \
> >> +    })
> > What happens on Intel when the bit is set, is it just ignored?
> 
> "the bit" => the ECS nibble, or the CF8_EXT bit?

The ECS nibble.

> 
> The ECS nibble is ignored on Intel AFAICT, while the CF8_EXT bit is in a
> very AMD-only MSR, so won't exist on Intel.
> 
> > We only allow such accesses for dom0 anyway.
> 
> And guests running on AMD hardware where CF8_EXT is active on the
> northbridge of the core we are instantaneously scheduled on.
> 
> >>  
> >>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
> >>                          || id == 0x01268086 || id == 0x01028086 \
> >> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> >> index 5da00e24e4ff..008367195c78 100644
> >> --- a/xen/arch/x86/pv/emul-priv-op.c
> >> +++ b/xen/arch/x86/pv/emul-priv-op.c
> >> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned int start,
> >>          if ( ro_map && test_bit(machine_bdf, ro_map) )
> >>              return false;
> >>      }
> >> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
> >> -    /* AMD extended configuration space access? */
> >> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
> >> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> >> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
> >> -    {
> >> -        uint64_t msr_val;
> >> -
> >> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
> >> -            return false;
> >> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
> >> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
> >> -    }
> >> +    start |= CF8_REG(currd->arch.pci_cf8);
> >>  
> >>      return !write ?
> >>             xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
> >> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
> >>          if ( !is_hwdom_pinned_vcpu(curr) )
> >>              return X86EMUL_OKAY;
> >>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
> >> +             /*
> >> +              * TODO: this is broken.  What happens when dom0 is pinned but
> >> +              * can't see the full system?  CF8_EXT probably ought to be a
> >> +              * Xen-owned setting, and made symmetric across the system.
> >> +              */
> > I would assume CF8_EXT would be symmetric across the system, specially
> > given that it seems to be phased out and only used in older AMD
> > families that where all symmetric?
> 
> The CF8_EXT bit has been phased out.  The IO ECS functionality still exists.
> 
> But yes, the more I think about letting dom0 play with this, the more I
> think it is a fundamentally broken idea...  I bet it was from the very
> early AMD Fam10h days where dom0 knew how to turn it on, and Xen was
> trying to pretend it didn't have to touch any PCI devices.

It seems to me Xen should set CF8_EXT on all threads (when available)
and expose it to dom0, so that accesses using pci_{conf,write}_read()
work as expected?

Thanks, Roger.
Andrew Cooper April 4, 2023, 1:27 p.m. UTC | #5
On 03/04/2023 2:26 pm, Roger Pau Monné wrote:
> On Mon, Apr 03, 2023 at 11:16:52AM +0100, Andrew Cooper wrote:
>> On 03/04/2023 9:57 am, Roger Pau Monné wrote:
>> (Quick tangent...  Our PCI handling is currently very dumb. 
>> pci_mmcfg_read() returns its value by pointer but the callers never
>> check.  Swapping it to return by value would improve code gen quite a
>> lot.  Also, when MMCFG is active we still pass BCS accesses to IO ports.)
> I wonder if it's really preferred to access registers below 255 using
> the IO ports, as Linux seems to do the same (prefer IO port access if
> possible).

And see how many attempts there have been to change this, only blocked
on untangling the IO port mess on other architectures (a problem Xen
doesn't have to contend with).

MMCFG, when available is strictly preferable to IO ports.

An MMCFG access is a single UC read or write, whereas IO ports are a
pair of UC accesses *and* a global spinlock.

>>>>  
>>>>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
>>>>                          || id == 0x01268086 || id == 0x01028086 \
>>>> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
>>>> index 5da00e24e4ff..008367195c78 100644
>>>> --- a/xen/arch/x86/pv/emul-priv-op.c
>>>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>>>> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned int start,
>>>>          if ( ro_map && test_bit(machine_bdf, ro_map) )
>>>>              return false;
>>>>      }
>>>> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
>>>> -    /* AMD extended configuration space access? */
>>>> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
>>>> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
>>>> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
>>>> -    {
>>>> -        uint64_t msr_val;
>>>> -
>>>> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
>>>> -            return false;
>>>> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
>>>> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
>>>> -    }
>>>> +    start |= CF8_REG(currd->arch.pci_cf8);
>>>>  
>>>>      return !write ?
>>>>             xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
>>>> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
>>>>          if ( !is_hwdom_pinned_vcpu(curr) )
>>>>              return X86EMUL_OKAY;
>>>>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
>>>> +             /*
>>>> +              * TODO: this is broken.  What happens when dom0 is pinned but
>>>> +              * can't see the full system?  CF8_EXT probably ought to be a
>>>> +              * Xen-owned setting, and made symmetric across the system.
>>>> +              */
>>> I would assume CF8_EXT would be symmetric across the system, specially
>>> given that it seems to be phased out and only used in older AMD
>>> families that where all symmetric?
>> The CF8_EXT bit has been phased out.  The IO ECS functionality still exists.
>>
>> But yes, the more I think about letting dom0 play with this, the more I
>> think it is a fundamentally broken idea...  I bet it was from the very
>> early AMD Fam10h days where dom0 knew how to turn it on, and Xen was
>> trying to pretend it didn't have to touch any PCI devices.
> It seems to me Xen should set CF8_EXT on all threads (when available)
> and expose it to dom0, so that accesses using pci_{conf,write}_read()
> work as expected?

It's per northbridge in the system, not per thread.  Hence needing the
spinlock protecting the CF8/CFC IO port pair access, and why MMCFG is
strictly preferable.

~Andrew
Roger Pau Monne April 4, 2023, 3:04 p.m. UTC | #6
On Tue, Apr 04, 2023 at 02:27:36PM +0100, Andrew Cooper wrote:
> On 03/04/2023 2:26 pm, Roger Pau Monné wrote:
> > On Mon, Apr 03, 2023 at 11:16:52AM +0100, Andrew Cooper wrote:
> >> On 03/04/2023 9:57 am, Roger Pau Monné wrote:
> >> (Quick tangent...  Our PCI handling is currently very dumb. 
> >> pci_mmcfg_read() returns its value by pointer but the callers never
> >> check.  Swapping it to return by value would improve code gen quite a
> >> lot.  Also, when MMCFG is active we still pass BCS accesses to IO ports.)
> > I wonder if it's really preferred to access registers below 255 using
> > the IO ports, as Linux seems to do the same (prefer IO port access if
> > possible).
> 
> And see how many attempts there have been to change this, only blocked
> on untangling the IO port mess on other architectures (a problem Xen
> doesn't have to contend with).
> 
> MMCFG, when available is strictly preferable to IO ports.
> 
> An MMCFG access is a single UC read or write, whereas IO ports are a
> pair of UC accesses *and* a global spinlock.

Right, I know it's better from a performance PoV, but didn't know
whether there was some known glitches for not using MMCFG when
accessing regs < 255.

> >>>>  
> >>>>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
> >>>>                          || id == 0x01268086 || id == 0x01028086 \
> >>>> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> >>>> index 5da00e24e4ff..008367195c78 100644
> >>>> --- a/xen/arch/x86/pv/emul-priv-op.c
> >>>> +++ b/xen/arch/x86/pv/emul-priv-op.c
> >>>> @@ -245,19 +245,7 @@ static bool pci_cfg_ok(struct domain *currd, unsigned int start,
> >>>>          if ( ro_map && test_bit(machine_bdf, ro_map) )
> >>>>              return false;
> >>>>      }
> >>>> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
> >>>> -    /* AMD extended configuration space access? */
> >>>> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
> >>>> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> >>>> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
> >>>> -    {
> >>>> -        uint64_t msr_val;
> >>>> -
> >>>> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
> >>>> -            return false;
> >>>> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
> >>>> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
> >>>> -    }
> >>>> +    start |= CF8_REG(currd->arch.pci_cf8);
> >>>>  
> >>>>      return !write ?
> >>>>             xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
> >>>> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
> >>>>          if ( !is_hwdom_pinned_vcpu(curr) )
> >>>>              return X86EMUL_OKAY;
> >>>>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
> >>>> +             /*
> >>>> +              * TODO: this is broken.  What happens when dom0 is pinned but
> >>>> +              * can't see the full system?  CF8_EXT probably ought to be a
> >>>> +              * Xen-owned setting, and made symmetric across the system.
> >>>> +              */
> >>> I would assume CF8_EXT would be symmetric across the system, specially
> >>> given that it seems to be phased out and only used in older AMD
> >>> families that where all symmetric?
> >> The CF8_EXT bit has been phased out.  The IO ECS functionality still exists.
> >>
> >> But yes, the more I think about letting dom0 play with this, the more I
> >> think it is a fundamentally broken idea...  I bet it was from the very
> >> early AMD Fam10h days where dom0 knew how to turn it on, and Xen was
> >> trying to pretend it didn't have to touch any PCI devices.
> > It seems to me Xen should set CF8_EXT on all threads (when available)
> > and expose it to dom0, so that accesses using pci_{conf,write}_read()
> > work as expected?
> 
> It's per northbridge in the system, not per thread.  Hence needing the
> spinlock protecting the CF8/CFC IO port pair access, and why MMCFG is
> strictly preferable.

So just setting CF8_EXT_ENABLE on MSR_AMD64_NB_CFG by the BSP should
be enough to have it enabled?  I expect all other threads will see the
bit as set in the MSR then.

Thanks, Roger.
Jan Beulich April 4, 2023, 3:18 p.m. UTC | #7
On 04.04.2023 17:04, Roger Pau Monné wrote:
> On Tue, Apr 04, 2023 at 02:27:36PM +0100, Andrew Cooper wrote:
>> On 03/04/2023 2:26 pm, Roger Pau Monné wrote:
>>> On Mon, Apr 03, 2023 at 11:16:52AM +0100, Andrew Cooper wrote:
>>>> On 03/04/2023 9:57 am, Roger Pau Monné wrote:
>>>>>> @@ -1104,6 +1092,11 @@ static int cf_check write_msr(
>>>>>>          if ( !is_hwdom_pinned_vcpu(curr) )
>>>>>>              return X86EMUL_OKAY;
>>>>>>          if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
>>>>>> +             /*
>>>>>> +              * TODO: this is broken.  What happens when dom0 is pinned but
>>>>>> +              * can't see the full system?  CF8_EXT probably ought to be a
>>>>>> +              * Xen-owned setting, and made symmetric across the system.
>>>>>> +              */
>>>>> I would assume CF8_EXT would be symmetric across the system, specially
>>>>> given that it seems to be phased out and only used in older AMD
>>>>> families that where all symmetric?
>>>> The CF8_EXT bit has been phased out.  The IO ECS functionality still exists.
>>>>
>>>> But yes, the more I think about letting dom0 play with this, the more I
>>>> think it is a fundamentally broken idea...  I bet it was from the very
>>>> early AMD Fam10h days where dom0 knew how to turn it on, and Xen was
>>>> trying to pretend it didn't have to touch any PCI devices.
>>> It seems to me Xen should set CF8_EXT on all threads (when available)
>>> and expose it to dom0, so that accesses using pci_{conf,write}_read()
>>> work as expected?
>>
>> It's per northbridge in the system, not per thread.  Hence needing the
>> spinlock protecting the CF8/CFC IO port pair access, and why MMCFG is
>> strictly preferable.
> 
> So just setting CF8_EXT_ENABLE on MSR_AMD64_NB_CFG by the BSP should
> be enough to have it enabled?  I expect all other threads will see the
> bit as set in the MSR then.

No, it's one bit per socket iirc.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 5ae209d3b6b3..b0d3c236e985 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -248,20 +248,6 @@  void register_g2m_portio_handler(struct domain *d)
     handler->ops = &g2m_portio_ops;
 }
 
-unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
-                                 pci_sbdf_t *sbdf)
-{
-    ASSERT(CF8_ENABLED(cf8));
-
-    sbdf->bdf = CF8_BDF(cf8);
-    sbdf->seg = 0;
-    /*
-     * NB: the lower 2 bits of the register address are fetched from the
-     * offset into the 0xcfc register when reading/writing to it.
-     */
-    return CF8_ADDR_LO(cf8) | (addr & 3);
-}
-
 /* vPCI config space IO ports handlers (0xcf8/0xcfc). */
 static bool cf_check vpci_portio_accept(
     const struct hvm_io_handler *handler, const ioreq_t *p)
@@ -275,7 +261,7 @@  static int cf_check vpci_portio_read(
 {
     const struct domain *d = current->domain;
     unsigned int reg;
-    pci_sbdf_t sbdf;
+    pci_sbdf_t sbdf = {};
     uint32_t cf8;
 
     *data = ~(uint64_t)0;
@@ -292,7 +278,8 @@  static int cf_check vpci_portio_read(
     if ( !CF8_ENABLED(cf8) )
         return X86EMUL_UNHANDLEABLE;
 
-    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
+    sbdf.bdf = CF8_BDF(cf8);
+    reg = CF8_REG(cf8) | (addr & 3);
 
     if ( !vpci_access_allowed(reg, size) )
         return X86EMUL_OKAY;
@@ -308,7 +295,7 @@  static int cf_check vpci_portio_write(
 {
     struct domain *d = current->domain;
     unsigned int reg;
-    pci_sbdf_t sbdf;
+    pci_sbdf_t sbdf = {};
     uint32_t cf8;
 
     if ( addr == 0xcf8 )
@@ -323,7 +310,8 @@  static int cf_check vpci_portio_write(
     if ( !CF8_ENABLED(cf8) )
         return X86EMUL_UNHANDLEABLE;
 
-    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
+    sbdf.bdf = CF8_BDF(cf8);
+    reg = CF8_REG(cf8) | (addr & 3);
 
     if ( !vpci_access_allowed(reg, size) )
         return X86EMUL_OKAY;
diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 0bdcca1e1a5f..325a9d118e52 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -285,27 +285,12 @@  bool arch_ioreq_server_get_type_addr(const struct domain *d,
          (p->addr & ~3) == 0xcfc &&
          CF8_ENABLED(cf8) )
     {
-        unsigned int x86_fam, reg;
-        pci_sbdf_t sbdf;
-
-        reg = hvm_pci_decode_addr(cf8, p->addr, &sbdf);
+        pci_sbdf_t sbdf = { .bdf = CF8_BDF(cf8) };
+        unsigned int reg = CF8_REG(cf8) | (p->addr & 3);
 
         /* PCI config data cycle */
         *type = XEN_DMOP_IO_RANGE_PCI;
         *addr = ((uint64_t)sbdf.sbdf << 32) | reg;
-        /* AMD extended configuration space access? */
-        if ( CF8_ADDR_HI(cf8) &&
-             d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
-             (x86_fam = get_cpu_family(
-                 d->arch.cpuid->basic.raw_fms, NULL, NULL)) >= 0x10 &&
-             x86_fam < 0x17 )
-        {
-            uint64_t msr_val;
-
-            if ( !rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) &&
-                 (msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
-                *addr |= CF8_ADDR_HI(cf8);
-        }
     }
     else
     {
diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index 54e0161b492c..3f3fb6403ccb 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -144,10 +144,6 @@  void stdvga_deinit(struct domain *d);
 
 extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
 
-/* Decode a PCI port IO access into a bus/slot/func/reg. */
-unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
-                                 pci_sbdf_t *sbdf);
-
 /*
  * HVM port IO handler that performs forwarding of guest IO ports into machine
  * IO ports.
diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf13..3b814f4ebacf 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -3,10 +3,32 @@ 
 
 #include <xen/mm.h>
 
+/*
+ * PCI config space accesses with CF8/CFC:
+ *
+ * 1) Write {Enable | BDF | Reg} to CF8 to set an address
+ * 2) Read or write CF{C..F} to access the register
+ *
+ * For sub-dword register accesses, the bottom two register address bits come
+ * from the CF{C..F} address, not from CF8.
+ *
+ * AMD have extention to this protocol to access PCIe Extend Config Space by
+ * storing the 4 extra register address bits in the penultimate nibble of CF8.
+ * This extention:
+ *  - Is unconditionally active on Fam17h and later
+ *  - Has model specific enablement on Fam10h thru Fam16h
+ *  - Has reserved behaviour in all other cases, including other vendors
+ *
+ * For simplicity and because we are permitted to, given "reserved", Xen
+ * always treats ECS as active when emulating guest PCI config space accesses.
+ */
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
-#define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
-#define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
 #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
+#define CF8_REG(cf8)                                    \
+    ({                                                  \
+        unsigned int _c = cf8;                          \
+        ((_c & 0x0f000000) >> 16) | (_c & 0xfc);        \
+    })
 
 #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
                         || id == 0x01268086 || id == 0x01028086 \
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 5da00e24e4ff..008367195c78 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -245,19 +245,7 @@  static bool pci_cfg_ok(struct domain *currd, unsigned int start,
         if ( ro_map && test_bit(machine_bdf, ro_map) )
             return false;
     }
-    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
-    /* AMD extended configuration space access? */
-    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
-         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 < 0x17 )
-    {
-        uint64_t msr_val;
-
-        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
-            return false;
-        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
-            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
-    }
+    start |= CF8_REG(currd->arch.pci_cf8);
 
     return !write ?
            xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
@@ -1104,6 +1092,11 @@  static int cf_check write_msr(
         if ( !is_hwdom_pinned_vcpu(curr) )
             return X86EMUL_OKAY;
         if ( (rdmsr_safe(MSR_AMD64_NB_CFG, temp) != 0) ||
+             /*
+              * TODO: this is broken.  What happens when dom0 is pinned but
+              * can't see the full system?  CF8_EXT probably ought to be a
+              * Xen-owned setting, and made symmetric across the system.
+              */
              ((val ^ temp) & ~(1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)) )
             goto invalid;
         if ( wrmsr_safe(MSR_AMD64_NB_CFG, val) == 0 )