x86: retrieve and log CPU frequency information

Message ID	1fd091d2-30e2-0691-0485-3f5142bd457f@suse.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=NS8K=57=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1F87B20737 To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org> From: Jan Beulich <jbeulich@suse.com> Subject: [PATCH] x86: retrieve and log CPU frequency information Message-ID: <1fd091d2-30e2-0691-0485-3f5142bd457f@suse.com> Date: Wed, 15 Apr 2020 13:55:24 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: list Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Wei Liu <wl@xen.org>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	x86: retrieve and log CPU frequency information \| expand x86: retrieve and log CPU frequency information

Jan Beulich April 15, 2020, 11:55 a.m. UTC

While from just a single Skylake system it is already clear that we
can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
documented to be used for display purposes only anyway), logging this
information may still give us some reference in case of problems as well
as for future work. Additionally on the AMD side it is unclear whether
the deviation between reported and measured frequencies is because of us
not doing well, or because of nominal and actual frequencies being quite
far apart.

The chosen variable naming in amd_log_freq() has pointed out a naming
problem in rdmsr_safe(), which is being taken care of at the same time.
Symmetrically wrmsr_safe(), being an inline function, also gets an
unnecessary underscore dropped from one of its local variables.

[1] With a core crystal clock of 24MHz and a ratio of 216/2, the
    reported frequency nevertheless is 2600MHz, rather than the to be
    expected (and calibrated by both us and Linux) 2592MHz.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: The node ID retrieval using extended leaf 1E implies it won't work
     on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
     which doesn't get advertised on my Fam10 box (and it's zero on all
     processors despite there being two nodes as per the PCI device
     map), and which isn't even documented for Fam11, Fam12, and Fam14,
     I didn't find any other means to retrieve the node ID a CPU is
     associated with - the NodeId register in PCI config space depends
     on one already knowing the node ID for doing the access, as the
     device to be used is a function of the node ID.

Roger Pau Monné May 14, 2020, 1:10 p.m. UTC | #1

On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> While from just a single Skylake system it is already clear that we
> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> documented to be used for display purposes only anyway), logging this
> information may still give us some reference in case of problems as well
> as for future work. Additionally on the AMD side it is unclear whether
> the deviation between reported and measured frequencies is because of us
> not doing well, or because of nominal and actual frequencies being quite
> far apart.

Can you add some reference to the AMD implementation? I've looked at
the PMs and haven't been able to find a description of some of the
MSRs, like 0xC0010064.

> The chosen variable naming in amd_log_freq() has pointed out a naming
> problem in rdmsr_safe(), which is being taken care of at the same time.
> Symmetrically wrmsr_safe(), being an inline function, also gets an
> unnecessary underscore dropped from one of its local variables.
> 
> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
>     reported frequency nevertheless is 2600MHz, rather than the to be
>     expected (and calibrated by both us and Linux) 2592MHz.
> 
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: The node ID retrieval using extended leaf 1E implies it won't work
>      on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
>      which doesn't get advertised on my Fam10 box (and it's zero on all
>      processors despite there being two nodes as per the PCI device
>      map), and which isn't even documented for Fam11, Fam12, and Fam14,
>      I didn't find any other means to retrieve the node ID a CPU is
>      associated with - the NodeId register in PCI config space depends
>      on one already knowing the node ID for doing the access, as the
>      device to be used is a function of the node ID.
> 
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui
>                                                            : c->cpu_core_id);
>  }
>  
> +void amd_log_freq(const struct cpuinfo_x86 *c)
> +{
> +	unsigned int idx = 0, h;
> +	uint64_t hi, lo, val;
> +
> +	if (c->x86 < 0x10 || c->x86 > 0x19 ||
> +	    (c != &boot_cpu_data &&
> +	     (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
> +		return;
> +
> +	if (c->x86 < 0x17) {
> +		unsigned int node = 0;
> +		uint64_t nbcfg;
> +
> +		/*
> +		 * Make an attempt at determining the node ID, but assume
> +		 * symmetric setup (using node 0) if this fails.
> +		 */
> +		if (c->extended_cpuid_level >= 0x8000001e &&
> +		    cpu_has(c, X86_FEATURE_TOPOEXT)) {
> +			node = cpuid_ecx(0x8000001e) & 0xff;
> +			if (node > 7)
> +				node = 0;
> +		} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
> +			rdmsrl(0xC001100C, val);
> +			node = val & 7;
> +		}
> +
> +		/*
> +		 * Enable (and use) Extended Config Space accesses, as we
> +		 * can't be certain that MCFG is available here during boot.
> +		 */
> +		rdmsrl(MSR_AMD64_NB_CFG, nbcfg);
> +		wrmsrl(MSR_AMD64_NB_CFG,
> +		       nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT));
> +#define PCI_ECS_ADDRESS(sbdf, reg) \
> +    (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << 16))
> +
> +		for ( ; ; ) {
> +			pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4);
> +
> +			switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) {
> +			case 0x00000000:
> +			case 0xffffffff:
> +				/* No device at this SBDF. */
> +				if (!node)
> +					break;
> +				node = 0;
> +				continue;
> +
> +			default:
> +				/*
> +				 * Core Performance Boost Control, family
> +				 * dependent up to 3 bits starting at bit 2.
> +				 */
> +				switch (c->x86) {
> +				case 0x10: idx = 1; break;
> +				case 0x12: idx = 7; break;
> +				case 0x14: idx = 7; break;
> +				case 0x15: idx = 7; break;
> +				case 0x16: idx = 7; break;
> +				}
> +				idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf,
> +				                                     0x15c),
> +				                     0, 4) >> 2;
> +				break;
> +			}
> +			break;
> +		}
> +
> +#undef PCI_ECS_ADDRESS
> +		wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
> +	}
> +
> +	lo = 0; /* gcc may not recognize the loop having at least 5 iterations */
> +	for (h = c->x86 == 0x10 ? 5 : 8; h--; )
> +		if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
> +			break;
> +	if (!(lo >> 63))
> +		return;
> +
> +#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) & 7) \
> +		                     : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
> +	if (idx && idx < h &&
> +	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
> +	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
> +		printk("CPU%u: %lu (%lu..%lu) MHz\n",
> +		       smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
> +	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
> +		printk("CPU%u: %lu..%lu MHz\n",
> +		       smp_processor_id(), FREQ(lo), FREQ(hi));
> +	else
> +		printk("CPU%u: %lu MHz\n", smp_processor_id(), FREQ(lo));
> +#undef FREQ
> +}
> +
>  void early_init_amd(struct cpuinfo_x86 *c)
>  {
>  	if (c == &boot_cpu_data)
> @@ -803,6 +899,8 @@ static void init_amd(struct cpuinfo_x86
>  		disable_c1_ramping();
>  
>  	check_syscfg_dram_mod_en();
> +
> +	amd_log_freq(c);
>  }
>  
>  const struct cpu_dev amd_cpu_dev = {
> --- a/xen/arch/x86/cpu/cpu.h
> +++ b/xen/arch/x86/cpu/cpu.h
> @@ -19,3 +19,4 @@ extern void detect_ht(struct cpuinfo_x86
>  extern bool detect_extended_topology(struct cpuinfo_x86 *c);
>  
>  void early_init_amd(struct cpuinfo_x86 *c);
> +void amd_log_freq(const struct cpuinfo_x86 *c);
> --- a/xen/arch/x86/cpu/hygon.c
> +++ b/xen/arch/x86/cpu/hygon.c
> @@ -99,6 +99,8 @@ static void init_hygon(struct cpuinfo_x8
>  		value |= (1 << 27); /* Enable read-only APERF/MPERF bit */
>  		wrmsrl(MSR_K7_HWCR, value);
>  	}
> +
> +	amd_log_freq(c);
>  }
>  
>  const struct cpu_dev hygon_cpu_dev = {
> --- a/xen/arch/x86/cpu/intel.c
> +++ b/xen/arch/x86/cpu/intel.c
> @@ -378,6 +378,72 @@ static void init_intel(struct cpuinfo_x8
>  	     ( c->cpuid_level >= 0x00000006 ) &&
>  	     ( cpuid_eax(0x00000006) & (1u<<2) ) )
>  		__set_bit(X86_FEATURE_ARAT, c->x86_capability);
> +

I would split this into a separate helper, ie: intel_log_freq. That
will allow you to exit early and reduce some of the indentation IMO.

> +    if ( (opt_cpu_info && !(c->apicid & (c->x86_num_siblings - 1))) ||
> +         c == &boot_cpu_data )
> +    {
> +        unsigned int eax, ebx, ecx, edx;
> +        uint64_t msrval;
> +
> +        if ( c->cpuid_level >= 0x15 )
> +        {
> +            cpuid(0x15, &eax, &ebx, &ecx, &edx);
> +            if ( ecx && ebx && eax )
> +            {
> +                unsigned long long val = ecx;
> +
> +                val *= ebx;
> +                do_div(val, eax);
> +                printk("CPU%u: TSC: %uMHz * %u / %u = %LuMHz\n",
> +                       smp_processor_id(), ecx, ebx, eax, val);
> +            }
> +            else if ( ecx | eax | ebx )
> +            {
> +                printk("CPU%u: TSC:", smp_processor_id());
> +                if ( ecx )
> +                    printk(" core: %uMHz", ecx);
> +                if ( ebx && eax )
> +                    printk(" ratio: %u / %u", ebx, eax);
> +                printk("\n");
> +            }
> +        }
> +
> +        if ( c->cpuid_level >= 0x16 )
> +        {
> +            cpuid(0x16, &eax, &ebx, &ecx, &edx);
> +            if ( ecx | eax | ebx )
> +            {
> +                printk("CPU%u:", smp_processor_id());
> +                if ( ecx )
> +                    printk(" bus: %uMHz", ecx);
> +                if ( eax )
> +                    printk(" base: %uMHz", eax);
> +                if ( ebx )
> +                    printk(" max: %uMHz", ebx);
> +                printk("\n");
> +            }
> +        }
> +
> +        if ( !rdmsr_safe(MSR_INTEL_PLATFORM_INFO, msrval) &&
> +             (uint8_t)(msrval >> 8) )

I would introduce a mask for it would be cleaner, since you use it
here and below (and would avoid the casting to uint8_t.

> +        {
> +            unsigned int factor = 10000;
> +
> +            if ( c->x86 == 6 )
> +                switch ( c->x86_model )
> +                {
> +                case 0x1a: case 0x1e: case 0x1f: case 0x2e: /* Nehalem */
> +                case 0x25: case 0x2c: case 0x2f: /* Westmere */
> +                    factor = 13333;

The SDM lists ratio * 100MHz without any notes, why are those models
different, is this some errata?

> +                    break;
> +                }
> +
> +            printk("CPU%u: ", smp_processor_id());
> +            if ( (uint8_t)(msrval >> 40) )
> +                printk("%u..", (factor * (uint8_t)(msrval >> 40) + 50) / 100);
> +            printk("%u MHz\n", (factor * (uint8_t)(msrval >> 8) + 50) / 100);

Since you are calculating using Hz, should you use an unsigned long
factor to prevent capping at 4GHz?

> +        }
> +    }
>  }
>  
>  const struct cpu_dev intel_cpu_dev = {
> --- a/xen/include/asm-x86/msr.h
> +++ b/xen/include/asm-x86/msr.h
> @@ -40,8 +40,8 @@ static inline void wrmsrl(unsigned int m
>  
>  /* rdmsr with exception handling */
>  #define rdmsr_safe(msr,val) ({\
> -    int _rc; \
> -    uint32_t lo, hi; \
> +    int rc_; \
> +    uint32_t lo_, hi_; \
>      __asm__ __volatile__( \
>          "1: rdmsr\n2:\n" \
>          ".section .fixup,\"ax\"\n" \
> @@ -49,15 +49,15 @@ static inline void wrmsrl(unsigned int m
>          "   movl %5,%2\n; jmp 2b\n" \
>          ".previous\n" \
>          _ASM_EXTABLE(1b, 3b) \
> -        : "=a" (lo), "=d" (hi), "=&r" (_rc) \
> +        : "=a" (lo_), "=d" (hi_), "=&r" (rc_) \
>          : "c" (msr), "2" (0), "i" (-EFAULT)); \
> -    val = lo | ((uint64_t)hi << 32); \
> -    _rc; })
> +    val = lo_ | ((uint64_t)hi_ << 32); \
> +    rc_; })

Since you are changing the local variable names, I would just switch
rdmsr_safe to a static inline, and drop the underlines. I don't see a
reason this has to stay as a macro.

>  
>  /* wrmsr with exception handling */
>  static inline int wrmsr_safe(unsigned int msr, uint64_t val)
>  {
> -    int _rc;
> +    int rc;
>      uint32_t lo, hi;
>      lo = (uint32_t)val;
>      hi = (uint32_t)(val >> 32);

Since you are already playing with this, could you initialize lo and
hi at definition time?

Thanks, Roger.

Jan Beulich May 14, 2020, 1:38 p.m. UTC | #2

On 14.05.2020 15:10, Roger Pau Monné wrote:
> On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
>> While from just a single Skylake system it is already clear that we
>> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
>> documented to be used for display purposes only anyway), logging this
>> information may still give us some reference in case of problems as well
>> as for future work. Additionally on the AMD side it is unclear whether
>> the deviation between reported and measured frequencies is because of us
>> not doing well, or because of nominal and actual frequencies being quite
>> far apart.
> 
> Can you add some reference to the AMD implementation? I've looked at
> the PMs and haven't been able to find a description of some of the
> MSRs, like 0xC0010064.

Take a look at

https://developer.amd.com/resources/developer-guides-manuals/

I'm unconvinced a reference needs adding here.

>> --- a/xen/arch/x86/cpu/intel.c
>> +++ b/xen/arch/x86/cpu/intel.c
>> @@ -378,6 +378,72 @@ static void init_intel(struct cpuinfo_x8
>>  	     ( c->cpuid_level >= 0x00000006 ) &&
>>  	     ( cpuid_eax(0x00000006) & (1u<<2) ) )
>>  		__set_bit(X86_FEATURE_ARAT, c->x86_capability);
>> +
> 
> I would split this into a separate helper, ie: intel_log_freq. That
> will allow you to exit early and reduce some of the indentation IMO.

Can do; splitting this for AMD/Hygon however was merely to
facilitate using it for both vendors, though.

>> +    if ( (opt_cpu_info && !(c->apicid & (c->x86_num_siblings - 1))) ||
>> +         c == &boot_cpu_data )
>> +    {
>> +        unsigned int eax, ebx, ecx, edx;
>> +        uint64_t msrval;
>> +
>> +        if ( c->cpuid_level >= 0x15 )
>> +        {
>> +            cpuid(0x15, &eax, &ebx, &ecx, &edx);
>> +            if ( ecx && ebx && eax )
>> +            {
>> +                unsigned long long val = ecx;
>> +
>> +                val *= ebx;
>> +                do_div(val, eax);
>> +                printk("CPU%u: TSC: %uMHz * %u / %u = %LuMHz\n",
>> +                       smp_processor_id(), ecx, ebx, eax, val);
>> +            }
>> +            else if ( ecx | eax | ebx )
>> +            {
>> +                printk("CPU%u: TSC:", smp_processor_id());
>> +                if ( ecx )
>> +                    printk(" core: %uMHz", ecx);
>> +                if ( ebx && eax )
>> +                    printk(" ratio: %u / %u", ebx, eax);
>> +                printk("\n");
>> +            }
>> +        }
>> +
>> +        if ( c->cpuid_level >= 0x16 )
>> +        {
>> +            cpuid(0x16, &eax, &ebx, &ecx, &edx);
>> +            if ( ecx | eax | ebx )
>> +            {
>> +                printk("CPU%u:", smp_processor_id());
>> +                if ( ecx )
>> +                    printk(" bus: %uMHz", ecx);
>> +                if ( eax )
>> +                    printk(" base: %uMHz", eax);
>> +                if ( ebx )
>> +                    printk(" max: %uMHz", ebx);
>> +                printk("\n");
>> +            }
>> +        }
>> +
>> +        if ( !rdmsr_safe(MSR_INTEL_PLATFORM_INFO, msrval) &&
>> +             (uint8_t)(msrval >> 8) )
> 
> I would introduce a mask for it would be cleaner, since you use it
> here and below (and would avoid the casting to uint8_t.

To avoid the casts (also below) I could introduce local variables.
I specifically wanted to avoid MASK_EXTR() such that the rest of the
calculations in

            if ( (uint8_t)(msrval >> 40) )
                printk("%u..", (factor * (uint8_t)(msrval >> 40) + 50) / 100);
            printk("%u MHz\n", (factor * (uint8_t)(msrval >> 8) + 50) / 100);

can be done as 32-bit arithmetic.

>> +        {
>> +            unsigned int factor = 10000;
>> +
>> +            if ( c->x86 == 6 )
>> +                switch ( c->x86_model )
>> +                {
>> +                case 0x1a: case 0x1e: case 0x1f: case 0x2e: /* Nehalem */
>> +                case 0x25: case 0x2c: case 0x2f: /* Westmere */
>> +                    factor = 13333;
> 
> The SDM lists ratio * 100MHz without any notes, why are those models
> different, is this some errata?

Did you go through the MSR lists for the various models? It's there
where I found this anomaly, not in any spec updates.

>> +                    break;
>> +                }
>> +
>> +            printk("CPU%u: ", smp_processor_id());
>> +            if ( (uint8_t)(msrval >> 40) )
>> +                printk("%u..", (factor * (uint8_t)(msrval >> 40) + 50) / 100);
>> +            printk("%u MHz\n", (factor * (uint8_t)(msrval >> 8) + 50) / 100);
> 
> Since you are calculating using Hz, should you use an unsigned long
> factor to prevent capping at 4GHz?

Hmm, the calculation looks to be in units of 10kHz, until the division
by 100. I don't think we'd cap at 4GHz this way.

>> --- a/xen/include/asm-x86/msr.h
>> +++ b/xen/include/asm-x86/msr.h
>> @@ -40,8 +40,8 @@ static inline void wrmsrl(unsigned int m
>>  
>>  /* rdmsr with exception handling */
>>  #define rdmsr_safe(msr,val) ({\
>> -    int _rc; \
>> -    uint32_t lo, hi; \
>> +    int rc_; \
>> +    uint32_t lo_, hi_; \
>>      __asm__ __volatile__( \
>>          "1: rdmsr\n2:\n" \
>>          ".section .fixup,\"ax\"\n" \
>> @@ -49,15 +49,15 @@ static inline void wrmsrl(unsigned int m
>>          "   movl %5,%2\n; jmp 2b\n" \
>>          ".previous\n" \
>>          _ASM_EXTABLE(1b, 3b) \
>> -        : "=a" (lo), "=d" (hi), "=&r" (_rc) \
>> +        : "=a" (lo_), "=d" (hi_), "=&r" (rc_) \
>>          : "c" (msr), "2" (0), "i" (-EFAULT)); \
>> -    val = lo | ((uint64_t)hi << 32); \
>> -    _rc; })
>> +    val = lo_ | ((uint64_t)hi_ << 32); \
>> +    rc_; })
> 
> Since you are changing the local variable names, I would just switch
> rdmsr_safe to a static inline, and drop the underlines. I don't see a
> reason this has to stay as a macro.

Well, all callers would need to be changed to pass the address of
the variable to store the value read into. That's quite a bit of
code churn, and hence nothing I'd want to do in this patch.

>>  /* wrmsr with exception handling */
>>  static inline int wrmsr_safe(unsigned int msr, uint64_t val)
>>  {
>> -    int _rc;
>> +    int rc;
>>      uint32_t lo, hi;
>>      lo = (uint32_t)val;
>>      hi = (uint32_t)(val >> 32);
> 
> Since you are already playing with this, could you initialize lo and
> hi at definition time?

If I was touching any of the three lines anyway - yes. But the way
it is it looks like an unrelated change to me.

Jan

Roger Pau Monné May 14, 2020, 3:32 p.m. UTC | #3

On Thu, May 14, 2020 at 03:38:18PM +0200, Jan Beulich wrote:
> On 14.05.2020 15:10, Roger Pau Monné wrote:
> > On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> >> While from just a single Skylake system it is already clear that we
> >> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> >> documented to be used for display purposes only anyway), logging this
> >> information may still give us some reference in case of problems as well
> >> as for future work. Additionally on the AMD side it is unclear whether
> >> the deviation between reported and measured frequencies is because of us
> >> not doing well, or because of nominal and actual frequencies being quite
> >> far apart.
> > 
> > Can you add some reference to the AMD implementation? I've looked at
> > the PMs and haven't been able to find a description of some of the
> > MSRs, like 0xC0010064.
> 
> Take a look at
> 
> https://developer.amd.com/resources/developer-guides-manuals/
> 
> I'm unconvinced a reference needs adding here.

Do you think it would be sensible to introduce some defines for at
least 0xC0010064? (ie: MSR_AMD_PSTATE_DEF_BASE)

I think it would make it easier to find on the manuals.

> 
> >> --- a/xen/arch/x86/cpu/intel.c
> >> +++ b/xen/arch/x86/cpu/intel.c
> >> @@ -378,6 +378,72 @@ static void init_intel(struct cpuinfo_x8
> >>  	     ( c->cpuid_level >= 0x00000006 ) &&
> >>  	     ( cpuid_eax(0x00000006) & (1u<<2) ) )
> >>  		__set_bit(X86_FEATURE_ARAT, c->x86_capability);
> >> +
> > 
> > I would split this into a separate helper, ie: intel_log_freq. That
> > will allow you to exit early and reduce some of the indentation IMO.
> 
> Can do; splitting this for AMD/Hygon however was merely to
> facilitate using it for both vendors, though.
> 
> >> +    if ( (opt_cpu_info && !(c->apicid & (c->x86_num_siblings - 1))) ||
> >> +         c == &boot_cpu_data )
> >> +    {
> >> +        unsigned int eax, ebx, ecx, edx;
> >> +        uint64_t msrval;
> >> +
> >> +        if ( c->cpuid_level >= 0x15 )
> >> +        {
> >> +            cpuid(0x15, &eax, &ebx, &ecx, &edx);
> >> +            if ( ecx && ebx && eax )
> >> +            {
> >> +                unsigned long long val = ecx;
> >> +
> >> +                val *= ebx;
> >> +                do_div(val, eax);
> >> +                printk("CPU%u: TSC: %uMHz * %u / %u = %LuMHz\n",
> >> +                       smp_processor_id(), ecx, ebx, eax, val);
> >> +            }
> >> +            else if ( ecx | eax | ebx )
> >> +            {
> >> +                printk("CPU%u: TSC:", smp_processor_id());
> >> +                if ( ecx )
> >> +                    printk(" core: %uMHz", ecx);
> >> +                if ( ebx && eax )
> >> +                    printk(" ratio: %u / %u", ebx, eax);
> >> +                printk("\n");
> >> +            }
> >> +        }
> >> +
> >> +        if ( c->cpuid_level >= 0x16 )
> >> +        {
> >> +            cpuid(0x16, &eax, &ebx, &ecx, &edx);
> >> +            if ( ecx | eax | ebx )
> >> +            {
> >> +                printk("CPU%u:", smp_processor_id());
> >> +                if ( ecx )
> >> +                    printk(" bus: %uMHz", ecx);
> >> +                if ( eax )
> >> +                    printk(" base: %uMHz", eax);
> >> +                if ( ebx )
> >> +                    printk(" max: %uMHz", ebx);
> >> +                printk("\n");
> >> +            }
> >> +        }
> >> +
> >> +        if ( !rdmsr_safe(MSR_INTEL_PLATFORM_INFO, msrval) &&
> >> +             (uint8_t)(msrval >> 8) )
> > 
> > I would introduce a mask for it would be cleaner, since you use it
> > here and below (and would avoid the casting to uint8_t.
> 
> To avoid the casts (also below) I could introduce local variables.
> I specifically wanted to avoid MASK_EXTR() such that the rest of the
> calculations in
> 
>             if ( (uint8_t)(msrval >> 40) )
>                 printk("%u..", (factor * (uint8_t)(msrval >> 40) + 50) / 100);
>             printk("%u MHz\n", (factor * (uint8_t)(msrval >> 8) + 50) / 100);
> 
> can be done as 32-bit arithmetic.

Might be cleaner with the local variables.

> >> +        {
> >> +            unsigned int factor = 10000;
> >> +
> >> +            if ( c->x86 == 6 )
> >> +                switch ( c->x86_model )
> >> +                {
> >> +                case 0x1a: case 0x1e: case 0x1f: case 0x2e: /* Nehalem */
> >> +                case 0x25: case 0x2c: case 0x2f: /* Westmere */
> >> +                    factor = 13333;
> > 
> > The SDM lists ratio * 100MHz without any notes, why are those models
> > different, is this some errata?
> 
> Did you go through the MSR lists for the various models? It's there
> where I found this anomaly, not in any spec updates.

My bad, I was looking at the Atom table I think, and didn't realize
they where multiple tables instead of a single table with different
notes for models.

> 
> >> +                    break;
> >> +                }
> >> +
> >> +            printk("CPU%u: ", smp_processor_id());
> >> +            if ( (uint8_t)(msrval >> 40) )
> >> +                printk("%u..", (factor * (uint8_t)(msrval >> 40) + 50) / 100);
> >> +            printk("%u MHz\n", (factor * (uint8_t)(msrval >> 8) + 50) / 100);
> > 
> > Since you are calculating using Hz, should you use an unsigned long
> > factor to prevent capping at 4GHz?
> 
> Hmm, the calculation looks to be in units of 10kHz, until the division
> by 100. I don't think we'd cap at 4GHz this way.

Oh yes, sorry, it's kHz, not Hz.

> 
> >> --- a/xen/include/asm-x86/msr.h
> >> +++ b/xen/include/asm-x86/msr.h
> >> @@ -40,8 +40,8 @@ static inline void wrmsrl(unsigned int m
> >>  
> >>  /* rdmsr with exception handling */
> >>  #define rdmsr_safe(msr,val) ({\
> >> -    int _rc; \
> >> -    uint32_t lo, hi; \
> >> +    int rc_; \
> >> +    uint32_t lo_, hi_; \
> >>      __asm__ __volatile__( \
> >>          "1: rdmsr\n2:\n" \
> >>          ".section .fixup,\"ax\"\n" \
> >> @@ -49,15 +49,15 @@ static inline void wrmsrl(unsigned int m
> >>          "   movl %5,%2\n; jmp 2b\n" \
> >>          ".previous\n" \
> >>          _ASM_EXTABLE(1b, 3b) \
> >> -        : "=a" (lo), "=d" (hi), "=&r" (_rc) \
> >> +        : "=a" (lo_), "=d" (hi_), "=&r" (rc_) \
> >>          : "c" (msr), "2" (0), "i" (-EFAULT)); \
> >> -    val = lo | ((uint64_t)hi << 32); \
> >> -    _rc; })
> >> +    val = lo_ | ((uint64_t)hi_ << 32); \
> >> +    rc_; })
> > 
> > Since you are changing the local variable names, I would just switch
> > rdmsr_safe to a static inline, and drop the underlines. I don't see a
> > reason this has to stay as a macro.
> 
> Well, all callers would need to be changed to pass the address of
> the variable to store the value read into. That's quite a bit of
> code churn, and hence nothing I'd want to do in this patch.

Oh, right, didn't realize it's a macro for that reason.

Thanks, Roger.

Jan Beulich May 14, 2020, 3:50 p.m. UTC | #4

On 14.05.2020 17:32, Roger Pau Monné wrote:
> On Thu, May 14, 2020 at 03:38:18PM +0200, Jan Beulich wrote:
>> On 14.05.2020 15:10, Roger Pau Monné wrote:
>>> On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
>>>> While from just a single Skylake system it is already clear that we
>>>> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
>>>> documented to be used for display purposes only anyway), logging this
>>>> information may still give us some reference in case of problems as well
>>>> as for future work. Additionally on the AMD side it is unclear whether
>>>> the deviation between reported and measured frequencies is because of us
>>>> not doing well, or because of nominal and actual frequencies being quite
>>>> far apart.
>>>
>>> Can you add some reference to the AMD implementation? I've looked at
>>> the PMs and haven't been able to find a description of some of the
>>> MSRs, like 0xC0010064.
>>
>> Take a look at
>>
>> https://developer.amd.com/resources/developer-guides-manuals/
>>
>> I'm unconvinced a reference needs adding here.
> 
> Do you think it would be sensible to introduce some defines for at
> least 0xC0010064? (ie: MSR_AMD_PSTATE_DEF_BASE)
> 
> I think it would make it easier to find on the manuals.

I did consider doing so at the time, but dropped the idea as we have
a few more examples where we use bare MSR numbers when they're used
just once or very rarely. What I'm not sure about is whether the
name would help finding the doc - the doc is organized by MSR number
after all.

Jan

Roger Pau Monné May 14, 2020, 4:51 p.m. UTC | #5

On Thu, May 14, 2020 at 05:50:29PM +0200, Jan Beulich wrote:
> [CAUTION - EXTERNAL EMAIL] DO NOT reply, click links, or open attachments unless you have verified the sender and know the content is safe.
> 
> On 14.05.2020 17:32, Roger Pau Monné wrote:
> > On Thu, May 14, 2020 at 03:38:18PM +0200, Jan Beulich wrote:
> >> On 14.05.2020 15:10, Roger Pau Monné wrote:
> >>> On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> >>>> While from just a single Skylake system it is already clear that we
> >>>> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> >>>> documented to be used for display purposes only anyway), logging this
> >>>> information may still give us some reference in case of problems as well
> >>>> as for future work. Additionally on the AMD side it is unclear whether
> >>>> the deviation between reported and measured frequencies is because of us
> >>>> not doing well, or because of nominal and actual frequencies being quite
> >>>> far apart.
> >>>
> >>> Can you add some reference to the AMD implementation? I've looked at
> >>> the PMs and haven't been able to find a description of some of the
> >>> MSRs, like 0xC0010064.
> >>
> >> Take a look at
> >>
> >> https://developer.amd.com/resources/developer-guides-manuals/
> >>
> >> I'm unconvinced a reference needs adding here.
> > 
> > Do you think it would be sensible to introduce some defines for at
> > least 0xC0010064? (ie: MSR_AMD_PSTATE_DEF_BASE)
> > 
> > I think it would make it easier to find on the manuals.
> 
> I did consider doing so at the time, but dropped the idea as we have
> a few more examples where we use bare MSR numbers when they're used
> just once or very rarely. What I'm not sure about is whether the
> name would help finding the doc - the doc is organized by MSR number
> after all.

I would prefer if we add names as much as possible, as I think it
makes the code easier to understand, but I can also see the point of
that being more churn as you have to end up looking at the manual to
see exactly what each MSR contains anyway.

FTR, I wasn't finding the MSR in the AMD docs because I was searching
as C0010064 when I should be instead using C001_0064.

Thanks, Roger.

Roger Pau Monné May 15, 2020, 8:32 a.m. UTC | #6

On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> While from just a single Skylake system it is already clear that we
> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> documented to be used for display purposes only anyway), logging this
> information may still give us some reference in case of problems as well
> as for future work. Additionally on the AMD side it is unclear whether
> the deviation between reported and measured frequencies is because of us
> not doing well, or because of nominal and actual frequencies being quite
> far apart.
> 
> The chosen variable naming in amd_log_freq() has pointed out a naming
> problem in rdmsr_safe(), which is being taken care of at the same time.
> Symmetrically wrmsr_safe(), being an inline function, also gets an
> unnecessary underscore dropped from one of its local variables.
> 
> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
>     reported frequency nevertheless is 2600MHz, rather than the to be
>     expected (and calibrated by both us and Linux) 2592MHz.
> 
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I have one question below about P-state limits.

> ---
> TBD: The node ID retrieval using extended leaf 1E implies it won't work
>      on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
>      which doesn't get advertised on my Fam10 box (and it's zero on all
>      processors despite there being two nodes as per the PCI device
>      map), and which isn't even documented for Fam11, Fam12, and Fam14,
>      I didn't find any other means to retrieve the node ID a CPU is
>      associated with - the NodeId register in PCI config space depends
>      on one already knowing the node ID for doing the access, as the
>      device to be used is a function of the node ID.

I there a real chance of the boost states being different between
nodes? Won't Xen explode elsewhere due to possibly diverging features
between nodes?

> 
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui
>                                                            : c->cpu_core_id);
>  }
>  
> +void amd_log_freq(const struct cpuinfo_x86 *c)
> +{
> +	unsigned int idx = 0, h;
> +	uint64_t hi, lo, val;
> +
> +	if (c->x86 < 0x10 || c->x86 > 0x19 ||
> +	    (c != &boot_cpu_data &&
> +	     (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
> +		return;
> +
> +	if (c->x86 < 0x17) {
> +		unsigned int node = 0;
> +		uint64_t nbcfg;
> +
> +		/*
> +		 * Make an attempt at determining the node ID, but assume
> +		 * symmetric setup (using node 0) if this fails.
> +		 */
> +		if (c->extended_cpuid_level >= 0x8000001e &&
> +		    cpu_has(c, X86_FEATURE_TOPOEXT)) {
> +			node = cpuid_ecx(0x8000001e) & 0xff;
> +			if (node > 7)
> +				node = 0;
> +		} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
> +			rdmsrl(0xC001100C, val);
> +			node = val & 7;
> +		}
> +
> +		/*
> +		 * Enable (and use) Extended Config Space accesses, as we
> +		 * can't be certain that MCFG is available here during boot.
> +		 */
> +		rdmsrl(MSR_AMD64_NB_CFG, nbcfg);
> +		wrmsrl(MSR_AMD64_NB_CFG,
> +		       nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT));
> +#define PCI_ECS_ADDRESS(sbdf, reg) \
> +    (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << 16))
> +
> +		for ( ; ; ) {
> +			pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4);
> +
> +			switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) {
> +			case 0x00000000:
> +			case 0xffffffff:
> +				/* No device at this SBDF. */
> +				if (!node)
> +					break;
> +				node = 0;
> +				continue;
> +
> +			default:
> +				/*
> +				 * Core Performance Boost Control, family
> +				 * dependent up to 3 bits starting at bit 2.


I would add:

"Note that boot states operate at a frequency above the base one, and
thus need to be accounted for in order to correctly fetch the nominal
frequency of the processor."

> +				 */
> +				switch (c->x86) {
> +				case 0x10: idx = 1; break;
> +				case 0x12: idx = 7; break;
> +				case 0x14: idx = 7; break;
> +				case 0x15: idx = 7; break;
> +				case 0x16: idx = 7; break;
> +				}
> +				idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf,
> +				                                     0x15c),
> +				                     0, 4) >> 2;
> +				break;
> +			}
> +			break;
> +		}
> +
> +#undef PCI_ECS_ADDRESS
> +		wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
> +	}
> +
> +	lo = 0; /* gcc may not recognize the loop having at least 5 iterations */
> +	for (h = c->x86 == 0x10 ? 5 : 8; h--; )
> +		if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
> +			break;
> +	if (!(lo >> 63))
> +		return;

Should you also take the P-state limit here into account (from MSR
0xC0010061)?

I assume the firmware could set a minimum P-state higher than the
possible ones present in the list of P-states, effectively preventing
switching to lowest-performance P-states?

The rest LGTM, Thanks.

Jan Beulich May 15, 2020, 9:08 a.m. UTC | #7

On 15.05.2020 10:32, Roger Pau Monné wrote:
> On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
>> While from just a single Skylake system it is already clear that we
>> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
>> documented to be used for display purposes only anyway), logging this
>> information may still give us some reference in case of problems as well
>> as for future work. Additionally on the AMD side it is unclear whether
>> the deviation between reported and measured frequencies is because of us
>> not doing well, or because of nominal and actual frequencies being quite
>> far apart.
>>
>> The chosen variable naming in amd_log_freq() has pointed out a naming
>> problem in rdmsr_safe(), which is being taken care of at the same time.
>> Symmetrically wrmsr_safe(), being an inline function, also gets an
>> unnecessary underscore dropped from one of its local variables.
>>
>> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
>>     reported frequency nevertheless is 2600MHz, rather than the to be
>>     expected (and calibrated by both us and Linux) 2592MHz.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, but please clarify whether this is with or without the
two suggested changes (breaking out intel_log_freq() and introducing
local variables for (uint8_t)(msrval >> NN)), or whether you
mean to leave it to me whether to make them. And if I'm to make the
change, whether you'd trust me to not screw up things, i.e. whether
I can keep you R-b in that case.

> I have one question below about P-state limits.
> 
>> ---
>> TBD: The node ID retrieval using extended leaf 1E implies it won't work
>>      on older hardware (pre-Fam15 I think). Besides the Node ID MSR,
>>      which doesn't get advertised on my Fam10 box (and it's zero on all
>>      processors despite there being two nodes as per the PCI device
>>      map), and which isn't even documented for Fam11, Fam12, and Fam14,
>>      I didn't find any other means to retrieve the node ID a CPU is
>>      associated with - the NodeId register in PCI config space depends
>>      on one already knowing the node ID for doing the access, as the
>>      device to be used is a function of the node ID.
> 
> I there a real chance of the boost states being different between
> nodes?

Probably not, but doing things properly would still have been
nice.

> Won't Xen explode elsewhere due to possibly diverging features
> between nodes?

For many features - yes, but for boost states being different I
don't think it would.

>> --- a/xen/arch/x86/cpu/amd.c
>> +++ b/xen/arch/x86/cpu/amd.c
>> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui
>>                                                            : c->cpu_core_id);
>>  }
>>  
>> +void amd_log_freq(const struct cpuinfo_x86 *c)
>> +{
>> +	unsigned int idx = 0, h;
>> +	uint64_t hi, lo, val;
>> +
>> +	if (c->x86 < 0x10 || c->x86 > 0x19 ||
>> +	    (c != &boot_cpu_data &&
>> +	     (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
>> +		return;
>> +
>> +	if (c->x86 < 0x17) {
>> +		unsigned int node = 0;
>> +		uint64_t nbcfg;
>> +
>> +		/*
>> +		 * Make an attempt at determining the node ID, but assume
>> +		 * symmetric setup (using node 0) if this fails.
>> +		 */
>> +		if (c->extended_cpuid_level >= 0x8000001e &&
>> +		    cpu_has(c, X86_FEATURE_TOPOEXT)) {
>> +			node = cpuid_ecx(0x8000001e) & 0xff;
>> +			if (node > 7)
>> +				node = 0;
>> +		} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
>> +			rdmsrl(0xC001100C, val);
>> +			node = val & 7;
>> +		}
>> +
>> +		/*
>> +		 * Enable (and use) Extended Config Space accesses, as we
>> +		 * can't be certain that MCFG is available here during boot.
>> +		 */
>> +		rdmsrl(MSR_AMD64_NB_CFG, nbcfg);
>> +		wrmsrl(MSR_AMD64_NB_CFG,
>> +		       nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT));
>> +#define PCI_ECS_ADDRESS(sbdf, reg) \
>> +    (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << 16))
>> +
>> +		for ( ; ; ) {
>> +			pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4);
>> +
>> +			switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) {
>> +			case 0x00000000:
>> +			case 0xffffffff:
>> +				/* No device at this SBDF. */
>> +				if (!node)
>> +					break;
>> +				node = 0;
>> +				continue;
>> +
>> +			default:
>> +				/*
>> +				 * Core Performance Boost Control, family
>> +				 * dependent up to 3 bits starting at bit 2.
> 
> 
> I would add:
> 
> "Note that boot states operate at a frequency above the base one, and
> thus need to be accounted for in order to correctly fetch the nominal
> frequency of the processor."

Done.

>> +				 */
>> +				switch (c->x86) {
>> +				case 0x10: idx = 1; break;
>> +				case 0x12: idx = 7; break;
>> +				case 0x14: idx = 7; break;
>> +				case 0x15: idx = 7; break;
>> +				case 0x16: idx = 7; break;
>> +				}
>> +				idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf,
>> +				                                     0x15c),
>> +				                     0, 4) >> 2;
>> +				break;
>> +			}
>> +			break;
>> +		}
>> +
>> +#undef PCI_ECS_ADDRESS
>> +		wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
>> +	}
>> +
>> +	lo = 0; /* gcc may not recognize the loop having at least 5 iterations */
>> +	for (h = c->x86 == 0x10 ? 5 : 8; h--; )
>> +		if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
>> +			break;
>> +	if (!(lo >> 63))
>> +		return;
> 
> Should you also take the P-state limit here into account (from MSR
> 0xC0010061)?
> 
> I assume the firmware could set a minimum P-state higher than the
> possible ones present in the list of P-states, effectively preventing
> switching to lowest-performance P-states?

We're not after permitted P-states here - these would matter only if
we were meaning to alter the current P-state by direct MSR accesses.
Here we're only after logging capabilities (and the P-state limits
can, aiui, in principle also change at runtime).

Jan

Roger Pau Monné May 15, 2020, 9:17 a.m. UTC | #8

On Fri, May 15, 2020 at 11:08:04AM +0200, Jan Beulich wrote:
> On 15.05.2020 10:32, Roger Pau Monné wrote:
> > On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote:
> >> While from just a single Skylake system it is already clear that we
> >> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is
> >> documented to be used for display purposes only anyway), logging this
> >> information may still give us some reference in case of problems as well
> >> as for future work. Additionally on the AMD side it is unclear whether
> >> the deviation between reported and measured frequencies is because of us
> >> not doing well, or because of nominal and actual frequencies being quite
> >> far apart.
> >>
> >> The chosen variable naming in amd_log_freq() has pointed out a naming
> >> problem in rdmsr_safe(), which is being taken care of at the same time.
> >> Symmetrically wrmsr_safe(), being an inline function, also gets an
> >> unnecessary underscore dropped from one of its local variables.
> >>
> >> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the
> >>     reported frequency nevertheless is 2600MHz, rather than the to be
> >>     expected (and calibrated by both us and Linux) 2592MHz.
> >>
> >> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks, but please clarify whether this is with or without the
> two suggested changes (breaking out intel_log_freq() and introducing
> local variables for (uint8_t)(msrval >> NN)), or whether you
> mean to leave it to me whether to make them. And if I'm to make the
> change, whether you'd trust me to not screw up things, i.e. whether
> I can keep you R-b in that case.

None of those are mandatory in order to keep the RB, just suggestions.

> >> +#undef PCI_ECS_ADDRESS
> >> +		wrmsrl(MSR_AMD64_NB_CFG, nbcfg);
> >> +	}
> >> +
> >> +	lo = 0; /* gcc may not recognize the loop having at least 5 iterations */
> >> +	for (h = c->x86 == 0x10 ? 5 : 8; h--; )
> >> +		if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63))
> >> +			break;
> >> +	if (!(lo >> 63))
> >> +		return;
> > 
> > Should you also take the P-state limit here into account (from MSR
> > 0xC0010061)?
> > 
> > I assume the firmware could set a minimum P-state higher than the
> > possible ones present in the list of P-states, effectively preventing
> > switching to lowest-performance P-states?
> 
> We're not after permitted P-states here - these would matter only if
> we were meaning to alter the current P-state by direct MSR accesses.
> Here we're only after logging capabilities (and the P-state limits
> can, aiui, in principle also change at runtime).

OK, I have to admit I'm not aware of how this is supposed to work.

One scenario I had in mind was that the firmware would set P-state
control to a value higher than the minimum one in order to prevent the
OS from switching to a lower P-state, in which case reporting the
frequency of the minimum P-state would be kind of incorrect since
switching to it won't be possible anyway. But if this is supposed to
change at runtime then reporting the lowest possible makes sense.

Thanks, Roger.

x86: retrieve and log CPU frequency information

Commit Message

Comments

Patch