[7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx

Message ID: 20201014083300.19077-8-ankur.a.arora@oracle.com (mailing list archive)
State: New, archived
Series: Use uncached writes while clearing gigantic pages

Commit Message

Ankur Arora Oct. 14, 2020, 8:32 a.m. UTC
System:           Oracle X6-2
CPU:              2 nodes * 10 cores/node * 2 threads/core
		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory:           256 GB evenly split between nodes
Microcode:        0xb00002e
scaling_governor: performance
L3 size:          25MB
intel_pstate/no_turbo: 1

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
(X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size       BW        (   pstdev)          BW   (   pstdev)

     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
   1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
   4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%
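
The per-size rows above correspond to invocations along these lines (the
x86-64-movnt memset variant is added to perf bench earlier in this series;
flags as documented in perf-bench(1)):

$ perf bench mem memset -f x86-64-stosb -s 4096MB -l 1
$ perf bench mem memset -f x86-64-movnt -s 4096MB -l 1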

The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.

$ cat pf-test.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <linux/mman.h>

 #define HPAGE_BITS 30
 int main(int argc, char **argv) {
	unsigned long len = atoi(argv[1]); /* In GB */
	unsigned long offset = 0;
	unsigned long i, numpages;
	char *base;

	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS |
	            MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch one byte in each 1GB page to fault it in. */
	for (i = 0; i < numpages; i++) {
		*((volatile char *)base + offset) = *(base + offset);
		offset += 1UL << HPAGE_BITS;
	}

	return 0;
 }
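
Note that the mmap() above needs 1GB hugepages reserved up front or it
fails outright. For instance (assuming a kernel with 1GB hugepage support;
128 pages for the 128GB run used here):

# echo 128 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

or 'hugepagesz=1G hugepages=128' on the kernel command line.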

The specific test is for a 128GB region, but this is a single-threaded
O(n) workload, so the exact region size is not material.

Page-clearing throughput with clear_page_erms(): 3.72 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    74,799,496,556      cpu-cycles                #    2.176 GHz                      ( +-  2.22% )  (29.41%)
     1,474,615,023      instructions              #    0.02  insn per cycle           ( +-  0.23% )  (35.29%)
     2,148,580,131      cache-references          #   62.502 M/sec                    ( +-  0.02% )  (35.29%)
        71,736,985      cache-misses              #    3.339 % of all cache refs      ( +-  0.94% )  (35.29%)
       433,713,165      branch-instructions       #   12.617 M/sec                    ( +-  0.15% )  (35.30%)
         1,008,251      branch-misses             #    0.23% of all branches          ( +-  1.88% )  (35.30%)
     3,406,821,966      bus-cycles                #   99.104 M/sec                    ( +-  2.22% )  (23.53%)
     2,156,059,110      L1-dcache-load-misses     #  445.35% of all L1-dcache accesses  ( +-  0.01% )  (23.53%)
       484,128,243      L1-dcache-loads           #   14.083 M/sec                    ( +-  0.22% )  (23.53%)
           944,216      LLC-loads                 #    0.027 M/sec                    ( +-  7.41% )  (23.53%)
           537,989      LLC-load-misses           #   56.98% of all LL-cache accesses  ( +- 13.64% )  (23.53%)
     2,150,138,476      LLC-stores                #   62.547 M/sec                    ( +-  0.01% )  (11.76%)
        69,598,760      LLC-store-misses          #    2.025 M/sec                    ( +-  0.47% )  (11.76%)
       483,923,875      dTLB-loads                #   14.077 M/sec                    ( +-  0.21% )  (17.64%)
             1,892      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 30.63% )  (23.53%)
     4,799,154,980      dTLB-stores               #  139.606 M/sec                    ( +-  0.03% )  (23.53%)
                90      dTLB-store-misses         #    0.003 K/sec                    ( +- 35.92% )  (23.53%)

            34.377 +- 0.760 seconds time elapsed  ( +-  2.21% )

Page-clearing throughput with clear_page_nt(): 11.78 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    23,699,446,603      cpu-cycles                #    2.182 GHz                      ( +-  0.01% )  (23.53%)
    24,794,548,512      instructions              #    1.05  insn per cycle           ( +-  0.00% )  (29.41%)
           432,775      cache-references          #    0.040 M/sec                    ( +-  3.96% )  (29.41%)
            75,580      cache-misses              #   17.464 % of all cache refs      ( +- 51.42% )  (29.41%)
     2,492,858,290      branch-instructions       #  229.475 M/sec                    ( +-  0.00% )  (29.42%)
        34,016,826      branch-misses             #    1.36% of all branches          ( +-  0.04% )  (29.42%)
     1,078,468,643      bus-cycles                #   99.276 M/sec                    ( +-  0.01% )  (23.53%)
           717,228      L1-dcache-load-misses     #    0.20% of all L1-dcache accesses  ( +-  3.77% )  (23.53%)
       351,999,535      L1-dcache-loads           #   32.403 M/sec                    ( +-  0.04% )  (23.53%)
            75,988      LLC-loads                 #    0.007 M/sec                    ( +-  4.20% )  (23.53%)
            24,503      LLC-load-misses           #   32.25% of all LL-cache accesses  ( +- 53.30% )  (23.53%)
            57,283      LLC-stores                #    0.005 M/sec                    ( +-  2.15% )  (11.76%)
            19,738      LLC-store-misses          #    0.002 M/sec                    ( +- 46.55% )  (11.76%)
       351,836,498      dTLB-loads                #   32.388 M/sec                    ( +-  0.04% )  (17.65%)
             1,171      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 42.68% )  (23.53%)
    17,385,579,725      dTLB-stores               # 1600.392 M/sec                    ( +-  0.00% )  (23.53%)
               200      dTLB-store-misses         #    0.018 K/sec                    ( +- 10.63% )  (23.53%)

         10.863678 +- 0.000804 seconds time elapsed  ( +-  0.01% )

L1-dcache-load-misses (L1D.REPLACEMENT) is substantially lower, which
suggests that, as expected, we aren't doing write-allocate or RFO.

Note that the IPC and instruction counts etc. are quite different, but
that's just an artifact of switching from a single 'REP; STOSB' per
PAGE_SIZE region to a MOVNTI loop (sketched below).
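
For reference, clear_page_nt() itself is introduced earlier in this series
(in assembly); in rough C terms it amounts to something like the following
sketch (illustrative only, not the actual patch):

 static void clear_page_nt_sketch(void *page)
 {
	unsigned long *p = page;
	unsigned long *end = p + 4096 / sizeof(*p);

	/* 64-bit non-temporal stores: bypass the cache hierarchy. */
	for (; p < end; p++)
		asm volatile("movnti %1, %0" : "=m" (*p) : "r" (0UL));

	/* Order the weakly-ordered NT stores against later accesses. */
	asm volatile("sfence" ::: "memory");
 }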

The page-clearing BW is substantially higher (~100% or more), so enable
X86_FEATURE_NT_GOOD for Intel Broadwellx.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/intel.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Ingo Molnar Oct. 14, 2020, 3:31 p.m. UTC | #1
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> System:           Oracle X6-2
> CPU:              2 nodes * 10 cores/node * 2 threads/core
> 		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
> Memory:           256 GB evenly split between nodes
> Microcode:        0xb00002e
> scaling_governor: performance
> L3 size:          25MB
> intel_pstate/no_turbo: 1
> 
> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
> 
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
>               -----------------------   -----------------------     -------
>      size       BW        (   pstdev)          BW   (   pstdev)
> 
>      16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
>     128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
>    1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
>    4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

> +	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
> +		set_cpu_cap(c, X86_FEATURE_NT_GOOD);

So while I agree with how you've done careful measurements to isolate bad 
microarchitectures where non-temporal stores are slow, I do think this 
approach of opt-in doesn't scale and is hard to maintain.

Instead I'd suggest enabling this by default everywhere, and creating a 
X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.

This means that with new microarchitectures we'd get automatic enablement, 
and hopefully chip testing would identify cases where performance isn't as 
good.

I.e. the 'trust but verify' method.

Thanks,

	Ingo
Ankur Arora Oct. 14, 2020, 7:23 p.m. UTC | #2
On 2020-10-14 8:31 a.m., Ingo Molnar wrote:
> 
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
>> System:           Oracle X6-2
>> CPU:              2 nodes * 10 cores/node * 2 threads/core
>> 		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
>> Memory:           256 GB evenly split between nodes
>> Microcode:        0xb00002e
>> scaling_governor: performance
>> L3 size:          25MB
>> intel_pstate/no_turbo: 1
>>
>> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
>>
>>                x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
>>                -----------------------   -----------------------     -------
>>       size       BW        (   pstdev)          BW   (   pstdev)
>>
>>       16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
>>      128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
>>     1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
>>     4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%
> 
>> +	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
>> +		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
> 
> So while I agree with how you've done careful measurements to isolate bad
> microarchitectures where non-temporal stores are slow, I do think this
> approach of opt-in doesn't scale and is hard to maintain.
> 
> Instead I'd suggest enabling this by default everywhere, and creating a
> X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.
Okay, some kind of quirk table is a great idea. It also means there's a
single place for keeping this rather than it being scattered all over
the code.
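
Something like this, perhaps (untested sketch; beyond X86_FEATURE_NT_BAD
itself, the names and table contents below are purely illustrative):

 /* Opt-out quirk table: list only parts where MOVNT is slow. */
 static const struct x86_cpu_id nt_bad_cpus[] = {	/* asm/cpu_device_id.h */
	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, NULL),	/* illustrative entry */
	{}
 };

 static void init_nt_quirks(struct cpuinfo_x86 *c)
 {
	/* Default to NT_GOOD unless the CPU is in the bad list. */
	if (!x86_match_cpu(nt_bad_cpus))
		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
 }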

That also simplifies my handling of features like X86_FEATURE_CLZERO.
I was concerned that, if you squint a bit, it looks like an alias for
X86_FEATURE_NT_GOOD, and that seemed ugly.

> 
> This means that with new microarchitectures we'd get automatic enablement,
> and hopefully chip testing would identify cases where performance isn't as
> good.
Makes sense to me. A first class citizen, as it were...

Thanks for reviewing btw.

Ankur

Patch

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..161028c1dee0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -662,6 +662,8 @@  static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors