
x86: strange behavior of invlpg

Message ID 573992B7.20300@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini May 16, 2016, 9:28 a.m. UTC
On 14/05/2016 11:35, Nadav Amit wrote:
> I encountered a strange phenomenon and I would appreciate your sanity check
> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
> flush.
> 
> I created a small kvm-unit-test (below) to show what I am talking about. The test
> touches 50 pages, and then either: (1) runs a full flush, (2) runs invlpg on
> an arbitrary (other) address, or (3) runs a memory barrier.
> 
> It appears that the execution time of the test is indeed determined by TLB
> misses, since the runtime of the memory barrier flavor is considerably lower.

Did you check the performance counters?  Another explanation is that 
there are no TLB misses, but CR3 writes are optimized in such a way 
that they do not incur TLB misses either.  (Disclaimer: I didn't check 
the performance counters to prove the alternative theory ;)).
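
If you want to check on the host, something along these lines should do it
(a rough, untested sketch using perf_event_open(2) and the generic
dTLB-load-misses cache event, which not every CPU exposes; error handling
elided):

/* Hypothetical sketch: count dTLB load misses around a region of
 * interest via perf_event_open(2).  Untested; error handling elided. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static void count_dtlb_misses(void (*fn)(void))
{
    struct perf_event_attr attr;
    long long misses = 0;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;

    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();                                   /* the access loop under test */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof(misses));
    printf("dTLB load misses: %lld\n", misses);
    close(fd);
}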

> What I find strange is that if I compute the net access time for tests 1 & 2,
> by deducting the time of the flushes, the time is almost identical. I am aware
> that invlpg flushes the page-walk caches, but I would still expect the invlpg
> flavor to run considerably faster than the full-flush flavor.

That's interesting.  I guess you're using EPT because I get very 
similar numbers on an Ivy Bridge laptop:

  with invlpg:        902,224,568
  with full flush:    880,103,513
  invlpg only         113,186,461
  full flushes only   100,236,620
  access net          104,454,125
  w/full flush net    779,866,893
  w/invlpg net        789,038,107

(commas added for readability).

Out of curiosity I tried making all pages global (patch after my
signature).  Both invlpg and write to CR3 become much faster, but
invlpg now is faster than full flush, even though in theory it
should be the opposite...

  with invlpg:        223,079,661
  with full flush:    294,280,788
  invlpg only         126,236,334
  full flushes only   107,614,525
  access net           90,830,503
  w/full flush net    186,666,263
  w/invlpg net         96,843,327

Thanks for the interesting test!

Paolo


> Am I missing something?
> 
> 
> On my Haswell EP I get the following results:
> 
> with invlpg:        948965249
> with full flush:    1047927009
> invlpg only         127682028
> full flushes only   224055273
> access net          107691277	--> considerably lower than w/flushes
> w/full flush net    823871736
> w/invlpg net        821283221	--> almost identical to full-flush net
> 
> ---
> 
> 
> #include "libcflat.h"
> #include "fwcfg.h"
> #include "vm.h"
> #include "smp.h"
> 
> #define N_PAGES	(50)
> #define ITERATIONS (500000)
> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
> 
> int main(void)
> {
>     void *another_addr = (void*)0x50f9000;
>     int i, j;
>     unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
> 		  t_access;
>     unsigned long cr3;
>     char v = 0;
> 
>     setup_vm();
> 
>     cr3 = read_cr3();
> 
>     t_start = rdtsc();
>     for (i = 0; i < ITERATIONS; i++) {
>         invlpg(another_addr);
>         for (j = 0; j < N_PAGES; j++)
>             v = buf[PAGE_SIZE * j];
>     }
>     t_single = rdtsc() - t_start;
>     printf("with invlpg:        %lu\n", t_single);
> 
>     t_start = rdtsc();
>     for (i = 0; i < ITERATIONS; i++) {
>         write_cr3(cr3);
>         for (j = 0; j < N_PAGES; j++)
>             v = buf[PAGE_SIZE * j];
>     }
>     t_full = rdtsc() - t_start;
>     printf("with full flush:    %lu\n", t_full);
> 
>     t_start = rdtsc();
>     for (i = 0; i < ITERATIONS; i++)
>         invlpg(another_addr);
>     t_single_only = rdtsc() - t_start;
>     printf("invlpg only         %lu\n", t_single_only);
> 
>     t_start = rdtsc();
>     for (i = 0; i < ITERATIONS; i++)
>         write_cr3(cr3);
>     t_full_only = rdtsc() - t_start;
>     printf("full flushes only   %lu\n", t_full_only);
> 
>     t_start = rdtsc();
>     for (i = 0; i < ITERATIONS; i++) {
>         for (j = 0; j < N_PAGES; j++)
>             v = buf[PAGE_SIZE * j];
>         mb();
>     }
>     t_access = rdtsc() - t_start;
>     printf("access net          %lu\n", t_access);
>     printf("w/full flush net    %lu\n", t_full - t_full_only);
>     printf("w/invlpg net        %lu\n", t_single - t_single_only);
> 
>     (void)v;
>     return 0;
> }
> 

Comments

Nadav Amit May 16, 2016, 4:51 p.m. UTC | #1
Thanks! I appreciate it.

I think your experiment with global paging just corroborates that the
latency is caused by TLB misses. I measured TLB misses (and especially STLB
misses) in other experiments but not in this one. I will run some more
experiments, specifically to test how AMD behaves.

I should note this is a byproduct of a study I did, and it is not as if I was
looking for strange behaviors (no more validation papers for me!).

The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
it is a CPU “feature”. Once we understand it, at the very least it may affect
the recommended value of “tlb_single_page_flush_ceiling”, which controls when
the kernel performs a full TLB flush vs. selective flushes.
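
For reference, the kernel-side decision is roughly the following (a
simplified sketch of the logic in arch/x86/mm/tlb.c, where the default
ceiling was 33 pages at the time; exact code varies by kernel version, and
the kernel-internal helpers are assumed):

/* Simplified sketch of the x86 flush-size heuristic; local_flush_tlb()
 * and __flush_tlb_one() are kernel-internal helpers. */
unsigned long tlb_single_page_flush_ceiling = 33;

static void flush_tlb_range_sketch(unsigned long start, unsigned long end)
{
    unsigned long addr;

    if (((end - start) >> PAGE_SHIFT) > tlb_single_page_flush_ceiling) {
        local_flush_tlb();               /* full flush: reload CR3 */
    } else {
        for (addr = start; addr < end; addr += PAGE_SIZE)
            __flush_tlb_one(addr);       /* one invlpg per page */
    }
}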

Nadav

Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 14/05/2016 11:35, Nadav Amit wrote:
>> I encountered a strange phenomenon and I would appreciate your sanity check
>> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
>> flush.
>> 
>> I created a small kvm-unit-test (below) to show what I am talking about. The test
>> touches 50 pages, and then either: (1) runs a full flush, (2) runs invlpg on
>> an arbitrary (other) address, or (3) runs a memory barrier.
>> 
>> It appears that the execution time of the test is indeed determined by TLB
>> misses, since the runtime of the memory barrier flavor is considerably lower.
> 
> Did you check the performance counters?  Another explanation is that 
> there are no TLB misses, but CR3 writes are optimized in such a way 
> that they do not incur TLB misses either.  (Disclaimer: I didn't check 
> the performance counters to prove the alternative theory ;)).
> 
>> What I find strange is that if I compute the net access time for tests 1 & 2,
>> by deducting the time of the flushes, the time is almost identical. I am aware
>> that invlpg flushes the page-walk caches, but I would still expect the invlpg
>> flavor to run considerably faster than the full-flush flavor.
> 
> That's interesting.  I guess you're using EPT because I get very 
> similar numbers on an Ivy Bridge laptop:
> 
>  with invlpg:        902,224,568
>  with full flush:    880,103,513
>  invlpg only         113,186,461
>  full flushes only   100,236,620
>  access net          104,454,125
>  w/full flush net    779,866,893
>  w/invlpg net        789,038,107
> 
> (commas added for readability).
> 
> Out of curiosity I tried making all pages global (patch after my
> signature).  Both invlpg and write to CR3 become much faster, but
> invlpg now is faster than full flush, even though in theory it
> should be the opposite...
> 
>  with invlpg:        223,079,661
>  with full flush:    294,280,788
>  invlpg only         126,236,334
>  full flushes only   107,614,525
>  access net           90,830,503
>  w/full flush net    186,666,263
>  w/invlpg net         96,843,327
> 
> Thanks for the interesting test!
> 
> Paolo
> 
> diff --git a/lib/x86/vm.c b/lib/x86/vm.c
> index 7ce7bbc..3b9b81a 100644
> --- a/lib/x86/vm.c
> +++ b/lib/x86/vm.c
> @@ -2,6 +2,7 @@
> #include "vm.h"
> #include "libcflat.h"
> 
> +#define PTE_GLOBAL      256
> #define PAGE_SIZE 4096ul
> #ifdef __x86_64__
> #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
> @@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
> 				  void *virt)
> {
>     return install_pte(cr3, 2, virt,
> -		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
> +		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
> }
> 
> unsigned long *install_page(unsigned long *cr3,
> 			    unsigned long phys,
> 			    void *virt)
> {
> -    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
> +    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
> }
> 
> 
> 
>> Am I missing something?
>> 
>> 
>> On my Haswell EP I get the following results:
>> 
>> with invlpg:        948965249
>> with full flush:    1047927009
>> invlpg only         127682028
>> full flushes only   224055273
>> access net          107691277	--> considerably lower than w/flushes
>> w/full flush net    823871736
>> w/invlpg net        821283221	--> almost identical to full-flush net
>> 
>> ---
>> 
>> 
>> #include "libcflat.h"
>> #include "fwcfg.h"
>> #include "vm.h"
>> #include "smp.h"
>> 
>> #define N_PAGES	(50)
>> #define ITERATIONS (500000)
>> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>> 
>> int main(void)
>> {
>>    void *another_addr = (void*)0x50f9000;
>>    int i, j;
>>    unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>> 		  t_access;
>>    unsigned long cr3;
>>    char v = 0;
>> 
>>    setup_vm();
>> 
>>    cr3 = read_cr3();
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>>        invlpg(another_addr);
>> 	for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>>    }
>>    t_single = rdtsc() - t_start;
>>    printf("with invlpg:        %lu\n", t_single);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>>    	write_cr3(cr3);
>> 	for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>>    }
>>    t_full = rdtsc() - t_start;
>>    printf("with full flush:    %lu\n", t_full);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++)
>>         invlpg(another_addr);
>>    t_single_only = rdtsc() - t_start;
>>    printf("invlpg only         %lu\n", t_single_only);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++)
>>    	 write_cr3(cr3);
>>    t_full_only = rdtsc() - t_start;
>>    printf("full flushes only   %lu\n", t_full_only);
>> 
>>    t_start = rdtsc();
>>    for (i = 0; i < ITERATIONS; i++) {
>> 	for (j = 0; j < N_PAGES; j++)
>>            v = buf[PAGE_SIZE * j];
>> 	mb();
>>    }
>>    t_access = rdtsc()-t_start;
>>    printf("access net          %lu\n", t_access);
>>    printf("w/full flush net    %lu\n", t_full - t_full_only);
>>    printf("w/invlpg net        %lu\n", t_single - t_single_only);
>> 
>>    (void)v;
>>    return 0;
>> }


Paolo Bonzini May 16, 2016, 4:56 p.m. UTC | #2
On 16/05/2016 18:51, Nadav Amit wrote:
> Thanks! I appreciate it.
> 
> I think your experiment with global paging just corroborates that the
> latency is caused by TLB misses. I measured TLB misses (and especially STLB
> misses) in other experiments but not in this one. I will run some more
> experiments, specifically to test how AMD behaves.

I'm curious about AMD too now...

  with invlpg:        285,639,427
  with full flush:    584,419,299
  invlpg only          70,681,128
  full flushes only   265,238,766
  access net          242,538,804
  w/full flush net    319,180,533
  w/invlpg net        214,958,299

Roughly the same with and without pte.g.  So AMD behaves as it should.

> I should note this is a byproduct of a study I did, and it is not as if I was
> looking for strange behaviors (no more validation papers for me!).
> 
> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
> it is a CPU “feature”. Once we understand it, at the very least it may affect
> the recommended value of “tlb_single_page_flush_ceiling”, which controls when
> the kernel performs a full TLB flush vs. selective flushes.

Do you have a kernel module to reproduce the test on bare metal? (/me is
lazy).

Paolo
Nadav Amit May 16, 2016, 7:39 p.m. UTC | #3
Argh... I don’t get the same behavior in the guest with the module test.
I’ll need some more time to figure it out.
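
(The bare-metal module is essentially the unit test above wrapped in
module_init; a rough, hypothetical sketch, untested, with interrupt
masking elided:)

/* Hypothetical bare-metal repro sketch (untested). */
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <asm/msr.h>

#define N_PAGES    50
#define ITERATIONS 500000

static int __init invlpg_test_init(void)
{
    volatile char *buf = vmalloc(N_PAGES * PAGE_SIZE);
    void *other = vmalloc(PAGE_SIZE);    /* unrelated address to invlpg */
    char v = 0;
    u64 t0, t1;
    int i, j;

    if (!buf || !other) {
        vfree((void *)buf);              /* vfree(NULL) is a no-op */
        vfree(other);
        return -ENOMEM;
    }

    preempt_disable();
    t0 = rdtsc();
    for (i = 0; i < ITERATIONS; i++) {
        asm volatile("invlpg (%0)" : : "r" (other) : "memory");
        for (j = 0; j < N_PAGES; j++)
            v = buf[PAGE_SIZE * j];
    }
    t1 = rdtsc();
    preempt_enable();

    pr_info("with invlpg:        %llu\n", t1 - t0);
    (void)v;
    vfree((void *)buf);
    vfree(other);
    return 0;
}

static void __exit invlpg_test_exit(void) { }

module_init(invlpg_test_init);
module_exit(invlpg_test_exit);
MODULE_LICENSE("GPL");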

Just a small comment regarding your “global” test: you forgot to set
CR4.PGE.
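
Something like this in the test setup, I mean (a sketch, assuming the
read_cr4()/write_cr4() helpers that kvm-unit-tests provides):

    /* CR4.PGE is bit 7; without it, PTE.G is ignored. */
    #define X86_CR4_PGE (1ul << 7)
    write_cr4(read_cr4() | X86_CR4_PGE);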

Once I set it, I get reasonable numbers (excluding the invlpg flavor).

with invlpg:        964431529
with full flush:    268190767
invlpg only         126114041
full flushes only   185971818
access net          111229828
w/full flush net    82218949		--> similar to access net
w/invlpg net        838317488

I’ll be back when I have more understanding of the situation.

Thanks,
Nadav


Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>> 
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>> misses) in other experiments but not in this one. I will run some more
>> experiments, specifically to test how AMD behaves.
> 
> I'm curious about AMD too now...
> 
>  with invlpg:        285,639,427
>  with full flush:    584,419,299
>  invlpg only          70,681,128
>  full flushes only   265,238,766
>  access net          242,538,804
>  w/full flush net    319,180,533
>  w/invlpg net        214,958,299
> 
> Roughly the same with and without pte.g.  So AMD behaves as it should.
> 
>> I should note this is a byproduct of a study I did, and it is not as if I was
>> looking for strange behaviors (no more validation papers for me!).
>> 
>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>> it is a CPU “feature”. Once we understand it, at the very least it may affect
>> the recommended value of “tlb_single_page_flush_ceiling”, which controls when
>> the kernel performs a full TLB flush vs. selective flushes.
> 
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).
> 
> Paolo


Nadav Amit May 17, 2016, 4:27 a.m. UTC | #4
Ok, it seems to be related to the guest/host page sizes.

It seems that if you run 2MB pages in a VM on top of 4KB pages in
the host, any invlpg in the VM causes all 2MB guest pages to be flushed.

I’ll try to find time to make sure there is nothing else to it.

Thanks for the assistance, and let me know if you need my hacky tests.

The measurements below are of VM and bare-metal:


	Host	Guest	Full Flush	Selective Flush
	PGsize	PGsize	(dTLB misses)	(dTLB misses)
	-----------------------------------------------
VM	4KB	4KB	103,008,052	93,172
	4KB	2MB	102,022,557	102,038,021
	2MB	4KB	103,005,083	2,888
	2MB	2MB	4,002,969	2,556
HOST	4KB	-	50,000,572	789
	2MB	-	1,000,454	537

Nadav Amit <nadav.amit@gmail.com> wrote:

> Argh... I don’t get the same behavior in the guest with the module test.
> I’ll need some more time to figure it out.
> 
> Just a small comment regarding your “global” test: you forgot to set
> CR4.PGE.
> 
> Once I set it, I get reasonable numbers (excluding the invlpg flavor).
> 
> with invlpg:        964431529
> with full flush:    268190767
> invlpg only         126114041
> full flushes only   185971818
> access net          111229828
> w/full flush net    82218949		--> similar to access net
> w/invlpg net        838317488
> 
> I’ll be back when I have more understanding of the situation.
> 
> Thanks,
> Nadav
> 
> 
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 16/05/2016 18:51, Nadav Amit wrote:
>>> Thanks! I appreciate it.
>>> 
>>> I think your experiment with global paging just corroborates that the
>>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>>> misses) in other experiments but not in this one. I will run some more
>>> experiments, specifically to test how AMD behaves.
>> 
>> I'm curious about AMD too now...
>> 
>> with invlpg:        285,639,427
>> with full flush:    584,419,299
>> invlpg only          70,681,128
>> full flushes only   265,238,766
>> access net          242,538,804
>> w/full flush net    319,180,533
>> w/invlpg net        214,958,299
>> 
>> Roughly the same with and without pte.g.  So AMD behaves as it should.
>> 
>>> I should note this is a byproduct of a study I did, and it is not as if I was
>>> looking for strange behaviors (no more validation papers for me!).
>>> 
>>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>>> it is a CPU “feature”. Once we understand it, at the very least it may affect
>>> the recommended value of “tlb_single_page_flush_ceiling”, which controls when
>>> the kernel performs a full TLB flush vs. selective flushes.
>> 
>> Do you have a kernel module to reproduce the test on bare metal? (/me is
>> lazy).
>> 
>> Paolo


Nadav Amit Feb. 15, 2018, 10:43 p.m. UTC | #5
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>> 
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>> misses) in other experiments but not in this one. I will run some more
>> experiments, specifically to test how AMD behaves.
> 
> I'm curious about AMD too now...
> 
>  with invlpg:        285,639,427
>  with full flush:    584,419,299
>  invlpg only          70,681,128
>  full flushes only   265,238,766
>  access net          242,538,804
>  w/full flush net    319,180,533
>  w/invlpg net        214,958,299
> 
> Roughly the same with and without pte.g.  So AMD behaves as it should.
> 
>> I should note this is a byproduct of a study I did, and it is not as if I was
>> looking for strange behaviors (no more validation papers for me!).
>> 
>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>> it is a CPU “feature”. Once we understand it, at the very least it may affect
>> the recommended value of “tlb_single_page_flush_ceiling”, which controls when
>> the kernel performs a full TLB flush vs. selective flushes.
> 
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).

It occurred to me that I never told you what eventually turned out to be
the issue. (Yes, I know it is a very old thread, but you may still be
interested.)

It turns out that Intel has something called “page fracturing”: once the
TLB caches a translation that came from a 2MB guest page backed by 4KB
host pages, INVLPG ends up flushing the entire TLB.

I guess they need to do it to follow SDM 4.10.4.1 (regarding pages larger
than 4 KBytes): “The INVLPG instruction and page faults provide the same
assurances that they provide when a single TLB entry is used: they
invalidate all TLB entries corresponding to the translation specified by
the paging structures.” Since a fractured 2MB translation may be cached
as many separate 4KB entries, flushing everything is the simplest way for
the hardware to honor that guarantee.
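
To put a number on it (just an illustration):

/* One 2MB guest translation fractured over 4KB host pages can be cached
 * as up to 512 distinct 4KB TLB entries, all "corresponding to" the one
 * translation that INVLPG must invalidate: */
#define GUEST_LPAGE_SIZE (2ul << 20)
#define HOST_PAGE_SIZE   (4ul << 10)
unsigned long fractured_entries = GUEST_LPAGE_SIZE / HOST_PAGE_SIZE; /* 512 */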

Thanks again for your help,
Nadav

Patch

diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 7ce7bbc..3b9b81a 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -2,6 +2,7 @@ 
 #include "vm.h"
 #include "libcflat.h"
 
+#define PTE_GLOBAL      256
 #define PAGE_SIZE 4096ul
 #ifdef __x86_64__
 #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
@@ -106,14 +107,14 @@  unsigned long *install_large_page(unsigned long *cr3,
 				  void *virt)
 {
     return install_pte(cr3, 2, virt,
-		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
+		       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
 }
 
 unsigned long *install_page(unsigned long *cr3,
 			    unsigned long phys,
 			    void *virt)
 {
-    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
+    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
 }