Message ID: 573992B7.20300@redhat.com (mailing list archive)
State: New, archived
Thanks! I appreciate it.

I think your experiment with global paging just corroborates that the
latency is caused by TLB misses. I measured TLB misses (and especially STLB
misses) in other experiments, but not in this one. I will run some more
experiments, specifically to test how AMD behaves.

I should note this is a byproduct of a study I did; it is not as if I was
looking for strange behaviors (no more validation papers for me!).

The strangest thing is that on bare metal I don't see this phenomenon - I
doubt it is a CPU "feature". Once we understand it, at the very least it may
affect the recommended value of "tlb_single_page_flush_ceiling", which
controls when the kernel performs a full TLB flush vs. selective flushes.

Nadav

Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 14/05/2016 11:35, Nadav Amit wrote:
>> I encountered a strange phenomenon and I would appreciate your sanity
>> check and opinion. It looks as if 'invlpg' that runs in a VM causes a
>> very broad flush.
>>
>> I created a small kvm-unit-test (below) to show what I am talking about.
>> The test touches 50 pages, and then either: (1) runs a full flush,
>> (2) runs invlpg on an arbitrary (other) address, or (3) runs a memory
>> barrier.
>>
>> It appears that the execution time of the test is indeed determined by
>> TLB misses, since the runtime of the memory-barrier flavor is
>> considerably lower.
>
> Did you check the performance counters? Another explanation is that
> there are no TLB misses, but CR3 writes are optimized in such a way
> that they do not incur TLB misses either. (Disclaimer: I didn't check
> the performance counters to prove the alternative theory ;)).
>
>> What I find strange is that if I compute the net access time for tests
>> 1 & 2, by deducting the time of the flushes, the times are almost
>> identical. I am aware that invlpg flushes the page-walk caches, but I
>> would still expect the invlpg flavor to run considerably faster than
>> the full-flush flavor.
>
> That's interesting. I guess you're using EPT, because I get very
> similar numbers on an Ivy Bridge laptop:
>
> with invlpg:      902,224,568
> with full flush:  880,103,513
> invlpg only       113,186,461
> full flushes only 100,236,620
> access net        104,454,125
> w/full flush net  779,866,893
> w/invlpg net      789,038,107
>
> (commas added for readability).
>
> Out of curiosity I tried making all pages global (patch after my
> signature). Both invlpg and the write to CR3 become much faster, but
> invlpg is now faster than full flush, even though in theory it
> should be the opposite...
>
> with invlpg:      223,079,661
> with full flush:  294,280,788
> invlpg only       126,236,334
> full flushes only 107,614,525
> access net         90,830,503
> w/full flush net  186,666,263
> w/invlpg net       96,843,327
>
> Thanks for the interesting test!
>
> Paolo
>
> diff --git a/lib/x86/vm.c b/lib/x86/vm.c
> index 7ce7bbc..3b9b81a 100644
> --- a/lib/x86/vm.c
> +++ b/lib/x86/vm.c
> @@ -2,6 +2,7 @@
>  #include "vm.h"
>  #include "libcflat.h"
>
> +#define PTE_GLOBAL 256
>  #define PAGE_SIZE 4096ul
>  #ifdef __x86_64__
>  #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
> @@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
>  			    void *virt)
>  {
>      return install_pte(cr3, 2, virt,
> -                       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
> +                       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
>  }
>
>  unsigned long *install_page(unsigned long *cr3,
>  			    unsigned long phys,
>  			    void *virt)
>  {
> -    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
> +    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
>  }
>
>> Am I missing something?
>>
>> On my Haswell EP I get the following results:
>>
>> with invlpg:       948965249
>> with full flush:  1047927009
>> invlpg only        127682028
>> full flushes only  224055273
>> access net         107691277  --> considerably lower than w/flushes
>> w/full flush net   823871736
>> w/invlpg net       821283221  --> almost identical to full-flush net
>>
>> ---
>>
>> #include "libcflat.h"
>> #include "fwcfg.h"
>> #include "vm.h"
>> #include "smp.h"
>>
>> #define N_PAGES (50)
>> #define ITERATIONS (500000)
>> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>>
>> int main(void)
>> {
>> 	void *another_addr = (void *)0x50f9000;
>> 	int i, j;
>> 	unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>> 		      t_access;
>> 	unsigned long cr3;
>> 	char v = 0;
>>
>> 	setup_vm();
>>
>> 	cr3 = read_cr3();
>>
>> 	/* (1) invlpg of an unrelated address before each round of accesses */
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		invlpg(another_addr);
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 	}
>> 	t_single = rdtsc() - t_start;
>> 	printf("with invlpg: %lu\n", t_single);
>>
>> 	/* (2) full TLB flush (CR3 reload) before each round of accesses */
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		write_cr3(cr3);
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 	}
>> 	t_full = rdtsc() - t_start;
>> 	printf("with full flush: %lu\n", t_full);
>>
>> 	/* cost of the flushes alone, to compute the net access times */
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++)
>> 		invlpg(another_addr);
>> 	t_single_only = rdtsc() - t_start;
>> 	printf("invlpg only %lu\n", t_single_only);
>>
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++)
>> 		write_cr3(cr3);
>> 	t_full_only = rdtsc() - t_start;
>> 	printf("full flushes only %lu\n", t_full_only);
>>
>> 	/* (3) no flush at all, just a memory barrier */
>> 	t_start = rdtsc();
>> 	for (i = 0; i < ITERATIONS; i++) {
>> 		for (j = 0; j < N_PAGES; j++)
>> 			v = buf[PAGE_SIZE * j];
>> 		mb();
>> 	}
>> 	t_access = rdtsc() - t_start;
>> 	printf("access net %lu\n", t_access);
>> 	printf("w/full flush net %lu\n", t_full - t_full_only);
>> 	printf("w/invlpg net %lu\n", t_single - t_single_only);
>>
>> 	(void)v;
>> 	return 0;
>> }
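Paolo's performance-counter question can be settled directly on bare metal
by wrapping the access loop in a dTLB-miss counter via perf_event_open(2).
Below is a minimal userspace sketch, not part of the original test: it uses
the generic PERF_TYPE_HW_CACHE dTLB read-miss event, and a plain
`perf stat -e dTLB-load-misses ./test` would report the same number without
any code changes.

/* Sketch: count dTLB load misses around the access loop on bare metal.
 * Assumes a userspace port of the test loop; buffer and loop shapes
 * mirror the kvm-unit-test above but are otherwise illustrative. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#define N_PAGES 50
#define PAGE_SIZE 4096
#define ITERATIONS 500000

static volatile char buf[N_PAGES * PAGE_SIZE]
	__attribute__ ((aligned (PAGE_SIZE)));

int main(void)
{
	struct perf_event_attr attr;
	long long count;
	volatile char v;
	int fd, i, j;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HW_CACHE;
	attr.size = sizeof(attr);
	/* generic encoding: cache-id | (op << 8) | (result << 16) */
	attr.config = PERF_COUNT_HW_CACHE_DTLB |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	for (i = 0; i < ITERATIONS; i++)
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	read(fd, &count, sizeof(count));
	printf("dTLB load misses: %lld\n", count);
	(void)v;
	return 0;
}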
On 16/05/2016 18:51, Nadav Amit wrote:
> Thanks! I appreciate it.
>
> I think your experiment with global paging just corroborates that the
> latency is caused by TLB misses. I measured TLB misses (and especially
> STLB misses) in other experiments, but not in this one. I will run some
> more experiments, specifically to test how AMD behaves.

I'm curious about AMD too now...

with invlpg:      285,639,427
with full flush:  584,419,299
invlpg only        70,681,128
full flushes only 265,238,766
access net        242,538,804
w/full flush net  319,180,533
w/invlpg net      214,958,299

Roughly the same with and without pte.g. So AMD behaves as it should.

> I should note this is a byproduct of a study I did; it is not as if I was
> looking for strange behaviors (no more validation papers for me!).
>
> The strangest thing is that on bare metal I don't see this phenomenon - I
> doubt it is a CPU "feature". Once we understand it, at the very least it
> may affect the recommended value of "tlb_single_page_flush_ceiling", which
> controls when the kernel performs a full TLB flush vs. selective flushes.

Do you have a kernel module to reproduce the test on bare metal? (/me is
lazy).

Paolo
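For readers without Nadav's module at hand, the bare-metal loop fits in a
trivial kernel module. A rough sketch of the "with invlpg" flavor only: the
tlbtest_* names are made up, inline asm is used instead of arch helpers,
and the iteration count is reduced so preemption is not disabled for too
long. The other flavors would follow the same pattern.

/* Sketch of a kernel module reproducing the "with invlpg" flavor on
 * bare metal; vmalloc() gives 4KB-backed pages. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/vmalloc.h>

#define N_PAGES 50
#define ITERATIONS 100000	/* fewer than the guest test: preempt is off */

static inline unsigned long tlbtest_rdtsc(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long)hi << 32) | lo;
}

static inline void tlbtest_invlpg(volatile void *addr)
{
	asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}

static int __init tlbtest_init(void)
{
	/* one extra page plays the role of "another_addr" */
	volatile char *buf = vmalloc((N_PAGES + 1) * PAGE_SIZE);
	volatile void *another_addr;
	unsigned long t_start, t_single;
	volatile char v;
	int i, j;

	if (!buf)
		return -ENOMEM;
	another_addr = buf + N_PAGES * PAGE_SIZE;

	preempt_disable();	/* keep the measurement on one CPU */
	t_start = tlbtest_rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		tlbtest_invlpg(another_addr);
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];
	}
	t_single = tlbtest_rdtsc() - t_start;
	preempt_enable();

	pr_info("with invlpg: %lu\n", t_single);
	(void)v;
	vfree((void *)buf);
	return -EAGAIN;		/* refuse to load so insmod can be re-run */
}

module_init(tlbtest_init);
MODULE_LICENSE("GPL");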
Argh... I don't get the same behavior in the guest with the module test.
I'll need some more time to figure it out.

Just a small comment regarding your "global" test: you forgot to set
CR4.PGE. Once I set it, I get reasonable numbers (excluding the invlpg
flavor):

with invlpg:      964431529
with full flush:  268190767
invlpg only       126114041
full flushes only 185971818
access net        111229828
w/full flush net   82218949  --> similar to access net
w/invlpg net      838317488

I'll be back when I have more understanding of the situation.

Thanks,
Nadav

Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>>
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially
>> STLB misses) in other experiments, but not in this one. I will run some
>> more experiments, specifically to test how AMD behaves.
>
> I'm curious about AMD too now...
>
> with invlpg:      285,639,427
> with full flush:  584,419,299
> invlpg only        70,681,128
> full flushes only 265,238,766
> access net        242,538,804
> w/full flush net  319,180,533
> w/invlpg net      214,958,299
>
> Roughly the same with and without pte.g. So AMD behaves as it should.
>
>> I should note this is a byproduct of a study I did; it is not as if I
>> was looking for strange behaviors (no more validation papers for me!).
>>
>> The strangest thing is that on bare metal I don't see this phenomenon -
>> I doubt it is a CPU "feature". Once we understand it, at the very least
>> it may affect the recommended value of "tlb_single_page_flush_ceiling",
>> which controls when the kernel performs a full TLB flush vs. selective
>> flushes.
>
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).
>
> Paolo
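The missing step, for anyone replaying the global-page experiment: PTE.G is
architecturally ignored unless CR4.PGE is enabled. A one-line sketch for
the test's setup path, assuming kvm-unit-tests provides read_cr4() and
write_cr4() alongside the read_cr3()/write_cr3() used above; the bit value
is spelled out in case the tree does not define it.

/* CR4.PGE is bit 7; the PTE_GLOBAL bits from the patch in this thread
 * only take effect once it is set. */
#define X86_CR4_PGE (1ul << 7)

	/* in main(), right after setup_vm() */
	write_cr4(read_cr4() | X86_CR4_PGE);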
OK, it seems to be related to the guest/host page sizes: if you run 2MB
pages in a VM on top of 4KB pages in the host, any invlpg in the VM causes
all 2MB guest pages to be flushed. I'll try to find time to make sure there
is nothing else to it.

Thanks for the assistance, and let me know if you need my hacky tests.

The measurements below are of VM and bare metal:

        Host    Guest   Full Flush     Selective Flush
        PGsize  PGsize  (dTLB misses)  (dTLB misses)
  -----------------------------------------------------
  VM    4KB     4KB     103,008,052         93,172
        4KB     2MB     102,022,557    102,038,021
        2MB     4KB     103,005,083          2,888
        2MB     2MB       4,002,969          2,556
  HOST  4KB     -        50,000,572            789
        2MB     -         1,000,454            537

Nadav Amit <nadav.amit@gmail.com> wrote:

> Argh... I don't get the same behavior in the guest with the module test.
> I'll need some more time to figure it out.
>
> Just a small comment regarding your "global" test: you forgot to set
> CR4.PGE. Once I set it, I get reasonable numbers (excluding the invlpg
> flavor):
>
> with invlpg:      964431529
> with full flush:  268190767
> invlpg only       126114041
> full flushes only 185971818
> access net        111229828
> w/full flush net   82218949  --> similar to access net
> w/invlpg net      838317488
>
> I'll be back when I have more understanding of the situation.
>
> Thanks,
> Nadav
>
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 16/05/2016 18:51, Nadav Amit wrote:
>>> Thanks! I appreciate it.
>>>
>>> I think your experiment with global paging just corroborates that the
>>> latency is caused by TLB misses. I measured TLB misses (and especially
>>> STLB misses) in other experiments, but not in this one. I will run some
>>> more experiments, specifically to test how AMD behaves.
>>
>> I'm curious about AMD too now...
>>
>> with invlpg:      285,639,427
>> with full flush:  584,419,299
>> invlpg only        70,681,128
>> full flushes only 265,238,766
>> access net        242,538,804
>> w/full flush net  319,180,533
>> w/invlpg net      214,958,299
>>
>> Roughly the same with and without pte.g. So AMD behaves as it should.
>>
>>> I should note this is a byproduct of a study I did; it is not as if I
>>> was looking for strange behaviors (no more validation papers for me!).
>>>
>>> The strangest thing is that on bare metal I don't see this phenomenon -
>>> I doubt it is a CPU "feature". Once we understand it, at the very least
>>> it may affect the recommended value of "tlb_single_page_flush_ceiling",
>>> which controls when the kernel performs a full TLB flush vs. selective
>>> flushes.
>>
>> Do you have a kernel module to reproduce the test on bare metal? (/me is
>> lazy).
>>
>> Paolo
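To reproduce the four VM rows: the guest page size can be switched with the
install_large_page() helper touched by the patch in this thread, and the
host page size by backing guest RAM with hugetlbfs (for example,
mount -t hugetlbfs none /dev/hugepages and run qemu with
-mem-path /dev/hugepages; 4KB backing is the default). A rough guest-side
sketch, assuming install_large_page() takes the same (cr3, phys, virt)
arguments as install_page() and that setup_vm() identity-maps the buffer:

	/* Remap the test buffer with a 2MB guest page instead of 4KB
	 * ones. Assumes phys == virt (identity map) and a 2MB-aligned
	 * buffer - an extra requirement the original 50-page buffer
	 * does not guarantee. */
	unsigned long *cr3 = (unsigned long *)read_cr3();

	install_large_page(cr3, (unsigned long)buf, (void *)buf);
	invlpg((void *)buf);	/* drop any stale 4KB translation */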
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>>
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially
>> STLB misses) in other experiments, but not in this one. I will run some
>> more experiments, specifically to test how AMD behaves.
>
> I'm curious about AMD too now...
>
> with invlpg:      285,639,427
> with full flush:  584,419,299
> invlpg only        70,681,128
> full flushes only 265,238,766
> access net        242,538,804
> w/full flush net  319,180,533
> w/invlpg net      214,958,299
>
> Roughly the same with and without pte.g. So AMD behaves as it should.
>
>> I should note this is a byproduct of a study I did; it is not as if I
>> was looking for strange behaviors (no more validation papers for me!).
>>
>> The strangest thing is that on bare metal I don't see this phenomenon -
>> I doubt it is a CPU "feature". Once we understand it, at the very least
>> it may affect the recommended value of "tlb_single_page_flush_ceiling",
>> which controls when the kernel performs a full TLB flush vs. selective
>> flushes.
>
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).

It came to my mind that I never told you what eventually turned out to be
the issue. (Yes, I know it is a very old thread, but you may still be
interested.)

It turns out that Intel has something called "page fracturing": once the
TLB caches a translation that came from a 2MB guest page backed by 4KB host
pages, INVLPG ends up flushing the entire TLB. I guess they need to do it
to follow SDM 4.10.4.1 (regarding pages larger than 4 KBytes):

  "The INVLPG instruction and page faults provide the same assurances that
  they provide when a single TLB entry is used: they invalidate all TLB
  entries corresponding to the translation specified by the paging
  structures."

Thanks again for your help,
Nadav
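One line of arithmetic makes the "fracturing" concrete: a single 2MB guest
translation backed by 4KB host pages can be cached as up to 512 small TLB
entries, and the SDM text above obliges INVLPG to invalidate every one of
them, so flushing the whole TLB is the simple way for the hardware to
comply. (My reading of the quoted passage, not something the thread
verifies further.)

/* Fracture factor: one 2MB guest page over 4KB host (EPT) pages. */
#define GUEST_PAGE_SIZE   (2ul << 20)				/* 2MB */
#define HOST_PAGE_SIZE    (4ul << 10)				/* 4KB */
#define FRACTURED_ENTRIES (GUEST_PAGE_SIZE / HOST_PAGE_SIZE)	/* 512 */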
diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 7ce7bbc..3b9b81a 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -2,6 +2,7 @@
 #include "vm.h"
 #include "libcflat.h"
 
+#define PTE_GLOBAL 256
 #define PAGE_SIZE 4096ul
 #ifdef __x86_64__
 #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
@@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
 			    void *virt)
 {
     return install_pte(cr3, 2, virt,
-                       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
+                       phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
 }
 
 unsigned long *install_page(unsigned long *cr3,
 			    unsigned long phys,
 			    void *virt)
 {
-    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
+    return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
 }