Message ID | 1483601042-6435-1-git-send-email-jitendra.kolhe@hpe.com (mailing list archive) |
---|---|
State | New, archived |
Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time. This is because with "-mem-prealloc" option
> qemu tries to map every guest page (create address translations), and
> make sure the pages are available during runtime. virsh/libvirt by
> default seems to use the "-mem-prealloc" option in case the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given the problem is more
> prominent for large guests, the patch limits the changes to guests of
> at least 64GB memory size. Currently limiting the change to QEMU
> library functions on POSIX-compliant hosts only, as we are not sure if
> the problem exists on win32. Below are some stats with "-mem-prealloc"
> option for a guest configured to use huge pages.
>
> ------------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
                                    ^^^^^^^^^^
> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
                                    ^^^^^^^^^

Impressive, it's not every day one gets a speedup of 20 O:-)

> +static void *do_touch_pages(void *arg)
> +{
> +    PageRange *range = (PageRange *)arg;
> +    char *start_addr = range->addr;
> +    uint64_t numpages = range->numpages;
> +    uint64_t hpagesize = range->hpagesize;
> +    uint64_t i = 0;
> +
> +    for (i = 0; i < numpages; i++) {
> +        memset(start_addr + (hpagesize * i), 0, 1);

I would use range->addr and friends here directly, but that is just a
question of taste.

> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +        /* touch pages simultaneously for memory >= 64G */
> +        if (memory < (1ULL << 36)) {

A 64GB guest already took quite a bit of time, so I would make it always
min(num_vcpus, 16). That way we always execute the multi-threaded code
path.

But very nice, thanks.

Later, Juan.
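[Editor's note: the heuristic Juan suggests could look roughly like the
sketch below. This is illustrative only and not part of the posted patch;
it assumes the configured VCPU count (e.g. smp_cpus) is plumbed down into
util/oslib-posix.c, which the patch as posted does not do.]

```c
/* Hypothetical helper, not in the posted patch: size the touch-thread
 * pool from the VCPU count instead of a fixed constant, and always take
 * the multi-threaded path. */
static int get_touch_thread_count(int vcpu_count)
{
    int threads = vcpu_count < 16 ? vcpu_count : 16;  /* min(num_vcpus, 16) */
    return threads > 0 ? threads : 1;                 /* never less than 1 */
}
```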
* Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time.
[...]
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> ------------------------------------------------------------------------

That's a nice improvement.

> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
>  util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index f631464..a8bd7c2 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -55,6 +55,13 @@
>  #include "qemu/error-report.h"
>  #endif
>
> +#define PAGE_TOUCH_THREAD_COUNT 8

It seems a shame to fix that number as a constant.

[...]
> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>          size_t hpagesize = qemu_fd_getpagesize(fd);
>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>
> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +        /* touch pages simultaneously for memory >= 64G */
> +        if (memory < (1ULL << 36)) {
> +            /* MAP_POPULATE silently ignores failures */
> +            for (i = 0; i < numpages; i++) {
> +                memset(area + (hpagesize * i), 0, 1);
> +            }
> +        } else {
> +            touch_all_pages(area, hpagesize, numpages);
>          }
>      }

Maybe it's possible to do this quicker?
If we are using NUMA and have separate memory-blocks for each NUMA node,
won't this call os_mem_prealloc separately for each node?
I wonder if it's possible to get that to run in parallel?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 27/01/2017 13:53, Juan Quintela wrote:
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
>
> I would use range->addr and friends here directly, but that is just a
> question of taste.
>
>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
>
> A 64GB guest already took quite a bit of time, so I would make it always
> min(num_vcpus, 16). That way we always execute the multi-threaded code
> path.

I too would like some kind of heuristic to choose the number of threads.
Juan's suggested usage of the VCPUs (smp_cpus) is a good one.

Paolo
On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time.
[...]
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> ------------------------------------------------------------------------

For comparison, what is performance like if you replace memset() in
the current code with a call to mlock()?

IIUC, huge pages are non-swappable once allocated, so it feels like
we ought to be able to just call mlock() to preallocate them with
no downside, rather than spawning many threads to memset() them.

Of course you'd still need the memset() trick if qemu was given
non-hugepages in combination with --mem-prealloc, as you don't
want to lock normal pages into ram permanently.

Regards,
Daniel
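[Editor's note: the alternative Daniel describes would look roughly like
the sketch below. It is illustrative only: QEMU's Error reporting and the
hugepage/non-hugepage distinction are omitted, and the call is subject to
the host's RLIMIT_MEMLOCK.]

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of preallocating by locking the whole region instead of touching
 * one byte per page. mlock() is standard POSIX. */
static int prealloc_by_mlock(void *area, size_t memory)
{
    if (mlock(area, memory) != 0) {
        return -errno;  /* e.g. ENOMEM when RLIMIT_MEMLOCK is too small */
    }
    return 0;
}
```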
On 1/27/2017 6:23 PM, Juan Quintela wrote:
> Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time.
[...]
>
> Impressive, it's not every day one gets a speedup of 20 O:-)
>
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
>
> I would use range->addr and friends here directly, but that is just a
> question of taste.

Thanks for your response, will update my next patch.

>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
>
> A 64GB guest already took quite a bit of time, so I would make it always
> min(num_vcpus, 16). That way we always execute the multi-threaded code
> path.

A heuristic based on the vcpu count sounds like a good idea; I will add it
in my next version. But shouldn't we also restrict based on guest RAM size,
to avoid the overhead of spawning multiple threads for thin guests?

Thanks,
- Jitendra

> But very nice, thanks.
>
> Later, Juan.
On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
> * Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time.
[...]
>
> That's a nice improvement.
>
>> +#define PAGE_TOUCH_THREAD_COUNT 8
>
> It seems a shame to fix that number as a constant.

Yes, as per the comments received, we will update the patch to incorporate
the vcpu count.

[...]
>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
>> +            /* MAP_POPULATE silently ignores failures */
>> +            for (i = 0; i < numpages; i++) {
>> +                memset(area + (hpagesize * i), 0, 1);
>> +            }
>> +        } else {
>> +            touch_all_pages(area, hpagesize, numpages);
>>          }
>>      }
>
> Maybe it's possible to do this quicker?
> If we are using NUMA and have separate memory-blocks for each NUMA node,
> won't this call os_mem_prealloc separately for each node?
> I wonder if it's possible to get that to run in parallel?

I will investigate.

Thanks,
- Jitendra

> Dave
>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 1/27/2017 6:56 PM, Daniel P. Berrange wrote:
> On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time.
[...]
>
> For comparison, what is performance like if you replace memset() in
> the current code with a call to mlock()?

It doesn't look like we get much benefit by replacing the memset() loop
with a single instance of mlock(). Here are some numbers from my system.

#hugepages    | memset     | memset      | memset       | mlock (entire range)
(2M size)     | (1 thread) | (8 threads) | (16 threads) | (1 thread)
--------------|------------|-------------|--------------|---------------------
1048576 (2TB) | 1790661 ms | 105577 ms   | 37331 ms     | 1789580 ms
524288 (1TB)  | 895119 ms  | 52795 ms    | 18686 ms     | 894199 ms
131072 (256G) | 173081 ms  | 9337 ms     | 4667 ms      | 172506 ms
------------------------------------------------------------------------------

> IIUC, huge pages are non-swappable once allocated, so it feels like
> we ought to be able to just call mlock() to preallocate them with
> no downside, rather than spawning many threads to memset() them.

Yes, for me too it looks like mlock() should do the job in the hugepage
case.

> Of course you'd still need the memset() trick if qemu was given
> non-hugepages in combination with --mem-prealloc, as you don't
> want to lock normal pages into ram permanently.

Given the above numbers, I think we can stick to the memset()
implementation for both the hugepage and non-hugepage cases?

Thanks,
- Jitendra

> Regards,
> Daniel
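[Editor's note: timings like the ones in the table above can be collected
with a simple monotonic-clock wrapper around the touch loop. The sketch
below is illustrative only and is not the harness that produced those
numbers.]

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Times one single-threaded touch pass over the region, in milliseconds. */
static uint64_t time_touch_pass_ms(char *area, size_t hpagesize,
                                   size_t numpages)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < numpages; i++) {
        memset(area + hpagesize * i, 0, 1);   /* fault in one page */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
                 (t1.tv_nsec - t0.tv_nsec);
    return (uint64_t)(ns / 1000000);
}
```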
On 02/02/2017 01:35, Jitendra Kolhe wrote:
>> Of course you'd still need the memset() trick if qemu was given
>> non-hugepages in combination with --mem-prealloc, as you don't
>> want to lock normal pages into ram permanently.
>
> Given the above numbers, I think we can stick to the memset()
> implementation for both the hugepage and non-hugepage cases?

Yes, of course!

Paolo
On 1/30/2017 2:02 PM, Jitendra Kolhe wrote:
> On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
>> * Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
>>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>>> start-up and migration time.
[...]
>>
>> Maybe it's possible to do this quicker?
>> If we are using NUMA and have separate memory-blocks for each NUMA node,
>> won't this call os_mem_prealloc separately for each node?
>> I wonder if it's possible to get that to run in parallel?
>
> I will investigate.

Each NUMA node seems to be treated as an independent qemu object, and we
touch its pages in os_mem_prealloc() while that object is being parsed and
created. To parallelize NUMA node creation we would need to modify
host_memory_backend_memory_complete() so that, for the last NUMA object,
it waits for all previously spawned node-creation threads to finish their
job. That involves parsing the command-line options more than once (to
identify whether the node currently being serviced is the last one), and
parsing the command line in an object-specific implementation does not
look correct. Also, if we parallelize across NUMA nodes, the number of
memset threads per node would have to be reduced accordingly so that we
don't spawn too many threads, e.g. threads per NUMA node =
min(#vcpus, 16) / #numa-nodes (see the sketch below). With the current
implementation we would see min(#vcpus, 16) threads working on one NUMA
node at a time. Both implementations should have almost the same
performance?

Thanks,
- Jitendra
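[Editor's note: a sketch of the thread-budget split described above. It is
hypothetical: neither the VCPU count nor the number of memory backends is
currently available in util/oslib-posix.c, so both are assumed to be passed
in by the caller.]

```c
/* Hypothetical helper: divide the global touch-thread budget across memory
 * backends so that parallel per-NUMA-node preallocation does not
 * oversubscribe the host. */
static int touch_threads_per_node(int vcpu_count, int nr_memory_nodes)
{
    int total = vcpu_count < 16 ? vcpu_count : 16;      /* min(#vcpus, 16) */
    int nodes = nr_memory_nodes > 0 ? nr_memory_nodes : 1;
    int per_node = total / nodes;
    return per_node > 0 ? per_node : 1;
}
```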
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..a8bd7c2 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -55,6 +55,13 @@
 #include "qemu/error-report.h"
 #endif
 
+#define PAGE_TOUCH_THREAD_COUNT 8
+typedef struct {
+    char *addr;
+    uint64_t numpages;
+    uint64_t hpagesize;
+} PageRange;
+
 int qemu_get_thread_id(void)
 {
 #if defined(__linux__)
@@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
     siglongjmp(sigjump, 1);
 }
 
+static void *do_touch_pages(void *arg)
+{
+    PageRange *range = (PageRange *)arg;
+    char *start_addr = range->addr;
+    uint64_t numpages = range->numpages;
+    uint64_t hpagesize = range->hpagesize;
+    uint64_t i = 0;
+
+    for (i = 0; i < numpages; i++) {
+        memset(start_addr + (hpagesize * i), 0, 1);
+    }
+    qemu_thread_exit(NULL);
+
+    return NULL;
+}
+
+static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
+{
+    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
+    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
+    uint64_t numpage_per_thread, size_per_thread;
+    int i = 0, tcount = 0;
+
+    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
+    size_per_thread = (hpagesize * numpage_per_thread);
+    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
+        page_range[i].addr = area;
+        page_range[i].numpages = numpage_per_thread;
+        page_range[i].hpagesize = hpagesize;
+
+        qemu_thread_create(page_threads + i, "touch_pages",
+                           do_touch_pages, (page_range + i),
+                           QEMU_THREAD_JOINABLE);
+        tcount++;
+        area += size_per_thread;
+        numpages -= numpage_per_thread;
+    }
+    for (i = 0; i < numpages; i++) {
+        memset(area + (hpagesize * i), 0, 1);
+    }
+    for (i = 0; i < tcount; i++) {
+        qemu_thread_join(page_threads + i);
+    }
+    return 0;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
 {
     int ret;
@@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
         size_t hpagesize = qemu_fd_getpagesize(fd);
         size_t numpages = DIV_ROUND_UP(memory, hpagesize);
 
-        /* MAP_POPULATE silently ignores failures */
-        for (i = 0; i < numpages; i++) {
-            memset(area + (hpagesize * i), 0, 1);
+        /* touch pages simultaneously for memory >= 64G */
+        if (memory < (1ULL << 36)) {
+            /* MAP_POPULATE silently ignores failures */
+            for (i = 0; i < numpages; i++) {
+                memset(area + (hpagesize * i), 0, 1);
+            }
+        } else {
+            touch_all_pages(area, hpagesize, numpages);
         }
     }
Using "-mem-prealloc" option for a very large guest leads to huge guest start-up and migration time. This is because with "-mem-prealloc" option qemu tries to map every guest page (create address translations), and make sure the pages are available during runtime. virsh/libvirt by default, seems to use "-mem-prealloc" option in case the guest is configured to use huge pages. The patch tries to map all guest pages simultaneously by spawning multiple threads. Given the problem is more prominent for large guests, the patch limits the changes to the guests of at-least 64GB of memory size. Currently limiting the change to QEMU library functions on POSIX compliant host only, as we are not sure if the problem exists on win32. Below are some stats with "-mem-prealloc" option for guest configured to use huge pages. ------------------------------------------------------------------------ Idle Guest | Start-up time | Migration time ------------------------------------------------------------------------ Guest stats with 2M HugePage usage - single threaded (existing code) ------------------------------------------------------------------------ 64 Core - 4TB | 54m11.796s | 75m43.843s 64 Core - 1TB | 8m56.576s | 14m29.049s 64 Core - 256GB | 2m11.245s | 3m26.598s ------------------------------------------------------------------------ Guest stats with 2M HugePage usage - map guest pages using 8 threads ------------------------------------------------------------------------ 64 Core - 4TB | 5m1.027s | 34m10.565s 64 Core - 1TB | 1m10.366s | 8m28.188s 64 Core - 256GB | 0m19.040s | 2m10.148s ----------------------------------------------------------------------- Guest stats with 2M HugePage usage - map guest pages using 16 threads ----------------------------------------------------------------------- 64 Core - 4TB | 1m58.970s | 31m43.400s 64 Core - 1TB | 0m39.885s | 7m55.289s 64 Core - 256GB | 0m11.960s | 2m0.135s ----------------------------------------------------------------------- Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com> --- util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 61 insertions(+), 3 deletions(-)