
[RFC,2/3] cpus-common: Cache allocated work items

Message ID 20170828035327.17146-2-bobby.prani@gmail.com (mailing list archive)
State New, archived

Commit Message

Pranith Kumar Aug. 28, 2017, 3:53 a.m. UTC
Using heaptrack, I found that quite a few of our temporary allocations
are coming from allocating work items. Instead of doing this
continuously, we can cache the allocated items and reuse them instead
of freeing them.

This reduces the number of allocations by 25% (200000 -> 150000 for
ARM64 boot+shutdown test).

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 cpus-common.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 70 insertions(+), 15 deletions(-)

Comments

Richard Henderson Aug. 28, 2017, 5:47 p.m. UTC | #1
On 08/27/2017 08:53 PM, Pranith Kumar wrote:
> Using heaptrack, I found that quite a few of our temporary allocations
> are coming from allocating work items. Instead of doing this
> continuously, we can cache the allocated items and reuse them instead
> of freeing them.
> 
> This reduces the number of allocations by 25% (200000 -> 150000 for
> ARM64 boot+shutdown test).
> 
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  cpus-common.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 70 insertions(+), 15 deletions(-)
> 
> diff --git a/cpus-common.c b/cpus-common.c
> index 59f751ecf9..a1c4c7d1a3 100644
> --- a/cpus-common.c
> +++ b/cpus-common.c
> @@ -24,6 +24,7 @@
>  #include "sysemu/cpus.h"
>  
>  static QemuMutex qemu_cpu_list_lock;
> +static QemuMutex qemu_wi_pool_lock;
>  static QemuCond exclusive_cond;
>  static QemuCond exclusive_resume;
>  static QemuCond qemu_work_cond;
> @@ -33,6 +34,58 @@ static QemuCond qemu_work_cond;
>   */
>  static int pending_cpus;
>  
> +typedef struct qemu_work_item {
> +    struct qemu_work_item *next;
> +    run_on_cpu_func func;
> +    run_on_cpu_data data;
> +    bool free, exclusive, done;
> +} qemu_work_item;
> +
> +typedef struct qemu_wi_pool {
> +    qemu_work_item *first, *last;
> +} qemu_wi_pool;
> +
> +qemu_wi_pool *wi_free_pool;
> +
> +static void qemu_init_workitem_pool(void)
> +{
> +    wi_free_pool = g_malloc0(sizeof(qemu_wi_pool));
> +    wi_free_pool->first = NULL;
> +    wi_free_pool->last  = NULL;
> +}
> +
> +static void qemu_wi_pool_insert(qemu_work_item *item)
> +{
> +    qemu_mutex_lock(&qemu_wi_pool_lock);
> +    if (wi_free_pool->last == NULL) {
> +        wi_free_pool->first = item;
> +        wi_free_pool->last = item;
> +    } else {
> +        wi_free_pool->last->next = item;
> +        wi_free_pool->last = item;
> +    }
> +    qemu_mutex_unlock(&qemu_wi_pool_lock);
> +}
> +
> +static qemu_work_item* qemu_wi_pool_remove(void)
> +{
> +    qemu_mutex_lock(&qemu_wi_pool_lock);
> +    qemu_work_item *ret = wi_free_pool->first;
> +
> +    if (ret == NULL)
> +        goto out;
> +
> +    wi_free_pool->first = ret->next;
> +    if (wi_free_pool->last == ret) {
> +        wi_free_pool->last = NULL;
> +    }

Why does this list need to record a "last" element?
It would seem a simple lifo would be sufficient.

(You would also be able to manage the list via cmpxchg without a separate lock,
but perhaps the difference between the two isn't measurable.)


r~
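
As a rough illustration of the cmpxchg-managed LIFO suggested above, a sketch
using plain C11 atomics (QEMU itself would use the qemu/atomic.h helpers)
could look like the following. The names wi_pool_push/wi_pool_pop are invented
for the example, and the ABA caveat noted in the comments would still need
real attention:

#include <stdatomic.h>

typedef struct qemu_work_item {
    struct qemu_work_item *next;
    /* func, data and flag fields elided for brevity */
} qemu_work_item;

static _Atomic(qemu_work_item *) wi_free_list;

/* Push an item onto the free list with a compare-and-swap loop. */
static void wi_pool_push(qemu_work_item *item)
{
    qemu_work_item *head = atomic_load(&wi_free_list);
    do {
        item->next = head;
        /* On failure the CAS refreshes 'head' with the current value. */
    } while (!atomic_compare_exchange_weak(&wi_free_list, &head, item));
}

/* Pop the most recently pushed item, or NULL if the list is empty. */
static qemu_work_item *wi_pool_pop(void)
{
    qemu_work_item *head = atomic_load(&wi_free_list);
    /* Items in the pool are never freed back to the heap, so reading
     * head->next is safe even under concurrent pops; a production
     * version would still need to consider the ABA problem. */
    while (head &&
           !atomic_compare_exchange_weak(&wi_free_list, &head, head->next)) {
        /* 'head' was refreshed by the failed CAS; retry. */
    }
    return head;
}
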
Emilio Cota Aug. 28, 2017, 7:05 p.m. UTC | #2
On Sun, Aug 27, 2017 at 23:53:25 -0400, Pranith Kumar wrote:
> Using heaptrack, I found that quite a few of our temporary allocations
> are coming from allocating work items. Instead of doing this
> continuously, we can cache the allocated items and reuse them instead
> of freeing them.
> 
> This reduces the number of allocations by 25% (200000 -> 150000 for
> ARM64 boot+shutdown test).
> 

But what is the perf difference, if any?

Adding a lock (or a cmpxchg) here is not a great idea. However, this is not yet
immediately obvious because of other scalability bottlenecks. (if
you boot many arm64 cores you'll see most of the time is spent idling
on the BQL, see
  https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05207.html )

You're most likely better off using glib's slices, see
  https://developer.gnome.org/glib/stable/glib-Memory-Slices.html
These slices use per-thread lists, so scalability should be OK.
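
As a rough sketch of that suggestion (the wrapper names below are invented for
illustration and the struct is abbreviated, so this is not the actual patch),
the allocation sites could use the g_slice API like this:

#include <glib.h>
#include <stdbool.h>

/* Abbreviated work item; the real struct also carries func/data fields. */
typedef struct qemu_work_item {
    struct qemu_work_item *next;
    void *data;
    bool free, exclusive, done;
} qemu_work_item;

/* GSlice keeps per-thread magazines, so no extra pool lock is needed. */
static qemu_work_item *work_item_get(void)
{
    return g_slice_new0(qemu_work_item);
}

static void work_item_put(qemu_work_item *wi)
{
    g_slice_free(qemu_work_item, wi);
}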

I also suggest profiling with either or both of jemalloc/tcmalloc
(build with --enable-jemalloc/tcmalloc) in addition to using glibc's
allocator, and then based on perf numbers decide whether this is something
worth optimizing.

Thanks,

                Emilio
Pranith Kumar Aug. 28, 2017, 9:43 p.m. UTC | #3
On Mon, Aug 28, 2017 at 1:47 PM, Richard Henderson
<richard.henderson@linaro.org> wrote:
> On 08/27/2017 08:53 PM, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continuously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (200000 -> 150000 for
>> ARM64 boot+shutdown test).
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>
> Why does this list need to record a "last" element?
> It would seem a simple lifo would be sufficient.
>
> (You would also be able to manage the list via cmpxchg without a separate lock,
> but perhaps the difference between the two isn't measurable.)
>

Yes, seems like a better design choice. Will fix in next iteration.

Thanks,
Pranith Kumar Aug. 28, 2017, 9:51 p.m. UTC | #4
On Mon, Aug 28, 2017 at 3:05 PM, Emilio G. Cota <cota@braap.org> wrote:
> On Sun, Aug 27, 2017 at 23:53:25 -0400, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continuously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (200000 -> 150000 for
>> ARM64 boot+shutdown test).
>>
>
> But what is the perf difference, if any?
>
> Adding a lock (or a cmpxchg) here is not a great idea. However, this is not yet
> immediately obvious because of other scalability bottlenecks. (if
> you boot many arm64 cores you'll see most of the time is spent idling
> on the BQL, see
>   https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05207.html )
>
> You're most likely better off using glib's slices, see
>   https://developer.gnome.org/glib/stable/glib-Memory-Slices.html
> These slices use per-thread lists, so scalability should be OK.

I think we should modify our g_malloc() to internally use this. Seems
like an idea worth trying out.

>
> I also suggest profiling with either or both of jemalloc/tcmalloc
> (build with --enable-jemalloc/tcmalloc) in addition to using glibc's
> allocator, and then based on perf numbers decide whether this is something
> worth optimizing.
>

OK, I will try to get some perf numbers.
Paolo Bonzini Aug. 29, 2017, 8:38 p.m. UTC | #5
On Aug 28, 2017, at 11:43 PM, "Pranith Kumar" <bobby.prani@gmail.com> wrote:

On Mon, Aug 28, 2017 at 1:47 PM, Richard Henderson
<richard.henderson@linaro.org> wrote:
> On 08/27/2017 08:53 PM, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continuously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (200000 -> 150000 for
>> ARM64 boot+shutdown test).
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>
> Why does this list need to record a "last" element?
> It would seem a simple lifo would be sufficient.
>
> (You would also be able to manage the list via cmpxchg without a separate lock,
> but perhaps the difference between the two isn't measurable.)
>

Yes, seems like a better design choice. Will fix in next iteration.


More recent glibc will also have an efficient per-thread allocator, and
though I haven't benchmarked the newer glibc malloc yet, GSlice is slower
than at least tcmalloc and jemalloc. Perhaps you could instead make the
work items statically allocated?
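
One possible reading of "statically allocated" is a small fixed array of work
items with a heap fallback. The sketch below is purely illustrative: the pool
size, the names, the from_static_pool marker and the omitted locking are all
assumptions, not anything taken from the patch or the thread:

#include <glib.h>
#include <stdbool.h>
#include <string.h>

#define WI_STATIC_POOL_SIZE 16

/* Abbreviated work item with an invented marker for the fallback path. */
typedef struct qemu_work_item {
    struct qemu_work_item *next;
    void *data;
    bool free, exclusive, done;
    bool from_static_pool;
} qemu_work_item;

static qemu_work_item wi_static_pool[WI_STATIC_POOL_SIZE];
static bool wi_static_used[WI_STATIC_POOL_SIZE];

/* Locking (e.g. reusing an existing mutex) is deliberately elided here. */
static qemu_work_item *work_item_get(void)
{
    for (int i = 0; i < WI_STATIC_POOL_SIZE; i++) {
        if (!wi_static_used[i]) {
            wi_static_used[i] = true;
            memset(&wi_static_pool[i], 0, sizeof(wi_static_pool[i]));
            wi_static_pool[i].from_static_pool = true;
            return &wi_static_pool[i];
        }
    }
    return g_malloc0(sizeof(qemu_work_item));   /* heap fallback */
}

static void work_item_put(qemu_work_item *wi)
{
    if (wi->from_static_pool) {
        wi_static_used[wi - wi_static_pool] = false;
    } else {
        g_free(wi);
    }
}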

Thanks,

Paolo


Thanks,
--
Pranith

Patch

diff --git a/cpus-common.c b/cpus-common.c
index 59f751ecf9..a1c4c7d1a3 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -24,6 +24,7 @@ 
 #include "sysemu/cpus.h"
 
 static QemuMutex qemu_cpu_list_lock;
+static QemuMutex qemu_wi_pool_lock;
 static QemuCond exclusive_cond;
 static QemuCond exclusive_resume;
 static QemuCond qemu_work_cond;
@@ -33,6 +34,58 @@  static QemuCond qemu_work_cond;
  */
 static int pending_cpus;
 
+typedef struct qemu_work_item {
+    struct qemu_work_item *next;
+    run_on_cpu_func func;
+    run_on_cpu_data data;
+    bool free, exclusive, done;
+} qemu_work_item;
+
+typedef struct qemu_wi_pool {
+    qemu_work_item *first, *last;
+} qemu_wi_pool;
+
+qemu_wi_pool *wi_free_pool;
+
+static void qemu_init_workitem_pool(void)
+{
+    wi_free_pool = g_malloc0(sizeof(qemu_wi_pool));
+    wi_free_pool->first = NULL;
+    wi_free_pool->last  = NULL;
+}
+
+static void qemu_wi_pool_insert(qemu_work_item *item)
+{
+    qemu_mutex_lock(&qemu_wi_pool_lock);
+    if (wi_free_pool->last == NULL) {
+        wi_free_pool->first = item;
+        wi_free_pool->last = item;
+    } else {
+        wi_free_pool->last->next = item;
+        wi_free_pool->last = item;
+    }
+    qemu_mutex_unlock(&qemu_wi_pool_lock);
+}
+
+static qemu_work_item* qemu_wi_pool_remove(void)
+{
+    qemu_mutex_lock(&qemu_wi_pool_lock);
+    qemu_work_item *ret = wi_free_pool->first;
+
+    if (ret == NULL)
+        goto out;
+
+    wi_free_pool->first = ret->next;
+    if (wi_free_pool->last == ret) {
+        wi_free_pool->last = NULL;
+    }
+    ret->next = NULL;
+
+ out:
+    qemu_mutex_unlock(&qemu_wi_pool_lock);
+    return ret;
+}
+
 void qemu_init_cpu_list(void)
 {
     /* This is needed because qemu_init_cpu_list is also called by the
@@ -43,6 +96,9 @@  void qemu_init_cpu_list(void)
     qemu_cond_init(&exclusive_cond);
     qemu_cond_init(&exclusive_resume);
     qemu_cond_init(&qemu_work_cond);
+
+    qemu_init_workitem_pool();
+    qemu_mutex_init(&qemu_wi_pool_lock);
 }
 
 void cpu_list_lock(void)
@@ -106,14 +162,7 @@  void cpu_list_remove(CPUState *cpu)
     qemu_mutex_unlock(&qemu_cpu_list_lock);
 }
 
-struct qemu_work_item {
-    struct qemu_work_item *next;
-    run_on_cpu_func func;
-    run_on_cpu_data data;
-    bool free, exclusive, done;
-};
-
-static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
+static void queue_work_on_cpu(CPUState *cpu, qemu_work_item *wi)
 {
     qemu_mutex_lock(&cpu->work_mutex);
     if (cpu->queued_work_first == NULL) {
@@ -132,7 +181,7 @@  static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
 void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
                    QemuMutex *mutex)
 {
-    struct qemu_work_item wi;
+    qemu_work_item wi;
 
     if (qemu_cpu_is_self(cpu)) {
         func(cpu, data);
@@ -145,6 +194,7 @@  void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
     wi.free = false;
     wi.exclusive = false;
 
+
     queue_work_on_cpu(cpu, &wi);
     while (!atomic_mb_read(&wi.done)) {
         CPUState *self_cpu = current_cpu;
@@ -156,9 +206,11 @@  void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
 
 void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data)
 {
-    struct qemu_work_item *wi;
+    qemu_work_item *wi = qemu_wi_pool_remove();
 
-    wi = g_malloc0(sizeof(struct qemu_work_item));
+    if (!wi) {
+        wi = g_malloc0(sizeof(qemu_work_item));
+    }
     wi->func = func;
     wi->data = data;
     wi->free = true;
@@ -299,9 +351,11 @@  void cpu_exec_end(CPUState *cpu)
 void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func,
                            run_on_cpu_data data)
 {
-    struct qemu_work_item *wi;
+    qemu_work_item *wi = qemu_wi_pool_remove();
 
-    wi = g_malloc0(sizeof(struct qemu_work_item));
+    if (!wi) {
+        wi = g_malloc0(sizeof(qemu_work_item));
+    }
     wi->func = func;
     wi->data = data;
     wi->free = true;
@@ -312,7 +366,7 @@  void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func,
 
 void process_queued_cpu_work(CPUState *cpu)
 {
-    struct qemu_work_item *wi;
+    qemu_work_item *wi;
 
     if (cpu->queued_work_first == NULL) {
         return;
@@ -343,7 +397,8 @@  void process_queued_cpu_work(CPUState *cpu)
         }
         qemu_mutex_lock(&cpu->work_mutex);
         if (wi->free) {
-            g_free(wi);
+            memset(wi, 0, sizeof(qemu_work_item));
+            qemu_wi_pool_insert(wi);
         } else {
             atomic_mb_set(&wi->done, true);
         }