diff mbox

[v1,1/2] reduce qemu's heap Rss size from 12252kB to 2752KB

Message ID 1489158897-9206-1-git-send-email-yang.zhong@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yang Zhong March 10, 2017, 3:14 p.m. UTC
There are a lot of memory allocation during the qemu bootup, which are
freed later by RCU thread,which means the heap size becomes biger and
biger when allocation happens, but the heap may not be shrinked even
after release happens,because some memory blocks in top of heap are still
being used.Decreasing the sleep and removing qemu_mutex_unlock_iothread()
lock, which make call_rcu_thread()thread response the free memory in time.
This patch will reduce heap Rss around 10M.

This patch is from Anthony xu <anthony.xu@intel.com>.

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 util/rcu.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Comments

Paolo Bonzini March 10, 2017, 8:41 a.m. UTC | #1
On 10/03/2017 16:14, Yang Zhong wrote:
> There are a lot of memory allocation during the qemu bootup, which are
> freed later by RCU thread,which means the heap size becomes biger and
> biger when allocation happens, but the heap may not be shrinked even
> after release happens,because some memory blocks in top of heap are still
> being used.Decreasing the sleep and removing qemu_mutex_unlock_iothread()
> lock, which make call_rcu_thread()thread response the free memory in time.
> This patch will reduce heap Rss around 10M.
> 
> This patch is from Anthony xu <anthony.xu@intel.com>.
> 
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  util/rcu.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/util/rcu.c b/util/rcu.c
> index 9adc5e4..c5c373c 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -167,7 +167,7 @@ void synchronize_rcu(void)
>  }
>  
>  
> -#define RCU_CALL_MIN_SIZE        30
> +#define RCU_CALL_MIN_SIZE        5
>  
>  /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
>   * from liburcu.  Note that head is only used by the consumer.
> @@ -241,7 +241,7 @@ static void *call_rcu_thread(void *opaque)
>           * added before synchronize_rcu() starts.
>           */
>          while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
> -            g_usleep(10000);
> +            g_usleep(100);
>              if (n == 0) {
>                  qemu_event_reset(&rcu_call_ready_event);
>                  n = atomic_read(&rcu_call_count);
> @@ -254,24 +254,20 @@ static void *call_rcu_thread(void *opaque)
>  
>          atomic_sub(&rcu_call_count, n);
>          synchronize_rcu();
> -        qemu_mutex_lock_iothread();
>          while (n > 0) {
>              node = try_dequeue();
>              while (!node) {
> -                qemu_mutex_unlock_iothread();
>                  qemu_event_reset(&rcu_call_ready_event);
>                  node = try_dequeue();
>                  if (!node) {
>                      qemu_event_wait(&rcu_call_ready_event);
>                      node = try_dequeue();
>                  }
> -                qemu_mutex_lock_iothread();
>              }
>  
>              n--;
>              node->func(node);
>          }
> -        qemu_mutex_unlock_iothread();

This is wrong.  RCU callbacks currently need the "big QEMU lock",
because they can free arbitrary QOM objects (including MemoryRegions).

Paolo

>      }
>      abort();
>  }
>
Xu, Anthony March 10, 2017, 4:05 p.m. UTC | #2
> -----Original Message-----

> From: Paolo Bonzini [mailto:pbonzini@redhat.com]

> Sent: Friday, March 10, 2017 12:42 AM

> To: Zhong, Yang <yang.zhong@intel.com>; qemu-devel@nongnu.org

> Cc: Xu, Anthony <anthony.xu@intel.com>; Peng, Chao P

> <chao.p.peng@intel.com>

> Subject: Re: [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to

> 2752KB

> 

> 

> 

> On 10/03/2017 16:14, Yang Zhong wrote:

> > There are a lot of memory allocation during the qemu bootup, which are

> > freed later by RCU thread,which means the heap size becomes biger and

> > biger when allocation happens, but the heap may not be shrinked even

> > after release happens,because some memory blocks in top of heap are

> > still being used.Decreasing the sleep and removing

> > qemu_mutex_unlock_iothread() lock, which make call_rcu_thread()thread

> response the free memory in time.

> > This patch will reduce heap Rss around 10M.

> >

> > This patch is from Anthony xu <anthony.xu@intel.com>.

> >

> > Signed-off-by: Yang Zhong <yang.zhong@intel.com>

> > ---

> >  util/rcu.c | 8 ++------

> >  1 file changed, 2 insertions(+), 6 deletions(-)

> >

> > diff --git a/util/rcu.c b/util/rcu.c

> > index 9adc5e4..c5c373c 100644

> > --- a/util/rcu.c

> > +++ b/util/rcu.c

> > @@ -167,7 +167,7 @@ void synchronize_rcu(void)  }

> >

> >

> > -#define RCU_CALL_MIN_SIZE        30

> > +#define RCU_CALL_MIN_SIZE        5

> >

> >  /* Multi-producer, single-consumer queue based on

> urcu/static/wfqueue.h

> >   * from liburcu.  Note that head is only used by the consumer.

> > @@ -241,7 +241,7 @@ static void *call_rcu_thread(void *opaque)

> >           * added before synchronize_rcu() starts.

> >           */

> >          while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {

> > -            g_usleep(10000);

> > +            g_usleep(100);

> >              if (n == 0) {

> >                  qemu_event_reset(&rcu_call_ready_event);

> >                  n = atomic_read(&rcu_call_count); @@ -254,24 +254,20

> > @@ static void *call_rcu_thread(void *opaque)

> >

> >          atomic_sub(&rcu_call_count, n);

> >          synchronize_rcu();

> > -        qemu_mutex_lock_iothread();

> >          while (n > 0) {

> >              node = try_dequeue();

> >              while (!node) {

> > -                qemu_mutex_unlock_iothread();

> >                  qemu_event_reset(&rcu_call_ready_event);

> >                  node = try_dequeue();

> >                  if (!node) {

> >                      qemu_event_wait(&rcu_call_ready_event);

> >                      node = try_dequeue();

> >                  }

> > -                qemu_mutex_lock_iothread();

> >              }

> >

> >              n--;

> >              node->func(node);

> >          }

> > -        qemu_mutex_unlock_iothread();

> 

> This is wrong.  RCU callbacks currently need the "big QEMU lock", because

> they can free arbitrary QOM objects (including MemoryRegions).


Using "big QEMU lock" in RCU callbacks is a little bit heavy, we'd like to remove it.

We noticed an issue related to MemoryRegion popping up after we removed the global lock.
The root cause is in MemoryRegion destruction memory_region_transaction_commit is called, which may cause issue.
Freeing MemoryRegion seems not cause any issue.
Patch2 in this series fixes this issue,
We found out only subpage MemoryRegion is freed in RCU callbacks, and memory_region_transaction_commit
is not needed for subpage MemoryRegion because it doesn't have sub regions.

Ideally, freeing QOM object should not require a global lock. 
If you see any other QOM requiring a global lock, please let us know, we are willing to fix it.


Thanks,
Anthony


> 

> Paolo

> 

> >      }

> >      abort();

> >  }

> >
Paolo Bonzini March 10, 2017, 4:07 p.m. UTC | #3
On 10/03/2017 17:05, Xu, Anthony wrote:
> Ideally, freeing QOM object should not require a global lock. 
> If you see any other QOM requiring a global lock, please let us know, we are willing to fix it.

All of them.  When unplugging a device, the device object will be freed
from an RCU callback.

Paolo
Xu, Anthony March 10, 2017, 7:30 p.m. UTC | #4
> > Ideally, freeing QOM object should not require a global lock.

> > If you see any other QOM requiring a global lock, please let us know, we

> are willing to fix it.

> 

> All of them.  When unplugging a device, the device object will be freed

> from an RCU callback.

> 


Thanks for your reply,
Some objects may not need lock at all to be freed,
Some objects may just need a small lock to be freed,

Should we let RCU callbacks to decide which lock they need instead of enforcing  
the global lock in RCU thread?

As for the device object, can we get the global lock inside the object free/destroy function for now?

If we can remove the global lock inside RCU thread, we can save 9MB heap memory, that's a lot!

Please share with us if you have other idea to do this.

Thanks
Anthony
Paolo Bonzini March 11, 2017, 5:02 p.m. UTC | #5
----- Original Message -----
> From: "Anthony Xu" <anthony.xu@intel.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>, "Yang Zhong" <yang.zhong@intel.com>, qemu-devel@nongnu.org
> Cc: "Chao P Peng" <chao.p.peng@intel.com>
> Sent: Friday, March 10, 2017 8:30:06 PM
> Subject: RE: [PATCH v1 1/2] reduce qemu's heap Rss size from 12252kB to 2752KB
> 
> > > Ideally, freeing QOM object should not require a global lock.
> > > If you see any other QOM requiring a global lock, please let us know, we
> > are willing to fix it.
> > 
> > All of them.  When unplugging a device, the device object will be freed
> > from an RCU callback.
> > 
> 
> Thanks for your reply,
> Some objects may not need lock at all to be freed,
> Some objects may just need a small lock to be freed,
> 
> Should we let RCU callbacks to decide which lock they need instead of
> enforcing the global lock in RCU thread?

Splitting the global lock and deciding which object requires which lock
is very, very hard.

The problem is that everything calling object_unref potentially needs the
global lock, and that includes the following call chain: object_unref
-> memory_region_unref -> flatview_destroy -> flatview_unref ->
address_space_update_topology -> memory_region_transaction_commit.

Paolo

> As for the device object, can we get the global lock inside the object
> free/destroy function for now?
> 
> If we can remove the global lock inside RCU thread, we can save 9MB heap
> memory, that's a lot!
> 
> Please share with us if you have other idea to do this.
> 
> Thanks
> Anthony
> 
> 
> 
>
diff mbox

Patch

diff --git a/util/rcu.c b/util/rcu.c
index 9adc5e4..c5c373c 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -167,7 +167,7 @@  void synchronize_rcu(void)
 }
 
 
-#define RCU_CALL_MIN_SIZE        30
+#define RCU_CALL_MIN_SIZE        5
 
 /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
  * from liburcu.  Note that head is only used by the consumer.
@@ -241,7 +241,7 @@  static void *call_rcu_thread(void *opaque)
          * added before synchronize_rcu() starts.
          */
         while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
-            g_usleep(10000);
+            g_usleep(100);
             if (n == 0) {
                 qemu_event_reset(&rcu_call_ready_event);
                 n = atomic_read(&rcu_call_count);
@@ -254,24 +254,20 @@  static void *call_rcu_thread(void *opaque)
 
         atomic_sub(&rcu_call_count, n);
         synchronize_rcu();
-        qemu_mutex_lock_iothread();
         while (n > 0) {
             node = try_dequeue();
             while (!node) {
-                qemu_mutex_unlock_iothread();
                 qemu_event_reset(&rcu_call_ready_event);
                 node = try_dequeue();
                 if (!node) {
                     qemu_event_wait(&rcu_call_ready_event);
                     node = try_dequeue();
                 }
-                qemu_mutex_lock_iothread();
             }
 
             n--;
             node->func(node);
         }
-        qemu_mutex_unlock_iothread();
     }
     abort();
 }