[13/16] migration: Create thread infrastructure for multifd recv side

Message ID 20170313124434.1043-14-quintela@redhat.com (mailing list archive)
State New, archived

Commit Message

Juan Quintela March 13, 2017, 12:44 p.m. UTC
We make the locking and the transfer of information explicit, even though
we are still receiving everything through the main thread.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/ram.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 8 deletions(-)

Comments

Paolo Bonzini March 14, 2017, 9:23 a.m. UTC | #1
On 13/03/2017 13:44, Juan Quintela wrote:
>          case RAM_SAVE_FLAG_MULTIFD_PAGE:
>              fd_num = qemu_get_be16(f);
> -            if (fd_num != 0) {
> -                /* this is yet an unused variable, changed later */
> -                fd_num = fd_num;
> -            }
> +            multifd_recv_page(host, fd_num);
>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>              break;

I still believe this design is a mistake.

Paolo
Dr. David Alan Gilbert March 17, 2017, 1:02 p.m. UTC | #2
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 13/03/2017 13:44, Juan Quintela wrote:
> >          case RAM_SAVE_FLAG_MULTIFD_PAGE:
> >              fd_num = qemu_get_be16(f);
> > -            if (fd_num != 0) {
> > -                /* this is yet an unused variable, changed later */
> > -                fd_num = fd_num;
> > -            }
> > +            multifd_recv_page(host, fd_num);
> >              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> >              break;
> 
> I still believe this design is a mistake.

Is it a use of a separate FD carrying all of the flags/addresses that
you object to?

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini March 17, 2017, 4:05 p.m. UTC | #3
On 17/03/2017 14:02, Dr. David Alan Gilbert wrote:
>>>          case RAM_SAVE_FLAG_MULTIFD_PAGE:
>>>              fd_num = qemu_get_be16(f);
>>> -            if (fd_num != 0) {
>>> -                /* this is yet an unused variable, changed later */
>>> -                fd_num = fd_num;
>>> -            }
>>> +            multifd_recv_page(host, fd_num);
>>>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>>>              break;
>> I still believe this design is a mistake.
> Is it a use of a separate FD carrying all of the flags/addresses that
> you object to?

Yes, it introduces a serialization point unnecessarily, and I don't
believe the rationale that Juan offered was strong enough.

This is certainly true on the receive side, but serialization is not
even necessary on the send side.  Multiple threads can efficiently split
the work among themselves and visit the dirty bitmap without a central
distributor.
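
As a rough illustration of that alternative (plain pthreads; dirty_bitmap,
send_page() and the fixed slicing below are hypothetical stand-ins, not QEMU
code), each sender thread could simply walk its own slice of the dirty
bitmap, with no central distributor handing out pages:

/*
 * Illustrative only: each sender thread scans its own fixed slice of a
 * dirty bitmap, so no central distributor hands out pages.  dirty_bitmap,
 * send_page() and the slicing are hypothetical stand-ins, not QEMU APIs.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_PAGES   4096
#define NR_THREADS 4

static bool dirty_bitmap[NR_PAGES];     /* filled in by dirty tracking */

static void send_page(size_t page)      /* placeholder for the real I/O */
{
    printf("sent page %zu\n", page);
}

struct scan_range {
    size_t first, last;                 /* this thread owns [first, last) */
};

static void *sender_thread(void *opaque)
{
    struct scan_range *r = opaque;

    /* No locking on the slice: only this thread ever touches it here. */
    for (size_t i = r->first; i < r->last; i++) {
        if (dirty_bitmap[i]) {
            dirty_bitmap[i] = false;
            send_page(i);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NR_THREADS];
    struct scan_range ranges[NR_THREADS];
    size_t chunk = NR_PAGES / NR_THREADS;

    dirty_bitmap[7] = dirty_bitmap[2000] = true;   /* fake some dirty pages */

    for (int i = 0; i < NR_THREADS; i++) {
        ranges[i].first = i * chunk;
        ranges[i].last  = (i == NR_THREADS - 1) ? NR_PAGES : (i + 1) * chunk;
        pthread_create(&threads[i], NULL, sender_thread, &ranges[i]);
    }
    for (int i = 0; i < NR_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}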

I need to study the code more to understand another issue.  Say you have
a page that is sent to two different threads in two different
iterations, like

    thread 1
      iteration 1: pages 3, 7
    thread 2
      iteration 1: page 3
      iteration 2: page 7

Does the code ensure that all threads wait at the end of an iteration?
Otherwise, thread 2 could process page 7 from iteration 2 before or
while thread 1 processes the same page from iteration 1.

Paolo
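
For reference, the property being asked about is normally guaranteed by some
per-iteration synchronisation: no thread may move on to iteration N+1 until
every thread has drained iteration N.  A minimal sketch of that idea using a
pthread barrier (illustrative only; worker() and process_pages() are made
up, and this is not the mechanism the series itself uses):

/*
 * Illustrative only: with a barrier at the end of every iteration, no
 * thread can start on iteration N+1 pages while another thread is still
 * working on iteration N.  process_pages() is a hypothetical stand-in.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NR_THREADS    2
#define NR_ITERATIONS 2

static pthread_barrier_t iter_barrier;

static void process_pages(int thread, int iteration)
{
    printf("thread %d: iteration %d pages done\n", thread, iteration);
}

static void *worker(void *opaque)
{
    int id = (int)(intptr_t)opaque;

    for (int it = 0; it < NR_ITERATIONS; it++) {
        process_pages(id, it);
        /* Everyone must finish iteration 'it' before anyone starts 'it + 1'. */
        pthread_barrier_wait(&iter_barrier);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NR_THREADS];

    pthread_barrier_init(&iter_barrier, NULL, NR_THREADS);
    for (int i = 0; i < NR_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
    }
    for (int i = 0; i < NR_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_barrier_destroy(&iter_barrier);
    return 0;
}
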
Dr. David Alan Gilbert March 17, 2017, 7:36 p.m. UTC | #4
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 17/03/2017 14:02, Dr. David Alan Gilbert wrote:
> >>>          case RAM_SAVE_FLAG_MULTIFD_PAGE:
> >>>              fd_num = qemu_get_be16(f);
> >>> -            if (fd_num != 0) {
> >>> -                /* this is yet an unused variable, changed later */
> >>> -                fd_num = fd_num;
> >>> -            }
> >>> +            multifd_recv_page(host, fd_num);
> >>>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> >>>              break;
> >> I still believe this design is a mistake.
> > Is it a use of a separate FD carrying all of the flags/addresses that
> > you object to?
> 
> Yes, it introduces a serialization point unnecessarily, and I don't
> believe the rationale that Juan offered was strong enough.
> 
> This is certainly true on the receive side, but serialization is not
> even necessary on the send side.

Is there an easy way to benchmark it (without writing both) to figure
out if sending (word) (page) on one fd is less efficient than sending
two fd's with the pages and words separate?

> Multiple threads can efficiently split
> the work among themselves and visit the dirty bitmap without a central
> distributor.

I mostly agree; I kind of fancy the idea of having one per NUMA node;
but a central distributor might be a good idea anyway in the cases
where you find the heavy writers all happen to be in the same area.

> 
> I need to study the code more to understand another issue.  Say you have
> a page that is sent to two different threads in two different
> iterations, like
> 
>     thread 1
>       iteration 1: pages 3, 7
>     thread 2
>       iteration 1: page 3
>       iteration 2: page 7
> 
> Does the code ensure that all threads wait at the end of an iteration?
> Otherwise, thread 2 could process page 7 from iteration 2 before or
> while thread 1 processes the same page from iteration 1.

I think there's a sync at the end of each iteration on Juan's current code
that stops that.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini March 20, 2017, 11:15 a.m. UTC | #5
On 17/03/2017 20:36, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>> On 17/03/2017 14:02, Dr. David Alan Gilbert wrote:
>>>>>          case RAM_SAVE_FLAG_MULTIFD_PAGE:
>>>>>              fd_num = qemu_get_be16(f);
>>>>> -            if (fd_num != 0) {
>>>>> -                /* this is yet an unused variable, changed later */
>>>>> -                fd_num = fd_num;
>>>>> -            }
>>>>> +            multifd_recv_page(host, fd_num);
>>>>>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>>>>>              break;
>>>> I still believe this design is a mistake.
>>> Is it a use of a separate FD carrying all of the flags/addresses that
>>> you object to?
>>
>> Yes, it introduces a serialization point unnecessarily, and I don't
>> believe the rationale that Juan offered was strong enough.
>>
>> This is certainly true on the receive side, but serialization is not
>> even necessary on the send side.
> 
> Is there an easy way to benchmark it (without writing both) to figure
> out if sending (word) (page) on one fd is less efficient than sending
> two fd's with the pages and words separate?

I think it shouldn't be hard to write a version which keeps the central
distributor but puts the metadata in the auxiliary fds too.

But I think what matters is not efficiency, but rather being more
future-proof.  Besides the freedom to change the implementation, Juan's
current code simply has no commands in auxiliary file descriptors, which
can be very limiting.
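
To make that concrete, "commands in the auxiliary file descriptors" could be
as small as a per-channel packet header with a command field.  A purely
hypothetical sketch; nothing in this series defines such a format:

/*
 * Purely hypothetical wire format, sketching what "commands in the
 * auxiliary file descriptors" could look like; nothing in this series
 * defines such a header.
 */
#include <stdint.h>

enum multifd_cmd {
    MULTIFD_CMD_PAGES = 0,      /* payload is a batch of guest pages      */
    MULTIFD_CMD_SYNC  = 1,      /* end-of-iteration synchronisation point */
    MULTIFD_CMD_QUIT  = 2,      /* channel is shutting down               */
};

struct multifd_packet_hdr {
    uint32_t magic;             /* sanity check for the stream             */
    uint32_t version;           /* lets the format evolve later            */
    uint16_t cmd;               /* one of enum multifd_cmd                 */
    uint16_t flags;             /* per-command modifiers                   */
    uint32_t num_pages;         /* pages following this header, if any     */
    uint64_t iteration;         /* which pass over RAM the pages belong to */
};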

Paolo

>> Multiple threads can efficiently split
>> the work among themselves and visit the dirty bitmap without a central
>> distributor.
> 
> I mostly agree; I kind of fancy the idea of having one per NUMA node;
> but a central distributor might be a good idea anyway in the cases
> where you find the heavy writers all happen to be in the same area.
> 
>>
>> I need to study the code more to understand another issue.  Say you have
>> a page that is sent to two different threads in two different
>> iterations, like
>>
>>     thread 1
>>       iteration 1: pages 3, 7
>>     thread 2
>>       iteration 1: page 3
>>       iteration 2: page 7
>>
>> Does the code ensure that all threads wait at the end of an iteration?
>> Otherwise, thread 2 could process page 7 from iteration 2 before or
>> while thread 1 processes the same page from iteration 1.
> 
> I think there's a sync at the end of each iteration on Juan's current code
> that stops that.
Juan Quintela March 30, 2017, 11:56 a.m. UTC | #6
Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 17/03/2017 20:36, Dr. David Alan Gilbert wrote:
>> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>> On 17/03/2017 14:02, Dr. David Alan Gilbert wrote:
>>>>>>          case RAM_SAVE_FLAG_MULTIFD_PAGE:
>>>>>>              fd_num = qemu_get_be16(f);
>>>>>> -            if (fd_num != 0) {
>>>>>> -                /* this is yet an unused variable, changed later */
>>>>>> -                fd_num = fd_num;
>>>>>> -            }
>>>>>> +            multifd_recv_page(host, fd_num);
>>>>>>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>>>>>>              break;
>>>>> I still believe this design is a mistake.
>>>> Is it a use of a separate FD carrying all of the flags/addresses that
>>>> you object to?
>>>
>>> Yes, it introduces a serialization point unnecessarily, and I don't
>>> believe the rationale that Juan offered was strong enough.
>>>
>>> This is certainly true on the receive side, but serialization is not
>>> even necessary on the send side.
>> 
>> Is there an easy way to benchmark it (without writing both) to figure
>> out if sending (word) (page) on one fd is less efficient than sending
>> two fd's with the pages and words separate?
>
> I think it shouldn't be hard to write a version which keeps the central
> distributor but puts the metadata in the auxiliary fds too.

That is not difficult to do (famous last words).
I will try to test both approaches for the next version, thanks.

>
> But I think what matters is not efficiency, but rather being more
> future-proof.  Besides the freedom to change the implementation, Juan's
> current code simply has no commands in auxiliary file descriptors, which
> can be very limiting.
>
> Paolo
>
>>> Multiple threads can efficiently split
>>> the work among themselves and visit the dirty bitmap without a central
>>> distributor.
>> 
>> I mostly agree; I kind of fancy the idea of having one per NUMA node;
>> but a central distributor might be a good idea anyway in the cases
>> where you find the heavy writers all happen to be in the same area.
>> 
>>>
>>> I need to study the code more to understand another issue.  Say you have
>>> a page that is sent to two different threads in two different
>>> iterations, like
>>>
>>>     thread 1
>>>       iteration 1: pages 3, 7
>>>     thread 2
>>>       iteration 1: page 3
>>>       iteration 2: page 7
>>>
>>> Does the code ensure that all threads wait at the end of an iteration?
>>> Otherwise, thread 2 could process page 7 from iteration 2 before or
>>> while thread 1 processes the same page from iteration 1.
>> 
>> I think there's a sync at the end of each iteration on Juan's current code
>> that stops that.

This can't happen by design.  We sync all threads at the end of each migration.

Later, Juan.

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 6f5ca50..3b1a2dc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -44,6 +44,7 @@ 
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
+#include "qemu/iov.h"
 
 static int dirty_rate_high_cnt;
 
@@ -536,7 +537,7 @@  int migrate_multifd_send_threads_create(void)
     return 0;
 }
 
-static int multifd_send_page(uint8_t *address)
+static uint16_t multifd_send_page(uint8_t *address, bool last_page)
 {
     int i, j;
     MultiFDSendParams *p = NULL; /* make happy gcc */
@@ -552,8 +553,10 @@  static int multifd_send_page(uint8_t *address)
     pages.iov[pages.num].iov_len = TARGET_PAGE_SIZE;
     pages.num++;
 
-    if (pages.num < (pages.size - 1)) {
-        return UINT16_MAX;
+    if (!last_page) {
+        if (pages.num < (pages.size - 1)) {
+            return UINT16_MAX;
+        }
     }
 
     qemu_sem_wait(&multifd_send_state->sem);
@@ -581,13 +584,18 @@  static int multifd_send_page(uint8_t *address)
 }
 
 struct MultiFDRecvParams {
+    /* not changed */
     int id;
     QemuThread thread;
     QIOChannel *c;
     QemuSemaphore init;
+    QemuSemaphore ready;
     QemuSemaphore sem;
     QemuMutex mutex;
+    /* protected by param mutex */
     bool quit;
+    multifd_pages_t pages;
+    bool done;
 };
 typedef struct MultiFDRecvParams MultiFDRecvParams;
 
@@ -641,6 +649,7 @@  static void *multifd_recv_thread(void *opaque)
 
     qio_channel_read(p->c, &start, 1, &error_abort);
     qemu_sem_post(&p->init);
+    qemu_sem_post(&p->ready);
 
     while (true) {
         qemu_mutex_lock(&p->mutex);
@@ -648,6 +657,13 @@  static void *multifd_recv_thread(void *opaque)
             qemu_mutex_unlock(&p->mutex);
             break;
         }
+        if (p->pages.num) {
+            p->pages.num = 0;
+            p->done = true;
+            qemu_mutex_unlock(&p->mutex);
+            qemu_sem_post(&p->ready);
+            continue;
+        }
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_wait(&p->sem);
     }
@@ -672,8 +688,11 @@  int migrate_multifd_recv_threads_create(void)
         qemu_mutex_init(&p->mutex);
         qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->init, 0);
+        qemu_sem_init(&p->ready, 0);
         p->quit = false;
         p->id = i;
+        p->done = false;
+        multifd_init_group(&p->pages);
         p->c = socket_recv_channel_create();
 
         if (!p->c) {
@@ -690,6 +709,42 @@  int migrate_multifd_recv_threads_create(void)
     return 0;
 }
 
+static void multifd_recv_page(uint8_t *address, uint16_t fd_num)
+{
+    int thread_count;
+    MultiFDRecvParams *p;
+    static multifd_pages_t pages;
+    static bool once;
+
+    if (!once) {
+        multifd_init_group(&pages);
+        once = true;
+    }
+
+    pages.iov[pages.num].iov_base = address;
+    pages.iov[pages.num].iov_len = TARGET_PAGE_SIZE;
+    pages.num++;
+
+    if (fd_num == UINT16_MAX) {
+        return;
+    }
+
+    thread_count = migrate_multifd_threads();
+    assert(fd_num < thread_count);
+    p = &multifd_recv_state->params[fd_num];
+
+    qemu_sem_wait(&p->ready);
+
+    qemu_mutex_lock(&p->mutex);
+    p->done = false;
+    iov_copy(p->pages.iov, pages.num, pages.iov, pages.num, 0,
+             iov_size(pages.iov, pages.num));
+    p->pages.num = pages.num;
+    pages.num = 0;
+    qemu_mutex_unlock(&p->mutex);
+    qemu_sem_post(&p->sem);
+}
+
 /**
  * save_page_header: Write page header to wire
  *
@@ -1153,7 +1208,7 @@  static int ram_multifd_page(QEMUFile *f, PageSearchStatus *pss,
     if (pages == -1) {
         *bytes_transferred +=
             save_page_header(f, block, offset | RAM_SAVE_FLAG_MULTIFD_PAGE);
-        fd_num = multifd_send_page(p);
+        fd_num = multifd_send_page(p, migration_dirty_pages == 1);
         qemu_put_be16(f, fd_num);
         *bytes_transferred += 2; /* size of fd_num */
         qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
@@ -3012,10 +3067,7 @@  static int ram_load(QEMUFile *f, void *opaque, int version_id)
 
         case RAM_SAVE_FLAG_MULTIFD_PAGE:
             fd_num = qemu_get_be16(f);
-            if (fd_num != 0) {
-                /* this is yet an unused variable, changed later */
-                fd_num = fd_num;
-            }
+            multifd_recv_page(host, fd_num);
             qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
             break;
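
The heart of what the patch adds on the receive side is the ready/sem
handshake between ram_load() on the main thread and each recv thread: the
main thread waits for p->ready, fills the page batch under p->mutex, and
posts p->sem; the recv thread consumes the batch and posts p->ready again.
A stripped-down model of that pattern with plain POSIX primitives (the page
batch reduced to a single integer; a sketch of the pattern, not the actual
QEMU code):

/*
 * Stripped-down model of the ready/sem handshake the patch adds between
 * ram_load() on the main thread and one multifd recv thread.  QEMU's
 * QemuSemaphore/QemuMutex are replaced by POSIX primitives and the page
 * batch by a single integer; a sketch of the pattern, not QEMU code.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stdio.h>

struct recv_params {
    pthread_mutex_t mutex;
    sem_t ready;                /* posted by worker: "give me a new batch"  */
    sem_t sem;                  /* posted by main:   "a new batch is ready" */
    bool quit;
    int batch;                  /* stand-in for the multifd_pages_t batch   */
};

static void *recv_thread(void *opaque)
{
    struct recv_params *p = opaque;

    sem_post(&p->ready);                 /* initial "I am idle" signal */
    while (true) {
        sem_wait(&p->sem);               /* wait for work or for quit  */
        pthread_mutex_lock(&p->mutex);
        if (p->quit) {
            pthread_mutex_unlock(&p->mutex);
            break;
        }
        printf("worker: consumed batch %d\n", p->batch);
        p->batch = 0;
        pthread_mutex_unlock(&p->mutex);
        sem_post(&p->ready);             /* ready for the next batch   */
    }
    return NULL;
}

int main(void)
{
    struct recv_params p = { .quit = false, .batch = 0 };
    pthread_t t;

    pthread_mutex_init(&p.mutex, NULL);
    sem_init(&p.ready, 0, 0);
    sem_init(&p.sem, 0, 0);
    pthread_create(&t, NULL, recv_thread, &p);

    for (int i = 1; i <= 3; i++) {       /* main thread hands over batches */
        sem_wait(&p.ready);              /* wait until the worker is idle  */
        pthread_mutex_lock(&p.mutex);
        p.batch = i;
        pthread_mutex_unlock(&p.mutex);
        sem_post(&p.sem);
    }

    sem_wait(&p.ready);                  /* let the last batch drain       */
    pthread_mutex_lock(&p.mutex);
    p.quit = true;
    pthread_mutex_unlock(&p.mutex);
    sem_post(&p.sem);
    pthread_join(t, NULL);
    return 0;
}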