
[5/9] migration/qemu-file: Add qemu_file_get_to_fd()

Message ID 20220512154320.19697-6-avihaih@nvidia.com (mailing list archive)
State New, archived
Series vfio/migration: Implement VFIO migration protocol v2

Commit Message

Avihai Horon May 12, 2022, 3:43 p.m. UTC
Add new function qemu_file_get_to_fd() that allows reading data from
QEMUFile and writing it straight into a given fd.

This will be used later in VFIO migration code.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
 migration/qemu-file.h |  1 +
 2 files changed, 35 insertions(+)
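
For orientation, a minimal sketch of how a consumer of the new helper
might look; the function name, the be64 chunk header, and the data_fd
source below are illustrative assumptions, not code from this series:

/* hypothetical caller: stream one chunk of device state from the
 * migration stream straight into the device's data fd */
static int vfio_v2_load_chunk(QEMUFile *f, int data_fd)
{
    /* assumed per-chunk header: size of the device state that follows */
    uint64_t data_size = qemu_get_be64(f);

    /* no intermediate chunk-sized buffer is allocated */
    return qemu_file_get_to_fd(f, data_fd, data_size);
}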

Comments

Juan Quintela May 16, 2022, 11:31 a.m. UTC | #1
Avihai Horon <avihaih@nvidia.com> wrote:
> Add new function qemu_file_get_to_fd() that allows reading data from
> QEMUFile and writing it straight into a given fd.
>
> This will be used later in VFIO migration code.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>  migration/qemu-file.h |  1 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 1479cddad9..cad3d32eb3 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>  {
>      return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>  }
> +
> +/*
> + * Read size bytes from QEMUFile f and write them to fd.
> + */
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
> +{
> +    while (size) {
> +        size_t pending = f->buf_size - f->buf_index;
> +        ssize_t rc;
> +
> +        if (!pending) {
> +            rc = qemu_fill_buffer(f);
> +            if (rc < 0) {
> +                return rc;
> +            }
> +            if (rc == 0) {
> +                return -1;
> +            }
> +            continue;
> +        }
> +
> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
> +        if (rc < 0) {
> +            return rc;
> +        }
> +        if (rc == 0) {
> +            return -1;
> +        }
> +        f->buf_index += rc;
> +        size -= rc;
> +    }
> +
> +    return 0;
> +}

Is there really a performance difference compared to just using:

uint8_t buffer[size];

qemu_get_buffer(f, buffer, size);
write(fd, buffer, size);

Or to put it another way, what sizes are we talking about here?

Thanks, Juan.


> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 3f36d4dc8c..dd26037450 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
>  QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>  void qemu_fflush(QEMUFile *f);
>  void qemu_file_set_blocking(QEMUFile *f, bool block);
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>  
>  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
Avihai Horon May 17, 2022, 12:36 p.m. UTC | #2
On 5/16/2022 2:31 PM, Juan Quintela wrote:
> Avihai Horon <avihaih@nvidia.com> wrote:
>> Add new function qemu_file_get_to_fd() that allows reading data from
>> QEMUFile and writing it straight into a given fd.
>>
>> This will be used later in VFIO migration code.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>>   migration/qemu-file.h |  1 +
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 1479cddad9..cad3d32eb3 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>>   {
>>       return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>>   }
>> +
>> +/*
>> + * Read size bytes from QEMUFile f and write them to fd.
>> + */
>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
>> +{
>> +    while (size) {
>> +        size_t pending = f->buf_size - f->buf_index;
>> +        ssize_t rc;
>> +
>> +        if (!pending) {
>> +            rc = qemu_fill_buffer(f);
>> +            if (rc < 0) {
>> +                return rc;
>> +            }
>> +            if (rc == 0) {
>> +                return -1;
>> +            }
>> +            continue;
>> +        }
>> +
>> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
>> +        if (rc < 0) {
>> +            return rc;
>> +        }
>> +        if (rc == 0) {
>> +            return -1;
>> +        }
>> +        f->buf_index += rc;
>> +        size -= rc;
>> +    }
>> +
>> +    return 0;
>> +}
> Is there really a performance difference compared to just using:
>
> uint8_t buffer[size];
>
> qemu_get_buffer(f, buffer, size);
> write(fd, buffer, size);
>
> Or to put it another way, what sizes are we talking about here?

It depends on the device, but it can range from a few MBs to several GBs, 
AFAIK.

Thanks.

>
> Thanks, Juan.
>
>
>> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
>> index 3f36d4dc8c..dd26037450 100644
>> --- a/migration/qemu-file.h
>> +++ b/migration/qemu-file.h
>> @@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
>>   QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>>   void qemu_fflush(QEMUFile *f);
>>   void qemu_file_set_blocking(QEMUFile *f, bool block);
>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>>
>>   void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>>   void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
Juan Quintela May 18, 2022, 11:54 a.m. UTC | #3
Avihai Horon <avihaih@nvidia.com> wrote:
> On 5/16/2022 2:31 PM, Juan Quintela wrote:
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>> Add new function qemu_file_get_to_fd() that allows reading data from
>>> QEMUFile and writing it straight into a given fd.
>>>
>>> This will be used later in VFIO migration code.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>>>   migration/qemu-file.h |  1 +
>>>   2 files changed, 35 insertions(+)
>>>
>>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>>> index 1479cddad9..cad3d32eb3 100644
>>> --- a/migration/qemu-file.c
>>> +++ b/migration/qemu-file.c
>>> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>>>   {
>>>       return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>>>   }
>>> +
>>> +/*
>>> + * Read size bytes from QEMUFile f and write them to fd.
>>> + */
>>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
>>> +{
>>> +    while (size) {
>>> +        size_t pending = f->buf_size - f->buf_index;
>>> +        ssize_t rc;
>>> +
>>> +        if (!pending) {
>>> +            rc = qemu_fill_buffer(f);
>>> +            if (rc < 0) {
>>> +                return rc;
>>> +            }
>>> +            if (rc == 0) {
>>> +                return -1;
>>> +            }
>>> +            continue;
>>> +        }
>>> +
>>> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
>>> +        if (rc < 0) {
>>> +            return rc;
>>> +        }
>>> +        if (rc == 0) {
>>> +            return -1;
>>> +        }
>>> +        f->buf_index += rc;
>>> +        size -= rc;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> Is there really a performance difference compared to just using:
>>
>> uint8_t buffer[size];
>>
>> qemu_get_buffer(f, buffer, size);
>> write(fd, buffer, size);
>>
>> Or to put it another way, what sizes are we talking about here?
>
> It depends on the device, but it can range from a few MBs to several
> GBs, AFAIK.

a few MB is ok.

Several GB on the main migration channel without a single
header/whatever?

This sounds like a recipe for disaster IMHO.
Remember that on the source side, we have a migration thread.  This patch
will just block that migration thread.

If we are using 10 Gigabit networking (1 GB/second for friends, to make
the math easy), each GB will take 1 second of downtime.

During that downtime, migration control is basically handicapped.
And you are sending this data with qemu_put_buffer_async().  That
function just adds its buffer to an iovec; the only existing user adds
4KB (or whatever your page size is) entries to the iovec.  You are
talking about adding gigabytes here.  I don't know how well this is
going to work, but my guess is that migrate_cancel is not going to be
happy.
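
For reference, a simplified sketch of what qemu_put_buffer_async() does
(paraphrased from migration/qemu-file.c around this time; field names
may differ slightly):

void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
                           bool may_free)
{
    if (f->last_error) {
        return;
    }

    f->bytes_xfer += size;
    /* no copy here: only the pointer is recorded in the outgoing iovec,
     * so a multi-gigabyte entry stays pinned until the next flush */
    add_to_iovec(f, buf, size, may_free);
}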

On the destination side, this is even worse, because we receive this
multigigabyte chunk in a coroutine, and I am not sure that this will not
completely block all of QEMU while this is happening (again, multi-second
times).

I have to think more about this problem, but I can't see how this is
going to be able to go through the main migration channel.

Later, Juan.
Jason Gunthorpe May 18, 2022, 3:42 p.m. UTC | #4
On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:

> >> Is there really a performance difference compared to just using:
> >>
> >> uint8_t buffer[size];
> >>
> >> qemu_get_buffer(f, buffer, size);
> >> write(fd, buffer, size);
> >>
> >> Or to put it another way, what sizes are we talking about here?
> >
> > It depends on the device, but it can range from a few MBs to several
> > GBs, AFAIK.
> 
> a few MB is ok.
> 
> Several GB on the main migration channel without a single
> header/whatever?

IIRC it iterates in multi-megabyte chunks, each of which gets a header.

The chunk size is set by the size of the buffer mmap.
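
To make the shape concrete, a hedged sketch of such a chunked save loop
(the names, the be64 size header, and the zero end marker are
illustrative assumptions, not this series' actual code):

static void vfio_send_state_chunks(QEMUFile *f, int data_fd,
                                   uint8_t *mmap_buf, size_t chunk_size)
{
    ssize_t n;

    /* each iteration sends one buffer-sized chunk with a small header */
    while ((n = read(data_fd, mmap_buf, chunk_size)) > 0) {
        qemu_put_be64(f, n);              /* per-chunk size header */
        qemu_put_buffer(f, mmap_buf, n);  /* chunk payload */
    }
    qemu_put_be64(f, 0);                  /* assumed end-of-state marker */
}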

The overall point is that memcpying GBs is going to be taxing, so we
want to eliminate copies on this path, especially copies that result
in more system calls.

We are expecting to look into further optimization down the road here
because even this is still too slow.

Jason
Daniel P. Berrangé May 18, 2022, 4 p.m. UTC | #5
On Wed, May 18, 2022 at 12:42:37PM -0300, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:
> 
> > >> Is there really a performance difference compared to just using:
> > >>
> > >> uint8_t buffer[size];
> > >>
> > >> qemu_get_buffer(f, buffer, size);
> > >> write(fd, buffer, size);
> > >>
> > >> Or to put it another way, what sizes are we talking about here?
> > >
> > > It depends on the device, but it can range from a few MBs to several
> > > GBs, AFAIK.
> > 
> > a few MB is ok.
> > 
> > Several GB on the main migration channel without a single
> > header/whatever?
> 
> IIRC it iterates in multi-megabyte chunks, each of which gets a header.
> 
> The chunk size is set by the size of the buffer mmap.
> 
> The overall point is that memcpying GBs is going to be taxing, so we
> want to eliminate copies on this path, especially copies that result
> in more system calls.
> 
> We are expecting to look into further optimization down the road here
> because even this is still too slow.

Considering the possibility of future optimization, IMHO adding this
kind of API at the QEMUFile level is too high up. We'd be better off
pushing the impl down into the QIOChannel API level.

   int64_t qio_channel_copy_range(QIOChannel *srcioc,
                                  QIOChannel *tgtioc,
                                  size_t len);

The QIOChannel impl can do pretty much what you showed in the general
case, but in special cases it could have the option to offload to the
kernel copy_file_range() syscall to avoid the context switches.
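
As a hedged sketch of that generic case (the Error **errp parameter and
the 64KB bounce-buffer size are assumptions added here, not part of the
proposed signature):

int64_t qio_channel_copy_range(QIOChannel *srcioc, QIOChannel *tgtioc,
                               size_t len, Error **errp)
{
    char buf[64 * 1024];   /* arbitrary, bounded bounce buffer */
    int64_t total = 0;

    while (len) {
        ssize_t nread = qio_channel_read(srcioc, buf,
                                         MIN(len, sizeof(buf)), errp);
        if (nread <= 0) {
            /* error, or premature end of stream */
            return -1;
        }
        if (qio_channel_write_all(tgtioc, buf, nread, errp) < 0) {
            return -1;
        }
        total += nread;
        len -= nread;
    }

    return total;
}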

With regards,
Daniel
Jason Gunthorpe May 18, 2022, 4:16 p.m. UTC | #6
On Wed, May 18, 2022 at 05:00:26PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 18, 2022 at 12:42:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:
> > 
> > > >> Is there really a performance difference compared to just using:
> > > >>
> > > >> uint8_t buffer[size];
> > > >>
> > > >> qemu_get_buffer(f, buffer, size);
> > > >> write(fd, buffer, size);
> > > >>
> > > >> Or to put it another way, what sizes are we talking about here?
> > > >
> > > > It depends on the device, but it can range from a few MBs to several
> > > > GBs, AFAIK.
> > > 
> > > a few MB is ok.
> > > 
> > > Several GB on the main migration channel without a single
> > > header/whatever?
> > 
> > IIRC it iterates in multi-megabyte chunks, each of which gets a header.
> > 
> > The chunk size is set by the size of the buffer mmap.
> > 
> > The overall point is that memcpying GBs is going to be taxing, so we
> > want to eliminate copies on this path, especially copies that result
> > in more system calls.
> > 
> > We are expecting to look into further optimization down the road here
> > because even this is still too slow.
> 
> Considering the possibility of future optimization, IMHO adding this
> kind of API at the QEMUFile level is too high up. We'd be better off
> pushing the impl down into the QIOChannel API level.
> 
>    int64_t qio_channel_copy_range(QIOChannel *srcioc,
>                                   QIOChannel *tgtioc,
>                                   size_t len);
> 
> The QIOChannel impl can do pretty much what you showed in the general
> case, but in special cases it could have the option to offload to the
> kernel copy_file_range() syscall to avoid the context switches.

This is probably something to do down the road when we figure out
exactly what is best.

Currently we don't have kernel support for optimized copy_file_range()
(i.e. fops splice_read) inside the VFIO drivers, so copy_file_range()
will just fail.

I didn't look closely again, but IIRC the challenge is that QIOChannel
doesn't have a large temporary buffer ready to use for a non-splice
fallback path.
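
For what the offload side could look like once such kernel support
exists, a hedged sketch (extracting the fds from the channels and the
fallback policy are assumptions):

/* copy_file_range() needs _GNU_SOURCE and <unistd.h>; a caller falls
 * back to a bounce buffer when this fails (VFIO fds reject it today) */
static int64_t copy_range_offload(int src_fd, int dst_fd, size_t len)
{
    int64_t total = 0;

    while (len) {
        ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL, len, 0);
        if (n < 0) {
            return -1;   /* e.g. EXDEV/EINVAL: not supported here */
        }
        if (n == 0) {
            break;       /* premature end of input */
        }
        total += n;
        len -= n;
    }

    return total;
}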

Jason

Patch

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1479cddad9..cad3d32eb3 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
 {
     return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
 }
+
+/*
+ * Read size bytes from QEMUFile f and write them to fd.
+ */
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
+{
+    while (size) {
+        size_t pending = f->buf_size - f->buf_index;
+        ssize_t rc;
+
+        if (!pending) {
+            rc = qemu_fill_buffer(f);
+            if (rc < 0) {
+                return rc;
+            }
+            if (rc == 0) {
+                return -1;
+            }
+            continue;
+        }
+
+        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
+        if (rc < 0) {
+            return rc;
+        }
+        if (rc == 0) {
+            return -1;
+        }
+        f->buf_index += rc;
+        size -= rc;
+    }
+
+    return 0;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 3f36d4dc8c..dd26037450 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
 QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);