
[1/1] virtio-blk-ccw: tweak the default for num_queues

Message ID 20201109154831.20779-1-pasic@linux.ibm.com (mailing list archive)
State New, archived

Commit Message

Halil Pasic Nov. 9, 2020, 3:48 p.m. UTC
Currently the default value of num_queues is effectively 1 for
virtio-blk-ccw. Recently 9445e1e15e ("virtio-blk-pci: default num_queues
to -smp N") changed the default for pci to the number of vcpus, citing
better interrupt locality and better performance as the rationale.

While virtio-blk-ccw does not yet benefit from better interrupt
locality, measurements have shown that for secure VMs multiqueue does
help with performance. Since the bounce buffering happens with the queue
lock held (in the guest) this is hardly a surprise.

As for non-secure VMs the extra queues shouldn't hurt too much.

Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---

We would prefer to land this for 5.2. If we do, then commit 9445e1e15e
("virtio-blk-pci: default num_queues to -smp N") has already taken care of
all the necessary compat handling.

If that's not possible, I will send a version that does the necessary
compat handling.
---
 hw/s390x/virtio-ccw-blk.c | 6 ++++++
 1 file changed, 6 insertions(+)


base-commit: 2a190a7256a3e0563b29ffd67e0164097b4a6dac

Comments

Christian Borntraeger Nov. 9, 2020, 3:53 p.m. UTC | #1
On 09.11.20 16:48, Halil Pasic wrote:
> Currently the default value of num_queues is effectively 1 for
> virtio-blk-ccw. Recently 9445e1e15e ("virtio-blk-pci: default num_queues
> to -smp N") changed the default for pci to the number of vcpus, citing
> interrupt better locality and better performance as a rationale.
> 
> While virtio-blk-ccw does not yet benefit from better interrupt
> locality, measurements have shown that for secure VMs multiqueue does
> help with performance. Since the bounce buffering happens with the queue
> lock held (in the guest) this is hardly a surprise.
> 
> As for non-secure VMs the extra queues shouldn't hurt too much.

In fact, we have also seen improvements from multiqueue for non-secure guests,
as a single virtqueue seems to have a hard bandwidth cap that can be lower than
what newer storage devices are capable of.

> 
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> 
> We would prefer to land this for 5.2. If we do then commit 9445e1e15e
> ("virtio-blk-pci: default num_queues to -smp N") took care of all the
> necessary compat handling.
> 
> If that's not possible, I will send a version that does the necessary
> compat handling.

Right. At the moment the original patch has all the necessary compat handling
for 5.1 for all platforms. Adding multiqueue for s390x after 5.2 would mean
that we would need additional platform-specific compat handling, which would
increase the patch size unnecessarily.
In the end, you could consider this an enhancement (not calling it a fix) of
the original patch.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  hw/s390x/virtio-ccw-blk.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/hw/s390x/virtio-ccw-blk.c b/hw/s390x/virtio-ccw-blk.c
> index 2294ce1ce4..7296140dde 100644
> --- a/hw/s390x/virtio-ccw-blk.c
> +++ b/hw/s390x/virtio-ccw-blk.c
> @@ -10,6 +10,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "hw/boards.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/virtio.h"
>  #include "qapi/error.h"
> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>  {
>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>      DeviceState *vdev = DEVICE(&dev->vdev);
> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> +
> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> +    }
>  
>      qdev_realize(vdev, BUS(&ccw_dev->bus), errp);
>  }
> 
> base-commit: 2a190a7256a3e0563b29ffd67e0164097b4a6dac
>
Cornelia Huck Nov. 9, 2020, 4:06 p.m. UTC | #2
On Mon,  9 Nov 2020 16:48:31 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> Currently the default value of num_queues is effectively 1 for
> virtio-blk-ccw. Recently 9445e1e15e ("virtio-blk-pci: default num_queues
> to -smp N") changed the default for pci to the number of vcpus, citing
> interrupt better locality and better performance as a rationale.
> 
> While virtio-blk-ccw does not yet benefit from better interrupt
> locality, measurements have shown that for secure VMs multiqueue does
> help with performance. Since the bounce buffering happens with the queue
> lock held (in the guest) this is hardly a surprise.
> 
> As for non-secure VMs the extra queues shouldn't hurt too much.
> 
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> 
> We would prefer to land this for 5.2. If we do then commit 9445e1e15e
> ("virtio-blk-pci: default num_queues to -smp N") took care of all the
> necessary compat handling.
> 
> If that's not possible, I will send a version that does the necessary
> compat handling.

I think we can still get this into 5.2, and that would indeed be less
hassle.

> ---
>  hw/s390x/virtio-ccw-blk.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/hw/s390x/virtio-ccw-blk.c b/hw/s390x/virtio-ccw-blk.c
> index 2294ce1ce4..7296140dde 100644
> --- a/hw/s390x/virtio-ccw-blk.c
> +++ b/hw/s390x/virtio-ccw-blk.c
> @@ -10,6 +10,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "hw/boards.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/virtio.h"
>  #include "qapi/error.h"
> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>  {
>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>      DeviceState *vdev = DEVICE(&dev->vdev);
> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> +
> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> +    }

I would like to have a comment explaining the numbers here, however.

virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
possible, apply some other capping). 4 seems to be a bit arbitrary
without explanation, although I'm sure you did some measurements :)
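
(For reference, the pci side introduced by 9445e1e15e does roughly the
following; this is paraphrased rather than quoted verbatim, so please treat
it as a sketch and check the tree for the exact code:)

    /* hw/virtio/virtio-pci.c, paraphrased */
    unsigned virtio_pci_optimal_num_queues(unsigned fixed_queues)
    {
        /* aim for a 1:1 vq:vCPU mapping, capped by the free queue slots */
        return MIN(current_machine->smp.cpus, VIRTIO_QUEUE_MAX - fixed_queues);
    }

    /* and in virtio_blk_pci_realize(): */
    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
        conf->num_queues = virtio_pci_optimal_num_queues(0);
    }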

Do we also want something similar for virtio-scsi (and vhost-scsi)?

>  
>      qdev_realize(vdev, BUS(&ccw_dev->bus), errp);
>  }
> 
> base-commit: 2a190a7256a3e0563b29ffd67e0164097b4a6dac
Halil Pasic Nov. 9, 2020, 6:53 p.m. UTC | #3
On Mon, 9 Nov 2020 17:06:16 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> > @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> >  {
> >      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> >      DeviceState *vdev = DEVICE(&dev->vdev);
> > +    VirtIOBlkConf *conf = &dev->vdev.conf;
> > +
> > +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> > +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> > +    }  
> 
> I would like to have a comment explaining the numbers here, however.
> 
> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> possible, apply some other capping). 4 seems to be a bit arbitrary
> without explanation, although I'm sure you did some measurements :)

Frankly, I don't have any measurements yet. For the secure case,
I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
the cc list. @Mimu, can you help us out?

Regarding normal, non-protected VMs, I'm in the middle of producing some
measurement data. This was admittedly a bit rushed because of where we
are in the cycle. Sorry to disappoint you.

The number 4 was suggested by Christian; maybe Christian has some
readily available measurement data for the normal VM case. @Christian,
can you help me out?

Regards,
Halil
Christian Borntraeger Nov. 10, 2020, 8:47 a.m. UTC | #4
On 09.11.20 19:53, Halil Pasic wrote:
> On Mon, 9 Nov 2020 17:06:16 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>>>  {
>>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>>>      DeviceState *vdev = DEVICE(&dev->vdev);
>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
>>> +
>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
>>> +    }  
>>
>> I would like to have a comment explaining the numbers here, however.
>>
>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
>> possible, apply some other capping). 4 seems to be a bit arbitrary
>> without explanation, although I'm sure you did some measurements :)
> 
> Frankly, I don't have any measurements yet. For the secure case,
> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> the cc list. @Mimu can you help us out.
> 
> Regarding the normal non-protected VMs I'm in a middle of producing some
> measurement data. This was admittedly a bit rushed because of where we
> are in the cycle. Sorry to disappoint you.
> 
> The number 4 was suggested by Christian, maybe Christian does have some
> readily available measurement data for the normal VM case. @Christian:
> can you help me out?
My point was to find a balance between performance gain and memory usage.
As a matter of fact, virtqueues do consume memory. So 4 looked like a
reasonable default to me for large guests, as long as we do not have directed
interrupts.

Now, thinking about this again: If we want to change the default to something
else in the future (e.g. to num vcpus) then the compat handling will get
really complicated.

So we can
- go with num queues = num cpus. But this will consume memory
for guests with lots of CPUs.
- go with the proposed logic of min(4,vcpus) and accept that future compat handling
is harder
- defer this change
Cornelia Huck Nov. 10, 2020, 10:15 a.m. UTC | #5
On Tue, 10 Nov 2020 09:47:51 +0100
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 09.11.20 19:53, Halil Pasic wrote:
> > On Mon, 9 Nov 2020 17:06:16 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> >>>  {
> >>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> >>>      DeviceState *vdev = DEVICE(&dev->vdev);
> >>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> >>> +
> >>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> >>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> >>> +    }    
> >>
> >> I would like to have a comment explaining the numbers here, however.
> >>
> >> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> >> possible, apply some other capping). 4 seems to be a bit arbitrary
> >> without explanation, although I'm sure you did some measurements :)  
> > 
> > Frankly, I don't have any measurements yet. For the secure case,
> > I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> > the cc list. @Mimu can you help us out.
> > 
> > Regarding the normal non-protected VMs I'm in a middle of producing some
> > measurement data. This was admittedly a bit rushed because of where we
> > are in the cycle. Sorry to disappoint you.
> > 
> > The number 4 was suggested by Christian, maybe Christian does have some
> > readily available measurement data for the normal VM case. @Christian:
> > can you help me out?  
> My point was to find a balance between performance gain and memory usage.
> As a matter of fact, virtqueue do consume memory. So 4 looked like a
> reasonable default for me for large guests as long as we do not have directed
> interrupts.

Yes, 4 does not look like a bad number, but I still don't feel really
comfortable with it without at least some data.

What about large guests with slow vs. fast storage?

> 
> Now, thinking about this again: If we want to change the default to something
> else in the future (e.g. to num vcpus) then the compat handling will get
> really complicated.

Yes, I fear that will be messy. Just picking a value later will need
compat handling, but not a really complicated one.

> 
> So we can
> - go with num queues = num cpus. But this will consume memory
> for guests with lots of CPUs.

I'm not sure that would be a good choice, as we don't have the benefits
that pci has.

> - go with the proposed logic of min(4,vcpus) and accept that future compat handling
> is harder

With a bit more data, I'd be way more comfortable. Might still be ok
for the next rc.

> - defer this change

We might end up with that, given the timing :( (not blaming anyone)
Halil Pasic Nov. 10, 2020, 10:40 a.m. UTC | #6
On Tue, 10 Nov 2020 09:47:51 +0100
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> 
> 
> On 09.11.20 19:53, Halil Pasic wrote:
> > On Mon, 9 Nov 2020 17:06:16 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> > 
> >>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> >>>  {
> >>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> >>>      DeviceState *vdev = DEVICE(&dev->vdev);
> >>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> >>> +
> >>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> >>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> >>> +    }  
> >>
> >> I would like to have a comment explaining the numbers here, however.
> >>
> >> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> >> possible, apply some other capping). 4 seems to be a bit arbitrary
> >> without explanation, although I'm sure you did some measurements :)
> > 
> > Frankly, I don't have any measurements yet. For the secure case,
> > I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> > the cc list. @Mimu can you help us out.
> > 
> > Regarding the normal non-protected VMs I'm in a middle of producing some
> > measurement data. This was admittedly a bit rushed because of where we
> > are in the cycle. Sorry to disappoint you.
> > 
> > The number 4 was suggested by Christian, maybe Christian does have some
> > readily available measurement data for the normal VM case. @Christian:
> > can you help me out?
> My point was to find a balance between performance gain and memory usage.
> As a matter of fact, virtqueue do consume memory. So 4 looked like a
> reasonable default for me for large guests as long as we do not have directed
> interrupts.
> 
> Now, thinking about this again: If we want to change the default to something
> else in the future (e.g. to num vcpus) then the compat handling will get
> really complicated.

Regarding compat handling, I believe we would need a new property for
virtio-blk-ccw: something like def_num_queues_max. The logic would then
morph to MIN(def_num_queues_max, current_machine->smp.cpus), and we could
relatively freely do compat stuff on def_num_queues_max.

IMHO not pretty, but certainly doable.
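
A very rough sketch of what I have in mind (the property name, the struct
field, and the default of 4 are placeholders for illustration, not a tested
patch):

    /* hypothetical proxy-level property; compat machines could lower it */
    static Property virtio_ccw_blk_properties[] = {
        /* ... existing properties ... */
        DEFINE_PROP_UINT32("def-num-queues-max", VirtIOBlkCcw,
                           def_num_queues_max, 4),
        DEFINE_PROP_END_OF_LIST(),
    };

    /* in virtio_ccw_blk_realize() */
    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
        conf->num_queues = MIN(dev->def_num_queues_max,
                               current_machine->smp.cpus);
    }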

> 
> So we can
> - go with num queues = num cpus. But this will consume memory
> for guests with lots of CPUs.

In the absence of data showing the benefit outweighing the obvious
detriment, I lean towards finding this option the least favorable.

> - go with the proposed logic of min(4,vcpus) and accept that future compat handling
> is harder

IMHO not a bad option, but I think I would still feel better about a more
informed decision. In the end, the end user can already specify num_queues
explicitly, so I don't think this is urgent.

> - defer this change

So I tend to lean towards deferring.

Another thought: provided the load is spread about evenly across the
different virtqueues, and if the game is about vCPU locality, one could think
of decreasing the size of each individual virtqueue while increasing their
number, with the idea of not paying much more in terms of memory. The queue
size, however, needs to be a power of 2, so there is a limit on the
granularity.
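
Back-of-the-envelope, using the split-ring layout from the virtio spec
(16*N bytes of descriptors, plus 6 + 2*N for the avail ring, plus 6 + 8*N
for the used ring, ignoring alignment padding): one 256-entry queue comes to
about 6668 bytes, while four 64-entry queues come to about 4 * 1676 = 6704
bytes, so the memory cost would stay roughly flat.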

Regarding secure VMs, currently we have to cram at least the swiotlb and
the virtqueues into ZONE_DMA. So increasing the number of virtqueues heavily
may get us into new trouble with exotic configs.

Regards,
Halil
Christian Borntraeger Nov. 10, 2020, 1:18 p.m. UTC | #7
On 10.11.20 11:40, Halil Pasic wrote:
> On Tue, 10 Nov 2020 09:47:51 +0100
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>>
>>
>> On 09.11.20 19:53, Halil Pasic wrote:
>>> On Mon, 9 Nov 2020 17:06:16 +0100
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>
>>>>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>>>>>  {
>>>>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>>>>>      DeviceState *vdev = DEVICE(&dev->vdev);
>>>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
>>>>> +
>>>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
>>>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
>>>>> +    }  
>>>>
>>>> I would like to have a comment explaining the numbers here, however.
>>>>
>>>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
>>>> possible, apply some other capping). 4 seems to be a bit arbitrary
>>>> without explanation, although I'm sure you did some measurements :)
>>>
>>> Frankly, I don't have any measurements yet. For the secure case,
>>> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
>>> the cc list. @Mimu can you help us out.
>>>
>>> Regarding the normal non-protected VMs I'm in a middle of producing some
>>> measurement data. This was admittedly a bit rushed because of where we
>>> are in the cycle. Sorry to disappoint you.
>>>
>>> The number 4 was suggested by Christian, maybe Christian does have some
>>> readily available measurement data for the normal VM case. @Christian:
>>> can you help me out?
>> My point was to find a balance between performance gain and memory usage.
>> As a matter of fact, virtqueue do consume memory. So 4 looked like a
>> reasonable default for me for large guests as long as we do not have directed
>> interrupts.
>>
>> Now, thinking about this again: If we want to change the default to something
>> else in the future (e.g. to num vcpus) then the compat handling will get
>> really complicated.
> 
> Regarding compat handling, I believe we would need a new property for
> virtio-blk-ccw: something like def_num_queues_max. Then logic would
> morph to MIN(def_num_queues_max, current_machine->smp.cpus), and we could
> relatively freely do compat stuff on def_num_queues_max.
> 
> IMHO not pretty but certainly doable.
> 
>>
>> So we can
>> - go with num queues = num cpus. But this will consume memory
>> for guests with lots of CPUs.
> 
> In absence of data that showcases the benefit outweighing the obvious
> detriment, I lean towards finding this option the least favorable.
> 
>> - go with the proposed logic of min(4,vcpus) and accept that future compat handling
>> is harder
> 
> IMHO not a bad option, but I think I would still feel better about a
> more informed decision. In the end, the end user can already specify the
> num_queues explicitly, so I don't think this is urgent.
> 
>> - defer this change
> 
> So I tend to lean towards deferring.

Yes, I was pushing this for 5.2 to avoid compat handling. But maybe it is
better to wait and do it later. We should certainly continue the discussion
so that we have something ready for the next release.

> 
> Another thought is, provided the load is about evenly spread on the
> different virtqueues, if the game is about vCPU locality, one could
> think of decreasing the size of each individual virtqueue while
> increasing their number, with the idea of not paying much more in terms
> of memory. The queue size however needs to be a power of 2,
> so there is a limit on the granularity.
> 
> Regarding secure VMs, currently we have to cramp at least the swiotlb and
> the virtqueues into ZONE_DMA. So increasing the number of
> virtqueues heavily may get us into new trouble with exotic configs.
> 
> Regards,
> Halil
>
Michael Mueller Nov. 10, 2020, 2:16 p.m. UTC | #8
On 09.11.20 19:53, Halil Pasic wrote:
> On Mon, 9 Nov 2020 17:06:16 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>>>   {
>>>       VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>>>       DeviceState *vdev = DEVICE(&dev->vdev);
>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
>>> +
>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
>>> +    }
>>
>> I would like to have a comment explaining the numbers here, however.
>>
>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
>> possible, apply some other capping). 4 seems to be a bit arbitrary
>> without explanation, although I'm sure you did some measurements :)
> 
> Frankly, I don't have any measurements yet. For the secure case,
> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> the cc list. @Mimu can you help us out.
> 
> Regarding the normal non-protected VMs I'm in a middle of producing some
> measurement data. This was admittedly a bit rushed because of where we
> are in the cycle. Sorry to disappoint you.

I'm talking with the perf team tomorrow. They have done some measurements
with multiqueue for PV guests, and I asked for a comparison to non-PV guests
as well.

Michael

> 
> The number 4 was suggested by Christian, maybe Christian does have some
> readily available measurement data for the normal VM case. @Christian:
> can you help me out?
> 
> Regards,
> Halil
>
Michael Mueller Nov. 11, 2020, 12:26 p.m. UTC | #9
On 10.11.20 15:16, Michael Mueller wrote:
> 
> 
> On 09.11.20 19:53, Halil Pasic wrote:
>> On Mon, 9 Nov 2020 17:06:16 +0100
>> Cornelia Huck <cohuck@redhat.com> wrote:
>>
>>>> @@ -20,6 +21,11 @@ static void 
>>>> virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>>>>   {
>>>>       VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>>>>       DeviceState *vdev = DEVICE(&dev->vdev);
>>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
>>>> +
>>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
>>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
>>>> +    }
>>>
>>> I would like to have a comment explaining the numbers here, however.
>>>
>>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
>>> possible, apply some other capping). 4 seems to be a bit arbitrary
>>> without explanation, although I'm sure you did some measurements :)
>>
>> Frankly, I don't have any measurements yet. For the secure case,
>> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
>> the cc list. @Mimu can you help us out.
>>
>> Regarding the normal non-protected VMs I'm in a middle of producing some
>> measurement data. This was admittedly a bit rushed because of where we
>> are in the cycle. Sorry to disappoint you.
> 
> I'm talking with the perf team tomorrow. They have done some 
> measurements with multiqueue for PV guests and I asked for a comparison 
> to non PV guests as well.

The perf team has performed measurements for us that show that a *PV
KVM guest* benefits in terms of throughput for random read, random write
and sequential read (no difference for sequential write) from a multi-queue
setup. CPU cost is reduced as well, due to reduced spinlock contention.

For a *standard KVM guest* it currently has no throughput effect: no
benefit and no harm. I have asked them to finalize their measurements
by comparing the CPU cost as well. I will receive that information on
Friday.

Michael


> 
> Michael
> 
>>
>> The number 4 was suggested by Christian, maybe Christian does have some
>> readily available measurement data for the normal VM case. @Christian:
>> can you help me out?
>>
>> Regards,
>> Halil
>>
>
Cornelia Huck Nov. 11, 2020, 12:38 p.m. UTC | #10
On Wed, 11 Nov 2020 13:26:11 +0100
Michael Mueller <mimu@linux.ibm.com> wrote:

> On 10.11.20 15:16, Michael Mueller wrote:
> > 
> > 
> > On 09.11.20 19:53, Halil Pasic wrote:  
> >> On Mon, 9 Nov 2020 17:06:16 +0100
> >> Cornelia Huck <cohuck@redhat.com> wrote:
> >>  
> >>>> @@ -20,6 +21,11 @@ static void 
> >>>> virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> >>>>   {
> >>>>       VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> >>>>       DeviceState *vdev = DEVICE(&dev->vdev);
> >>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> >>>> +
> >>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> >>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> >>>> +    }  
> >>>
> >>> I would like to have a comment explaining the numbers here, however.
> >>>
> >>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> >>> possible, apply some other capping). 4 seems to be a bit arbitrary
> >>> without explanation, although I'm sure you did some measurements :)  
> >>
> >> Frankly, I don't have any measurements yet. For the secure case,
> >> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> >> the cc list. @Mimu can you help us out.
> >>
> >> Regarding the normal non-protected VMs I'm in a middle of producing some
> >> measurement data. This was admittedly a bit rushed because of where we
> >> are in the cycle. Sorry to disappoint you.  
> > 
> > I'm talking with the perf team tomorrow. They have done some 
> > measurements with multiqueue for PV guests and I asked for a comparison 
> > to non PV guests as well.  
> 
> The perf team has performed measurements for us that show that a *PV
> KVM guest* benefits in terms of throughput for random read, random write
> and sequential read (no difference for sequential write) by a multi
> queue setup. CPU cost are reduced as well due to reduced spinlock
> contention.

Just to be clear, that was with 4 queues?

> 
> For a *standard KVM guest* it currently has no throughput effect. No
> benefit and no harm. I have asked them to finalize their measurements
> by comparing the CPU cost as well. I will receive that information on 
> Friday.

Thank you for checking!

> 
> Michael
> 
> 
> > 
> > Michael
> >   
> >>
> >> The number 4 was suggested by Christian, maybe Christian does have some
> >> readily available measurement data for the normal VM case. @Christian:
> >> can you help me out?
> >>
> >> Regards,
> >> Halil
> >>  
> >   
> 
>
Michael Mueller Nov. 11, 2020, 12:49 p.m. UTC | #11
On 11.11.20 13:38, Cornelia Huck wrote:
> On Wed, 11 Nov 2020 13:26:11 +0100
> Michael Mueller <mimu@linux.ibm.com> wrote:
> 
>> On 10.11.20 15:16, Michael Mueller wrote:
>>>
>>>
>>> On 09.11.20 19:53, Halil Pasic wrote:
>>>> On Mon, 9 Nov 2020 17:06:16 +0100
>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>   
>>>>>> @@ -20,6 +21,11 @@ static void
>>>>>> virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
>>>>>>    {
>>>>>>        VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
>>>>>>        DeviceState *vdev = DEVICE(&dev->vdev);
>>>>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
>>>>>> +
>>>>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
>>>>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
>>>>>> +    }
>>>>>
>>>>> I would like to have a comment explaining the numbers here, however.
>>>>>
>>>>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
>>>>> possible, apply some other capping). 4 seems to be a bit arbitrary
>>>>> without explanation, although I'm sure you did some measurements :)
>>>>
>>>> Frankly, I don't have any measurements yet. For the secure case,
>>>> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
>>>> the cc list. @Mimu can you help us out.
>>>>
>>>> Regarding the normal non-protected VMs I'm in a middle of producing some
>>>> measurement data. This was admittedly a bit rushed because of where we
>>>> are in the cycle. Sorry to disappoint you.
>>>
>>> I'm talking with the perf team tomorrow. They have done some
>>> measurements with multiqueue for PV guests and I asked for a comparison
>>> to non PV guests as well.
>>
>> The perf team has performed measurements for us that show that a *PV
>> KVM guest* benefits in terms of throughput for random read, random write
>> and sequential read (no difference for sequential write) by a multi
>> queue setup. CPU cost are reduced as well due to reduced spinlock
>> contention.
> 
> Just to be clear, that was with 4 queues?

Yes, we have seen it with 4 and also with 9 queues.

Halil,

still I would like to know what the exact memory consumption per queue
is that you are talking about. Have you made a calculation? Thanks.

> 
>>
>> For a *standard KVM guest* it currently has no throughput effect. No
>> benefit and no harm. I have asked them to finalize their measurements
>> by comparing the CPU cost as well. I will receive that information on
>> Friday.
> 
> Thank you for checking!
> 
>>
>> Michael
>>
>>
>>>
>>> Michael
>>>    
>>>>
>>>> The number 4 was suggested by Christian, maybe Christian does have some
>>>> readily available measurement data for the normal VM case. @Christian:
>>>> can you help me out?
>>>>
>>>> Regards,
>>>> Halil
>>>>   
>>>    
>>
>>
>
Halil Pasic Nov. 11, 2020, 3:16 p.m. UTC | #12
On Wed, 11 Nov 2020 13:38:15 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 11 Nov 2020 13:26:11 +0100
> Michael Mueller <mimu@linux.ibm.com> wrote:
> 
> > On 10.11.20 15:16, Michael Mueller wrote:  
> > > 
> > > 
> > > On 09.11.20 19:53, Halil Pasic wrote:    
> > >> On Mon, 9 Nov 2020 17:06:16 +0100
> > >> Cornelia Huck <cohuck@redhat.com> wrote:
> > >>    
> > >>>> @@ -20,6 +21,11 @@ static void 
> > >>>> virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> > >>>>   {
> > >>>>       VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> > >>>>       DeviceState *vdev = DEVICE(&dev->vdev);
> > >>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> > >>>> +
> > >>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> > >>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> > >>>> +    }    
> > >>>
> > >>> I would like to have a comment explaining the numbers here, however.
> > >>>
> > >>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> > >>> possible, apply some other capping). 4 seems to be a bit arbitrary
> > >>> without explanation, although I'm sure you did some measurements :)    
> > >>
> > >> Frankly, I don't have any measurements yet. For the secure case,
> > >> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> > >> the cc list. @Mimu can you help us out.
> > >>
> > >> Regarding the normal non-protected VMs I'm in a middle of producing some
> > >> measurement data. This was admittedly a bit rushed because of where we
> > >> are in the cycle. Sorry to disappoint you.    
> > > 
> > > I'm talking with the perf team tomorrow. They have done some 
> > > measurements with multiqueue for PV guests and I asked for a comparison 
> > > to non PV guests as well.    
> > 
> > The perf team has performed measurements for us that show that a *PV
> > KVM guest* benefits in terms of throughput for random read, random write
> > and sequential read (no difference for sequential write) by a multi
> > queue setup. CPU cost are reduced as well due to reduced spinlock
> > contention.  
> 
> Just to be clear, that was with 4 queues?
> 
> > 
> > For a *standard KVM guest* it currently has no throughput effect. No
> > benefit and no harm. I have asked them to finalize their measurements
> > by comparing the CPU cost as well. I will receive that information on 
> > Friday.  
> 
> Thank you for checking!

The results of my measurements (normal case only) are consistent with
these findings.

My setup looks like this: A guest with 6 vcpus and an attached
virtio-blk-ccw disk backed by a raw image on a tmpfs (i.e. backed by
ram, because we are interested in virtio and not in the scsi disk). The
performance was evaluated with fio, randrw, queuedepth=1 and bs=1M. I
scaled the number of virtqueues from 1 to 5, and collected 30 data points
each.

The full fio command line I used is at the end of this mail.

For a nicer table, please see the attached png. The difference in averages
is small, about 1.2 percent.

The percentages below are with respect to the average over all queue counts.

queues   write_iops   write_bw   read_iops   read_bw
1        99.45%       99.45%     99.44%      99.44%
2        99.93%       99.93%     99.92%      99.92%
3        100.02%      100.02%    100.02%     100.02%
4        100.64%      100.64%    100.64%     100.64%
5        99.97%       99.97%     99.97%      99.97%
total    100.00%      100.00%    100.00%     100.00%

(each cell is the average over the 30 data points for that queue count)

fio --ramp_time=30s --output-format=json --bs=1M --ioengine=libaio --readwrite=randrw --runtime=120s --size=1948m --name=measurement --gtod_reduce=1 --direct=1 --iodepth=1 --filename=/dev/vda  --time_based
Halil Pasic Nov. 12, 2020, 1:31 p.m. UTC | #13
On Wed, 11 Nov 2020 13:49:08 +0100
Michael Mueller <mimu@linux.ibm.com> wrote:

> Halil,
> 
> still I would like to know what the exact memory consumption per queue
> is that you are talking about. Have you made a calculation? Thanks.

Hi! 

The default size for virtio-blk seems to be 256 ring entries, which
translates to 6668 bytes for the split ring(s). The queue size is
user-configurable, and the guest, in theory, can decide to use a smaller
queue. The indirect descriptors are allocated separately, and bounced via
swiotlb in the case of secure guests.
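
For the curious, the 6668 bytes follow directly from the split-ring layout
in the virtio spec, ignoring alignment padding between the three areas; a
minimal sketch of the arithmetic:

    #include <stdio.h>

    /* split virtqueue size for n entries, per the virtio 1.1 layout,
     * ignoring alignment padding between the three areas */
    static unsigned split_ring_bytes(unsigned n)
    {
        unsigned desc  = 16 * n;     /* descriptor table: 16 bytes per entry */
        unsigned avail = 6 + 2 * n;  /* avail ring: flags, idx, ring[], used_event */
        unsigned used  = 6 + 8 * n;  /* used ring: flags, idx, ring[], avail_event */
        return desc + avail + used;
    }

    int main(void)
    {
        printf("%u\n", split_ring_bytes(256));  /* prints 6668 */
        return 0;
    }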

Regards,
Halil
Cornelia Huck Dec. 15, 2020, 8:26 a.m. UTC | #14
On Tue, 10 Nov 2020 14:18:39 +0100
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 10.11.20 11:40, Halil Pasic wrote:
> > On Tue, 10 Nov 2020 09:47:51 +0100
> > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> >   
> >>
> >>
> >> On 09.11.20 19:53, Halil Pasic wrote:  
> >>> On Mon, 9 Nov 2020 17:06:16 +0100
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>  
> >>>>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> >>>>>  {
> >>>>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> >>>>>      DeviceState *vdev = DEVICE(&dev->vdev);
> >>>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> >>>>> +
> >>>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> >>>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> >>>>> +    }    
> >>>>
> >>>> I would like to have a comment explaining the numbers here, however.
> >>>>
> >>>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> >>>> possible, apply some other capping). 4 seems to be a bit arbitrary
> >>>> without explanation, although I'm sure you did some measurements :)  
> >>>
> >>> Frankly, I don't have any measurements yet. For the secure case,
> >>> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> >>> the cc list. @Mimu can you help us out.
> >>>
> >>> Regarding the normal non-protected VMs I'm in a middle of producing some
> >>> measurement data. This was admittedly a bit rushed because of where we
> >>> are in the cycle. Sorry to disappoint you.
> >>>
> >>> The number 4 was suggested by Christian, maybe Christian does have some
> >>> readily available measurement data for the normal VM case. @Christian:
> >>> can you help me out?  
> >> My point was to find a balance between performance gain and memory usage.
> >> As a matter of fact, virtqueue do consume memory. So 4 looked like a
> >> reasonable default for me for large guests as long as we do not have directed
> >> interrupts.
> >>
> >> Now, thinking about this again: If we want to change the default to something
> >> else in the future (e.g. to num vcpus) then the compat handling will get
> >> really complicated.  
> > 
> > Regarding compat handling, I believe we would need a new property for
> > virtio-blk-ccw: something like def_num_queues_max. Then logic would
> > morph to MIN(def_num_queues_max, current_machine->smp.cpus), and we could
> > relatively freely do compat stuff on def_num_queues_max.
> > 
> > IMHO not pretty but certainly doable.
> >   
> >>
> >> So we can
> >> - go with num queues = num cpus. But this will consume memory
> >> for guests with lots of CPUs.  
> > 
> > In absence of data that showcases the benefit outweighing the obvious
> > detriment, I lean towards finding this option the least favorable.
> >   
> >> - go with the proposed logic of min(4,vcpus) and accept that future compat handling
> >> is harder  
> > 
> > IMHO not a bad option, but I think I would still feel better about a
> > more informed decision. In the end, the end user can already specify the
> > num_queues explicitly, so I don't think this is urgent.
> >   
> >> - defer this change  
> > 
> > So I tend to lean towards deferring.  
> 
> Yes, I was pushing this for 5.2 to avoid compat handling. But maybe it is better
> to wait and do it later. But we should certainly continue the discussion to have
> something for the next release.

<going through older mails>

Do we have a better idea now about which values would make sense here?

> 
> > 
> > Another thought is, provided the load is about evenly spread on the
> > different virtqueues, if the game is about vCPU locality, one could
> > think of decreasing the size of each individual virtqueue while
> > increasing their number, with the idea of not paying much more in terms
> > of memory. The queue size however needs to be a power of 2,
> > so there is a limit on the granularity.
> > 
> > Regarding secure VMs, currently we have to cramp at least the swiotlb and
> > the virtqueues into ZONE_DMA. So increasing the number of
> > virtqueues heavily may get us into new trouble with exotic configs.
> > 
> > Regards,
> > Halil
> >   
>
Halil Pasic Dec. 15, 2020, 11:33 a.m. UTC | #15
On Tue, 15 Dec 2020 09:26:56 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 10 Nov 2020 14:18:39 +0100
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> > On 10.11.20 11:40, Halil Pasic wrote:
> > > On Tue, 10 Nov 2020 09:47:51 +0100
> > > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> > >   
> > >>
> > >>
> > >> On 09.11.20 19:53, Halil Pasic wrote:  
> > >>> On Mon, 9 Nov 2020 17:06:16 +0100
> > >>> Cornelia Huck <cohuck@redhat.com> wrote:
> > >>>  
> > >>>>> @@ -20,6 +21,11 @@ static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
> > >>>>>  {
> > >>>>>      VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
> > >>>>>      DeviceState *vdev = DEVICE(&dev->vdev);
> > >>>>> +    VirtIOBlkConf *conf = &dev->vdev.conf;
> > >>>>> +
> > >>>>> +    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
> > >>>>> +        conf->num_queues = MIN(4, current_machine->smp.cpus);
> > >>>>> +    }    
> > >>>>
> > >>>> I would like to have a comment explaining the numbers here, however.
> > >>>>
> > >>>> virtio-pci has a pretty good explanation (use 1:1 for vqs:vcpus if
> > >>>> possible, apply some other capping). 4 seems to be a bit arbitrary
> > >>>> without explanation, although I'm sure you did some measurements :)  
> > >>>
> > >>> Frankly, I don't have any measurements yet. For the secure case,
> > >>> I think Mimu has assessed the impact of multiqueue, hence adding Mimu to
> > >>> the cc list. @Mimu can you help us out.
> > >>>
> > >>> Regarding the normal non-protected VMs I'm in a middle of producing some
> > >>> measurement data. This was admittedly a bit rushed because of where we
> > >>> are in the cycle. Sorry to disappoint you.
> > >>>
> > >>> The number 4 was suggested by Christian, maybe Christian does have some
> > >>> readily available measurement data for the normal VM case. @Christian:
> > >>> can you help me out?  
> > >> My point was to find a balance between performance gain and memory usage.
> > >> As a matter of fact, virtqueue do consume memory. So 4 looked like a
> > >> reasonable default for me for large guests as long as we do not have directed
> > >> interrupts.
> > >>
> > >> Now, thinking about this again: If we want to change the default to something
> > >> else in the future (e.g. to num vcpus) then the compat handling will get
> > >> really complicated.  
> > > 
> > > Regarding compat handling, I believe we would need a new property for
> > > virtio-blk-ccw: something like def_num_queues_max. Then logic would
> > > morph to MIN(def_num_queues_max, current_machine->smp.cpus), and we could
> > > relatively freely do compat stuff on def_num_queues_max.
> > > 
> > > IMHO not pretty but certainly doable.
> > >   
> > >>
> > >> So we can
> > >> - go with num queues = num cpus. But this will consume memory
> > >> for guests with lots of CPUs.  
> > > 
> > > In absence of data that showcases the benefit outweighing the obvious
> > > detriment, I lean towards finding this option the least favorable.
> > >   
> > >> - go with the proposed logic of min(4,vcpus) and accept that future compat handling
> > >> is harder  
> > > 
> > > IMHO not a bad option, but I think I would still feel better about a
> > > more informed decision. In the end, the end user can already specify the
> > > num_queues explicitly, so I don't think this is urgent.
> > >   
> > >> - defer this change  
> > > 
> > > So I tend to lean towards deferring.  
> > 
> > Yes, I was pushing this for 5.2 to avoid compat handling. But maybe it is better
> > to wait and do it later. But we should certainly continue the discussion to have
> > something for the next release.
> 
> <going through older mails>
> 
> Do we have a better idea now about which values would make sense here?
> 

Hi Conny!

I have nothing new since then. Capping at 4 queues still looks like a
reasonable compromise to me.  @Mimu: anything new since then?

If you like, I can spin a new version (we need compat handling now).

Halil
Cornelia Huck Dec. 15, 2020, 5:27 p.m. UTC | #16
On Tue, 15 Dec 2020 12:33:34 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Tue, 15 Dec 2020 09:26:56 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:

> > Do we have a better idea now about which values would make sense here?
> >   
> 
> Hi Conny!
> 
> I have nothing new since then. Capping at 4 queues still looks like a
> reasonable compromise to me.  @Mimu: anything new since then?
> 
> If you like I can spin a new version (we need compat handling now).
> 
> Halil
> 

If your tests show that 4 queues are a reasonable value, this is fine
by me (with an explanatory comment in the code).

Feel free to respin.

Patch

diff --git a/hw/s390x/virtio-ccw-blk.c b/hw/s390x/virtio-ccw-blk.c
index 2294ce1ce4..7296140dde 100644
--- a/hw/s390x/virtio-ccw-blk.c
+++ b/hw/s390x/virtio-ccw-blk.c
@@ -10,6 +10,7 @@ 
  */
 
 #include "qemu/osdep.h"
+#include "hw/boards.h"
 #include "hw/qdev-properties.h"
 #include "hw/virtio/virtio.h"
 #include "qapi/error.h"
@@ -20,6 +21,11 @@  static void virtio_ccw_blk_realize(VirtioCcwDevice *ccw_dev, Error **errp)
 {
     VirtIOBlkCcw *dev = VIRTIO_BLK_CCW(ccw_dev);
     DeviceState *vdev = DEVICE(&dev->vdev);
+    VirtIOBlkConf *conf = &dev->vdev.conf;
+
+    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
+        conf->num_queues = MIN(4, current_machine->smp.cpus);
+    }
 
     qdev_realize(vdev, BUS(&ccw_dev->bus), errp);
 }