Message ID | 20210507165905.748196-1-groug@kaod.org (mailing list archive) |
---|---|
Series | virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci |
On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> a serious slowdown may be observed on setups with a big enough number
> of vCPUs.
>
> Example with a pseries guest on a bi-POWER9 socket system (128 HW threads):
>
>   vCPUs    virtio-scsi    virtio-blk
>
>       1     0m20.922s      0m21.346s
>       2     0m21.230s      0m20.350s
>       4     0m21.761s      0m20.997s
>       8     0m22.770s      0m20.051s
>      16     0m22.038s      0m19.994s
>      32     0m22.928s      0m20.803s
>      64     0m26.583s      0m22.953s
>     128     0m41.273s      0m32.333s
>     256      2m4.727s      1m16.924s
>     384      6m5.563s      3m26.186s
>
> Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> the ioeventfds:
>
>     67.88%  swapper         [kernel.kallsyms]  [k] power_pmu_enable
>      9.47%  qemu-kvm        [kernel.kallsyms]  [k] smp_call_function_single
>      8.64%  qemu-kvm        [kernel.kallsyms]  [k] power_pmu_enable
> =>   2.79%  qemu-kvm        qemu-kvm           [.] memory_region_ioeventfd_before
> =>   2.12%  qemu-kvm        qemu-kvm           [.] address_space_update_ioeventfds
>      0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
>
> address_space_update_ioeventfds() is called when committing an MR
> transaction, i.e. for each ioeventfd with the current code base,
> and it internally loops on all ioeventfds:
>
> static void address_space_update_ioeventfds(AddressSpace *as)
> {
> [...]
>     FOR_EACH_FLAT_RANGE(fr, view) {
>         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
>
> This means that the setup of ioeventfds for these devices has
> quadratic time complexity.
>
> This series simply changes the device models to extend the transaction
> to all virtqueues, as was already done in the generic code with
> 710fccf80d78 ("virtio: improve virtio devices initialization time").
>
> Only virtio-scsi and virtio-blk are covered here, but a similar change
> might also be beneficial to other device types such as vhost-scsi-pci,
> vhost-user-scsi-pci and vhost-user-blk-pci.
>
>   vCPUs    virtio-scsi    virtio-blk
>
>       1     0m21.271s      0m22.076s
>       2     0m20.912s      0m19.716s
>       4     0m20.508s      0m19.310s
>       8     0m21.374s      0m20.273s
>      16     0m21.559s      0m21.374s
>      32     0m22.532s      0m21.271s
>      64     0m26.550s      0m22.007s
>     128     0m29.115s      0m27.446s
>     256     0m44.752s      0m41.004s
>     384      1m2.884s      0m58.023s
>
> This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> which reported the issue for virtio-scsi-pci.
>
> Changes since v1:
> - Add some comments (Stefan)
> - Drop optimization on the error path in patch 2 (Stefan)
>
> Changes since RFC:
>
> As suggested by Stefan, simplify the code by directly beginning and
> committing the memory transaction from the device model, without all
> the virtio-specific proxying code and with no changes needed in the
> memory subsystem.
>
> Greg Kurz (4):
>   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
>   virtio-blk: Configure all host notifiers in a single MR transaction
>   virtio-scsi: Set host notifiers and callbacks separately
>   virtio-scsi: Configure all host notifiers in a single MR transaction
>
>  hw/block/dataplane/virtio-blk.c | 45 ++++++++++++++++++++-
>  hw/scsi/virtio-scsi-dataplane.c | 72 ++++++++++++++++++++++++---------
>  2 files changed, 97 insertions(+), 20 deletions(-)
>
> --
> 2.26.3

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan
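The batching pattern described in the cover letter above boils down to wrapping all host-notifier assignments in one memory-region transaction. The sketch below is an illustration only, not the exact code applied by the series: it assumes QEMU's memory_region_transaction_begin()/commit() and virtio_bus_set_host_notifier() APIs, and the helper name example_assign_host_notifiers() is invented for this example.

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/virtio/virtio-bus.h"

/*
 * Assign one ioeventfd-backed host notifier per virtqueue inside a
 * single memory-region transaction, so the flat-view/ioeventfd scan in
 * address_space_update_ioeventfds() runs once at commit time instead
 * of once per notifier (linear instead of quadratic setup).
 */
static int example_assign_host_notifiers(VirtioBusState *bus, int nvqs)
{
    int i, err = 0;

    memory_region_transaction_begin();
    for (i = 0; i < nvqs; i++) {
        err = virtio_bus_set_host_notifier(bus, i, true);
        if (err < 0) {
            break;
        }
    }
    memory_region_transaction_commit();

    if (err < 0) {
        /* Unwind the notifiers assigned before the failure; the error
         * path is deliberately left unbatched, echoing the v1->v2
         * change discussed in this thread. */
        while (--i >= 0) {
            virtio_bus_set_host_notifier(bus, i, false);
        }
    }
    return err;
}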
On Wed, 12 May 2021 17:05:53 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> > Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> > a serious slowdown may be observed on setups with a big enough number
> > of vCPUs.
> >
> > [...]
>
> Thanks, applied to my block tree:
> https://gitlab.com/stefanha/qemu/commits/block
>

Hi Stefan,

It seems that Michael already merged the previous version of this
patch set with his latest PR:

https://gitlab.com/qemu-project/qemu/-/commit/6005ee07c380cbde44292f5f6c96e7daa70f4f7d

It is thus missing the v1->v2 changes: basically some comments to
clarify the optimization we're doing with the MR transaction, and the
removal of the optimization on an error path.

The optimization on the error path isn't really needed, but it doesn't
hurt either. No need to change that now that the patches are upstream.

I can post a follow-up patch to add the missing comments though. While
here, I'd even add these comments to the generic
virtio_device_*_ioeventfd_impl() calls as well, since they already have
the very same optimization.

Anyway, I guess you can drop the patches from your tree.

Cheers,

--
Greg

> Stefan
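The clarifying comment Greg mentions could look roughly like the following; the wording is invented here for illustration and only shows the shape of the batching already present in the generic virtio_device_*_ioeventfd_impl() helpers, not an actual follow-up patch:

/*
 * Batch all host-notifier (ioeventfd) updates into a single
 * memory-region transaction: address_space_update_ioeventfds() then
 * scans the flat view once at commit time rather than once per
 * virtqueue, keeping device start-up linear in the number of
 * virtqueues.
 */
memory_region_transaction_begin();
/* ... assign or release one host notifier per virtqueue ... */
memory_region_transaction_commit();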
On Mon, May 17, 2021 at 10:32:59AM +0200, Greg Kurz wrote:
> On Wed, 12 May 2021 17:05:53 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> > On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> > > [...]
> >
> > Thanks, applied to my block tree:
> > https://gitlab.com/stefanha/qemu/commits/block
> >
>
> Hi Stefan,
>
> It seems that Michael already merged the previous version of this
> patch set with his latest PR:
>
> https://gitlab.com/qemu-project/qemu/-/commit/6005ee07c380cbde44292f5f6c96e7daa70f4f7d
>
> It is thus missing the v1->v2 changes: basically some comments to
> clarify the optimization we're doing with the MR transaction, and the
> removal of the optimization on an error path.
>
> The optimization on the error path isn't really needed, but it doesn't
> hurt either. No need to change that now that the patches are upstream.
>
> I can post a follow-up patch to add the missing comments though. While
> here, I'd even add these comments to the generic
> virtio_device_*_ioeventfd_impl() calls as well, since they already have
> the very same optimization.
>
> Anyway, I guess you can drop the patches from your tree.

Okay, I have dropped them.

Stefan
On Mon, May 17, 2021 at 10:32:59AM +0200, Greg Kurz wrote:
> On Wed, 12 May 2021 17:05:53 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> > On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> > > [...]
> >
> > Thanks, applied to my block tree:
> > https://gitlab.com/stefanha/qemu/commits/block
> >
>
> Hi Stefan,
>
> It seems that Michael already merged the previous version of this
> patch set with his latest PR:
>
> https://gitlab.com/qemu-project/qemu/-/commit/6005ee07c380cbde44292f5f6c96e7daa70f4f7d
>
> It is thus missing the v1->v2 changes: basically some comments to
> clarify the optimization we're doing with the MR transaction, and the
> removal of the optimization on an error path.
>
> The optimization on the error path isn't really needed, but it doesn't
> hurt either. No need to change that now that the patches are upstream.
>
> I can post a follow-up patch to add the missing comments though. While
> here, I'd even add these comments to the generic
> virtio_device_*_ioeventfd_impl() calls as well, since they already have
> the very same optimization.

Yes, please post patches on top.

> Anyway, I guess you can drop the patches from your tree.
>
> Cheers,
>
> --
> Greg
>
> > Stefan
>