Message ID: 20220811153739.3079672-3-fanjinhao21s@ict.ac.cn (mailing list archive)
State: New, archived
Series: hw/nvme: add irqfd support
On Aug 11 23:37, Jinhao Fan wrote:
> When the new option 'irq-eventfd' is turned on, the IO emulation code
> signals an eventfd when it wants to (de)assert an irq. The main loop
> eventfd handler does the actual irq (de)assertion. This paves the way
> for iothread support since QEMU's interrupt emulation is not thread
> safe.
>
> Asserting and deasserting irq with eventfd has some performance
> implications. For small queue depth it increases request latency but
> for large queue depth it effectively coalesces irqs.
>
> Comparison (KIOPS):
>
> QD          1    4    16   64
> QEMU        38   123  210  329
> irq-eventfd 32   106  240  364
>
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h |  4 +++
>  2 files changed, 90 insertions(+), 3 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index bd3350d7e0..8a1c5ce3e1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -7675,6 +7757,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
>      DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
>      DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
> +    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),

This option does not seem to change anything - the value is never used ;)
at 7:20 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> This option does not seem to change anything - the value is never used
> ;)

What a stupid mistake. I'll fix this in the next version.
On Aug 11 23:37, Jinhao Fan wrote:
> When the new option 'irq-eventfd' is turned on, the IO emulation code
> signals an eventfd when it wants to (de)assert an irq. The main loop
> eventfd handler does the actual irq (de)assertion. This paves the way
> for iothread support since QEMU's interrupt emulation is not thread
> safe.
>
> Asserting and deasserting irq with eventfd has some performance
> implications. For small queue depth it increases request latency but
> for large queue depth it effectively coalesces irqs.
>
> Comparison (KIOPS):
>
> QD          1    4    16   64
> QEMU        38   123  210  329
> irq-eventfd 32   106  240  364
>
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h |  4 +++
>  2 files changed, 90 insertions(+), 3 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index bd3350d7e0..8a1c5ce3e1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1338,6 +1338,54 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
>      trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
>  }
>
> +static void nvme_assert_notifier_read(EventNotifier *e)
> +{
> +    NvmeCQueue *cq = container_of(e, NvmeCQueue, assert_notifier);
> +    if (event_notifier_test_and_clear(e)) {
> +        nvme_irq_assert(cq->ctrl, cq);
> +    }
> +}
> +
> +static void nvme_deassert_notifier_read(EventNotifier *e)
> +{
> +    NvmeCQueue *cq = container_of(e, NvmeCQueue, deassert_notifier);
> +    if (event_notifier_test_and_clear(e)) {
> +        nvme_irq_deassert(cq->ctrl, cq);
> +    }
> +}
> +
> +static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
> +{
> +    int ret;
> +
> +    ret = event_notifier_init(&cq->assert_notifier, 0);
> +    if (ret < 0) {
> +        goto fail_assert_handler;
> +    }
> +
> +    event_notifier_set_handler(&cq->assert_notifier,
> +                               nvme_assert_notifier_read);
> +
> +    if (!msix_enabled(&n->parent_obj)) {
> +        ret = event_notifier_init(&cq->deassert_notifier, 0);
> +        if (ret < 0) {
> +            goto fail_deassert_handler;
> +        }
> +
> +        event_notifier_set_handler(&cq->deassert_notifier,
> +                                   nvme_deassert_notifier_read);
> +    }
> +
> +    return;
> +
> +fail_deassert_handler:
> +    event_notifier_set_handler(&cq->deassert_notifier, NULL);
> +    event_notifier_cleanup(&cq->deassert_notifier);
> +fail_assert_handler:
> +    event_notifier_set_handler(&cq->assert_notifier, NULL);
> +    event_notifier_cleanup(&cq->assert_notifier);
> +}
> +
>  static void nvme_post_cqes(void *opaque)
>  {
>      NvmeCQueue *cq = opaque;
> @@ -1382,7 +1430,23 @@ static void nvme_post_cqes(void *opaque)
>              n->cq_pending++;
>          }
>
> -        nvme_irq_assert(n, cq);
> +        if (unlikely(cq->first_io_cqe)) {
> +            /*
> +             * Initialize the event notifier when the first cqe is posted.
> +             * For irqfd support we need to register the MSI message in
> +             * KVM. We cannot do this registration at CQ creation time
> +             * because Linux's NVMe driver changes the MSI message after
> +             * CQ creation.
> +             */
> +            cq->first_io_cqe = false;
> +
> +            nvme_init_irq_notifier(n, cq);
> +        }

It is really unfortunate that we have to do this. From what I can tell
in the kernel driver, even if it were changed to set up the irq prior to
creating the completion queue, we'd still have issues making this work
on earlier versions, and there is no way to quirk our way out of this.
We can't even move this to the creation of the submission queue, since
the kernel also creates *that* prior to allocating the interrupt line.

In conclusion, I don't see any way around this other than asking the
NVMe TWG to add some kind of bit indicating that the host sets up the
interrupt line prior to creating the cq. Meh.

> +
> +        if (cq->assert_notifier.initialized) {
> +            event_notifier_set(&cq->assert_notifier);
> +        } else {
> +            nvme_irq_assert(n, cq);
> +        }

There is a lot of duplication below, checking if the notifier is
initialized and then choosing what to do. Can we move this into
nvme_irq_assert/deassert()?
>          }
>      }
>  }
> @@ -4249,7 +4313,11 @@ static void nvme_cq_notifier(EventNotifier *e)
>      if (cq->irq_enabled && cq->tail == cq->head) {
>          n->cq_pending--;
>          if (!msix_enabled(&n->parent_obj)) {
> -            nvme_irq_deassert(n, cq);
> +            if (cq->deassert_notifier.initialized) {
> +                event_notifier_set(&cq->deassert_notifier);
> +            } else {
> +                nvme_irq_deassert(n, cq);
> +            }
>          }
>      }
>
> @@ -4706,6 +4774,14 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>          event_notifier_set_handler(&cq->notifier, NULL);
>          event_notifier_cleanup(&cq->notifier);
>      }
> +    if (cq->assert_notifier.initialized) {
> +        event_notifier_set_handler(&cq->assert_notifier, NULL);
> +        event_notifier_cleanup(&cq->assert_notifier);
> +    }
> +    if (cq->deassert_notifier.initialized) {
> +        event_notifier_set_handler(&cq->deassert_notifier, NULL);
> +        event_notifier_cleanup(&cq->deassert_notifier);
> +    }
>      if (msix_enabled(&n->parent_obj)) {
>          msix_vector_unuse(&n->parent_obj, cq->vector);
>      }
> @@ -4737,6 +4813,7 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
>      }
>
>      if (!msix_enabled(&n->parent_obj)) {
> +        /* Do not use eventfd since this is always called in main loop */
>          nvme_irq_deassert(n, cq);
>      }
>  }
> @@ -4777,6 +4854,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>      }
>      n->cq[cqid] = cq;
>      cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> +    cq->first_io_cqe = cqid != 0;
>  }
>
>  static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
> @@ -6926,7 +7004,11 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          if (cq->irq_enabled && cq->tail == cq->head) {
>              n->cq_pending--;
>              if (!msix_enabled(&n->parent_obj)) {
> -                nvme_irq_deassert(n, cq);
> +                if (cq->deassert_notifier.initialized) {
> +                    event_notifier_set(&cq->deassert_notifier);
> +                } else {
> +                    nvme_irq_deassert(n, cq);
> +                }
>              }
>          }
>      } else {
> @@ -7675,6 +7757,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
>      DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
>      DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
> +    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),
>      DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
>      DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
>                       params.auto_transition_zones, true),
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 79f5c281c2..759d0ecd7c 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -398,6 +398,9 @@ typedef struct NvmeCQueue {
>      uint64_t ei_addr;
>      QEMUTimer *timer;
>      EventNotifier notifier;
> +    EventNotifier assert_notifier;
> +    EventNotifier deassert_notifier;
> +    bool first_io_cqe;
>      bool ioeventfd_enabled;
>      QTAILQ_HEAD(, NvmeSQueue) sq_list;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
> @@ -422,6 +425,7 @@ typedef struct NvmeParams {
>      bool auto_transition_zones;
>      bool legacy_cmb;
>      bool ioeventfd;
> +    bool irq_eventfd;
>      uint8_t sriov_max_vfs;
>      uint16_t sriov_vq_flexible;
>      uint16_t sriov_vi_flexible;
> --
> 2.25.1
>