
[v4] hw/nvme: Use ioeventfd to handle doorbell updates

Message ID 20220705142403.101539-1-fanjinhao21s@ict.ac.cn (mailing list archive)
State New, archived
Series [v4] hw/nvme: Use ioeventfd to handle doorbell updates

Commit Message

Jinhao Fan July 5, 2022, 2:24 p.m. UTC
Add property "ioeventfd" which is enabled by default. When this is
enabled, updates on the doorbell registers will cause KVM to signal
an event to the QEMU main loop to handle the doorbell updates.
Therefore, instead of letting the vcpu thread run both guest VM and
IO emulation, we now use the main loop thread to do IO emulation and
thus the vcpu thread has more cycles for the guest VM.

Since ioeventfd does not tell us the exact value that is written, it is
only useful when shadow doorbell buffer is enabled, where we check
for the value in the shadow doorbell buffer when we get the doorbell
update event.
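
Concretely, the notifier does not see the written value; it just re-reads the
tail from the shadow doorbell buffer. Roughly (paraphrased from the shadow
doorbell code, not a literal hunk from this patch):

    static void nvme_update_sq_tail(NvmeSQueue *sq)
    {
        /* sq->db_addr points at this SQ's entry in the guest's shadow
         * doorbell buffer, set up by the Doorbell Buffer Config command */
        pci_dma_read(&sq->ctrl->parent_obj, sq->db_addr, &sq->tail,
                     sizeof(sq->tail));
    }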

IOPS comparison on Linux 5.19-rc2: (Unit: KIOPS)

qd           1   4  16  64
qemu        35 121 176 153
ioeventfd   41 133 258 313

Changes since v3:
 - Do not deregister ioeventfd when it was not enabled on a SQ/CQ

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/nvme.h |   5 +++
 2 files changed, 118 insertions(+), 1 deletion(-)

Comments

Klaus Jensen July 5, 2022, 5:11 p.m. UTC | #1
On Jul  5 22:24, Jinhao Fan wrote:
> Add property "ioeventfd" which is enabled by default. When this is
> enabled, updates on the doorbell registers will cause KVM to signal
> an event to the QEMU main loop to handle the doorbell updates.
> Therefore, instead of letting the vcpu thread run both guest VM and
> IO emulation, we now use the main loop thread to do IO emulation and
> thus the vcpu thread has more cycles for the guest VM.
> 

This is not entirely accurate.

Yes, the ioeventfd causes the doorbell write to be handled by the main
iothread, but previously we did not do any substantial device emulation
in the vcpu thread either. That is the reason why we only handle the
bare minimum of the doorbell write and then defer any work until the
timer fires on the main iothread.

But with this patch we just go ahead and do the work (nvme_process_sq)
immediately since we are already in the main iothread.
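
For reference, the pre-existing path does roughly this (paraphrased) on an SQ
tail doorbell write in nvme_process_db():

    /* vcpu thread, BQL held: record the new tail and defer the real work */
    sq->tail = new_tail;
    timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);

so nvme_process_sq() only runs ~500ns later, on the main iothread.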

> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index c952c34f94..4b75c5f549 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  
>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +
> +    if (req->sq->ioeventfd_enabled) {
> +        /* Post CQE directly since we are in main loop thread */
> +        nvme_post_cqes(cq);
> +    } else {
> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +    }

Actually, we are only in the vcpu thread if we come here from
nvme_process_db that in very rare circumstances may enqueue the
completion of an AER due to an invalid doorbell write.

In general, nvme_enqueue_req_completion is only ever called from the
main iothread. Which actually causes me to wonder why we defer this work
in the first place. It does have the benefit that we queue up several
completions before posting them in one go and raising the interrupt.
But I wonder if that could be handled better.

Anyway, it doesn't affect the correctness of your patch - just an
observation that we should possibly look into.


LGTM,

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Keith Busch July 5, 2022, 6:43 p.m. UTC | #2
On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
> On Jul  5 22:24, Jinhao Fan wrote:
> > @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
> >  
> >      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
> >      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> > -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> > +
> > +    if (req->sq->ioeventfd_enabled) {
> > +        /* Post CQE directly since we are in main loop thread */
> > +        nvme_post_cqes(cq);
> > +    } else {
> > +        /* Schedule the timer to post CQE later since we are in vcpu thread */
> > +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> > +    }
> 
> Actually, we are only in the vcpu thread if we come here from
> nvme_process_db that in very rare circumstances may enqueue the
> completion of an AER due to an invalid doorbell write.
> 
> In general, nvme_enqueue_req_completion is only ever called from the
> main iothread. Which actually causes me to wonder why we defer this work
> in the first place. It does have the benefit that we queue up several
> completions before posting them in one go and raising the interrupt.
> But I wonder if that could be handled better.

I think the timer is used because of the cq_full condition. We need to restart
completions when it becomes not full, which requires a doorbell write. Having
everyone from the main iothread use the same timer as the doorbell handler just
ensures single threaded list access.
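
For context, the existing flow (paraphrased from hw/nvme/ctrl.c) is that
nvme_post_cqes() drains req_list but stops when the CQ is full, and the CQ
head doorbell write is what kicks it again:

    static void nvme_post_cqes(void *opaque)
    {
        NvmeCQueue *cq = opaque;
        NvmeRequest *req, *next;

        QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
            if (nvme_cq_full(cq)) {
                break;  /* leave the rest queued until the host consumes CQEs */
            }
            /* ... write the CQE, advance cq->tail, recycle req ... */
        }
        if (cq->tail != cq->head) {
            nvme_irq_assert(cq->ctrl, cq);
        }
    }

    /* in nvme_process_db(), a CQ head doorbell write restarts the drain:   */
    /*     cq->head = new_head;                                             */
    /*     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500); */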
Jinhao Fan July 6, 2022, 10:57 a.m. UTC | #3
at 1:11 AM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Jul  5 22:24, Jinhao Fan wrote:
>> Add property "ioeventfd" which is enabled by default. When this is
>> enabled, updates on the doorbell registers will cause KVM to signal
>> an event to the QEMU main loop to handle the doorbell updates.
>> Therefore, instead of letting the vcpu thread run both guest VM and
>> IO emulation, we now use the main loop thread to do IO emulation and
>> thus the vcpu thread has more cycles for the guest VM.
> 
> This is not entirely accurate.
> 
> Yes, the ioeventfd causes the doorbell write to be handled by the main
> iothread, but previously we did not do any substantial device emulation
> in the vcpu thread either. That is the reason why we only handle the
> bare minimum of the doorbell write and then defer any work until the
> timer fires on the main iothread.
> 
> But with this patch we just go ahead and do the work (nvme_process_sq)
> immediately since we are already in the main iothread.
> 

Thanks for pointing this out. I previously thought the timers were fired in
the vcpu threads. I misunderstood why this optimization works but
accidentally got the code right.
Jinhao Fan July 6, 2022, 11:34 a.m. UTC | #4
at 2:43 AM, Keith Busch <kbusch@kernel.org> wrote:

> On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
>> On Jul  5 22:24, Jinhao Fan wrote:
>>> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>>> 
>>>     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>>>     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>>> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>> +
>>> +    if (req->sq->ioeventfd_enabled) {
>>> +        /* Post CQE directly since we are in main loop thread */
>>> +        nvme_post_cqes(cq);
>>> +    } else {
>>> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
>>> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>> +    }
>> 
>> Actually, we are only in the vcpu thread if we come here from
>> nvme_process_db that in very rare circumstances may enqueue the
>> completion of an AER due to an invalid doorbell write.
>> 
>> In general, nvme_enqueue_req_completion is only ever called from the
>> main iothread. Which actually causes me to wonder why we defer this work
>> in the first place. It does have the benefit that we queue up several
>> completions before posting them in one go and raising the interrupt.
>> But I wonder if that could be handled better.
> 
> I think the timer is used because of the cq_full condition. We need to restart
> completions when it becomes not full, which requires a doorbell write. Having
> everyone from the main iothread use the same timer as the doorbell handler just
> ensures single threaded list access.

Could we let nvme_process_aers register another timer/BH to trigger
nvme_enqueue_req_completion in the iothread? In this way we won't need the
timer_mod in nvme_enqueue_req_completion. We can also avoid some potential
concurrency problems because the CQ is only modified in the iothread.

BTW, is there any reason that we must use timers (not BHs) here? Also, why do
we choose to delay for 500ns?
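
For illustration (a sketch only, not part of the patch; the bh field is made
up), a BH-based deferral would look something like:

    /* at CQ init, instead of timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq): */
    cq->bh = qemu_bh_new(nvme_post_cqes, cq);

    /* wherever we currently do timer_mod(cq->timer, ... + 500): */
    qemu_bh_schedule(cq->bh);   /* nvme_post_cqes() runs in the main loop, no 500ns delay */

    /* at CQ teardown, instead of timer_free(cq->timer): */
    qemu_bh_delete(cq->bh);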
Klaus Jensen July 7, 2022, 5:51 a.m. UTC | #5
On Jul  6 19:34, Jinhao Fan wrote:
> at 2:43 AM, Keith Busch <kbusch@kernel.org> wrote:
> 
> > On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
> >> On Jul  5 22:24, Jinhao Fan wrote:
> >>> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
> >>> 
> >>>     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
> >>>     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> >>> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >>> +
> >>> +    if (req->sq->ioeventfd_enabled) {
> >>> +        /* Post CQE directly since we are in main loop thread */
> >>> +        nvme_post_cqes(cq);
> >>> +    } else {
> >>> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
> >>> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >>> +    }
> >> 
> >> Actually, we are only in the vcpu thread if we come here from
> >> nvme_process_db that in very rare circumstances may enqueue the
> >> completion of an AER due to an invalid doorbell write.
> >> 
> >> In general, nvme_enqueue_req_completion is only ever called from the
> >> main iothread. Which actually causes me to wonder why we defer this work
> >> in the first place. It does have the benefit that we queue up several
> >> completions before posting them in one go and raising the interrupt.
> >> But I wonder if that could be handled better.
> > 
> > I think the timer is used because of the cq_full condition. We need to restart
> > completions when it becomes not full, which requires a doorbell write. Having
> > everyone from the main iothread use the same timer as the doorbell handler just
> > ensures single threaded list access.
> 
> Could we let nvme_process_aers register another timer/BH to trigger
> nvme_enqueue_req_completion in the iothread? In this way we won’t need the
> timer_mod in nvme_enqueue_req_completion.

Yes, we could have process_aers in a timer, which would probably be
preferable in order to limit the amount of work the mmio handler is
doing in that rare case. However, it's such a rare case (only misbehaving
drivers) that it's probably not worth optimizing for.

> We can also avoid some potential currency problems because CQ is only
> modified in the iothread.
> 

There are currently no concurrency problems because of the Big QEMU
Lock. When the mmio handler is running, the vcpu holds the BQL (and
whenever the main iothread is running, it is holding the BQL).

> BTW, are there any reason that we must use timers (not BH) here? Also why do
> we choose to delay for 500ns?

No particular reason. I do not see any reason why this could not be bottom
halves. This will likely change into BHs when we add iothread support
anyway.
Jinhao Fan July 7, 2022, 8:50 a.m. UTC | #6
at 1:51 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Jul  6 19:34, Jinhao Fan wrote:
>> at 2:43 AM, Keith Busch <kbusch@kernel.org> wrote:
>> 
>>> On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
>>>> On Jul  5 22:24, Jinhao Fan wrote:
>>>>> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>>>>> 
>>>>>    QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>>>>>    QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>>>>> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +
>>>>> +    if (req->sq->ioeventfd_enabled) {
>>>>> +        /* Post CQE directly since we are in main loop thread */
>>>>> +        nvme_post_cqes(cq);
>>>>> +    } else {
>>>>> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
>>>>> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +    }
>>>> 
>>>> Actually, we are only in the vcpu thread if we come here from
>>>> nvme_process_db that in very rare circumstances may enqueue the
>>>> completion of an AER due to an invalid doorbell write.
>>>> 
>>>> In general, nvme_enqueue_req_completion is only ever called from the
>>>> main iothread. Which actually causes me to wonder why we defer this work
>>>> in the first place. It does have the benefit that we queue up several
>>>> completions before posting them in one go and raising the interrupt.
>>>> But I wonder if that could be handled better.
>>> 
>>> I think the timer is used because of the cq_full condition. We need to restart
>>> completions when it becomes not full, which requires a doorbell write. Having
>>> everyone from the main iothread use the same timer as the doorbell handler just
>>> ensures single threaded list access.
>> 
>> Could we let nvme_process_aers register another timer/BH to trigger
>> nvme_enqueue_req_completion in the iothread? In this way we won’t need the
>> timer_mod in nvme_enqueue_req_completion.
> 
> Yes, we could have process_aers in a timer. Which would probably be
> preferable in order to limit the amount of work the mmio handler is
> doing in that rare case. However, its such a rare case (only misbehaving
> drivers) that it's probably not worth optimizing for.

I think putting nvme_process_aers in a timer can help make sure
nvme_enqueue_req_completion is only called in an iothread context. Currently
it can be called either in the iothread or in a vcpu thread, so
nvme_enqueue_req_completion has to infer its context - in my patch using the
cq->ioeventfd_enabled flag. This approach is probably error-prone. Honestly,
I am not sure whether cq->ioeventfd_enabled is enough to guarantee we are in
the iothread.

>> We can also avoid some potential currency problems because CQ is only
>> modified in the iothread.
> 
> There are currently no concurrency problems because of the Big QEMU
> Lock. When the mmio handler is running, the vcpu holds the BQL (and
> whenever the main iothread is running, it is holding the BQL).

Will this still hold when we move on to iothreads?

> 
>> BTW, are there any reason that we must use timers (not BH) here? Also why do
>> we choose to delay for 500ns?
> 
> No particular reason. do not see any reason why this could not be bottom
> halfs. This will likely change into bhs when we add iothread support
> anyway.

I will try BH when I start the iothread work.
Jinhao Fan July 9, 2022, 3:06 a.m. UTC | #7
at 10:24 PM, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:

> @@ -5793,6 +5891,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>     uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
>     uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
>     int i;
> +    int ret;
> 

I just noticed this ret is unused. Could you help remove this line when
applying the patch?
Klaus Jensen July 12, 2022, 12:23 p.m. UTC | #8
On Jul  9 11:06, Jinhao Fan wrote:
> at 10:24 PM, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
> 
> > @@ -5793,6 +5891,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
> >     uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
> >     uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
> >     int i;
> > +    int ret;
> > 
> 
> I just noticed this ret is unused. Could you help remove this line when
> applying the patch?

Yes, I noticed it and hot-fixed it ;)
Klaus Jensen July 14, 2022, 5:35 a.m. UTC | #9
On Jul 12 14:23, Klaus Jensen wrote:
> On Jul  9 11:06, Jinhao Fan wrote:
> > at 10:24 PM, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
> > 
> > > @@ -5793,6 +5891,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
> > >     uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
> > >     uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
> > >     int i;
> > > +    int ret;
> > > 
> > 
> > I just noticed this ret is unused. Could you help remove this line when
> > applying the patch?
> 
> Yes, I noticed it and hot-fixed it ;)

Jinhao,

Applied to nvme-next!
Klaus Jensen July 25, 2022, 8:55 p.m. UTC | #10
On Jul  5 22:24, Jinhao Fan wrote:
> Add property "ioeventfd" which is enabled by default. When this is
> enabled, updates on the doorbell registers will cause KVM to signal
> an event to the QEMU main loop to handle the doorbell updates.
> Therefore, instead of letting the vcpu thread run both guest VM and
> IO emulation, we now use the main loop thread to do IO emulation and
> thus the vcpu thread has more cycles for the guest VM.
> 
> Since ioeventfd does not tell us the exact value that is written, it is
> only useful when shadow doorbell buffer is enabled, where we check
> for the value in the shadow doorbell buffer when we get the doorbell
> update event.
> 
> IOPS comparison on Linux 5.19-rc2: (Unit: KIOPS)
> 
> qd           1   4  16  64
> qemu        35 121 176 153
> ioeventfd   41 133 258 313
> 
> Changes since v3:
>  - Do not deregister ioeventfd when it was not enabled on a SQ/CQ
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/nvme/nvme.h |   5 +++
>  2 files changed, 118 insertions(+), 1 deletion(-)
> 

We have a regression following this patch that we need to address.

With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
will do the trick) causes QEMU to hog my host cpu at 100%.

I'm still not sure what causes this. The trace output is a bit
inconclusive still.

I'll keep looking into it.
Jinhao Fan July 26, 2022, 7:35 a.m. UTC | #11
at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:

> 
> We have a regression following this patch that we need to address.
> 
> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> will do the trick) causes QEMU to hog my host cpu at 100%.
> 
> I'm still not sure what causes this. The trace output is a bit
> inconclusive still.
> 
> I'll keep looking into it.

I cannot reproduce this bug. I just started the VM and used `nvme reset
/dev/nvme0`. Did you do anything before the reset?
Klaus Jensen July 26, 2022, 7:41 a.m. UTC | #12
On Jul 26 15:35, Jinhao Fan wrote:
> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> > 
> > We have a regression following this patch that we need to address.
> > 
> > With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > will do the trick) causes QEMU to hog my host cpu at 100%.
> > 
> > I'm still not sure what causes this. The trace output is a bit
> > inconclusive still.
> > 
> > I'll keep looking into it.
> 
> I cannot reproduce this bug. I just start the VM and used `nvme reset
> /dev/nvme0`. Did you do anything before the reset?
> 

Interesting and thanks for checking! Looks like a kernel issue then!

I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
reverting to a stock OS kernel did not produce the bug.
Jinhao Fan July 26, 2022, 7:55 a.m. UTC | #13
at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Jul 26 15:35, Jinhao Fan wrote:
>> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
>> 
>>> We have a regression following this patch that we need to address.
>>> 
>>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
>>> will do the trick) causes QEMU to hog my host cpu at 100%.
>>> 
>>> I'm still not sure what causes this. The trace output is a bit
>>> inconclusive still.
>>> 
>>> I'll keep looking into it.
>> 
>> I cannot reproduce this bug. I just start the VM and used `nvme reset
>> /dev/nvme0`. Did you do anything before the reset?
> 
> Interesting and thanks for checking! Looks like a kernel issue then!
> 
> I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> reverting to a stock OS kernel did not produce the bug.

I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
my machine.
Klaus Jensen July 26, 2022, 9:19 a.m. UTC | #14
On Jul 26 15:55, Jinhao Fan wrote:
> at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> > On Jul 26 15:35, Jinhao Fan wrote:
> >> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> >> 
> >>> We have a regression following this patch that we need to address.
> >>> 
> >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> >>> 
> >>> I'm still not sure what causes this. The trace output is a bit
> >>> inconclusive still.
> >>> 
> >>> I'll keep looking into it.
> >> 
> >> I cannot reproduce this bug. I just start the VM and used `nvme reset
> >> /dev/nvme0`. Did you do anything before the reset?
> > 
> > Interesting and thanks for checking! Looks like a kernel issue then!
> > 
> > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > reverting to a stock OS kernel did not produce the bug.
> 
> I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> my machine.

Interesting. I can reproduce on 5.19-rc4 from torvalds tree. Can you
drop your qemu command line here?

This is mine.

/home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
  -nodefaults \
  -display "none" \
  -machine "q35,accel=kvm,kernel-irqchip=split" \
  -cpu "host" \
  -smp "4" \
  -m "8G" \
  -device "intel-iommu" \
  -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
  -device "virtio-net-pci,netdev=net0" \
  -device "virtio-rng-pci" \
  -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
  -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
  -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
  -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
  -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
  -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
  -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
  -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
  -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
  -serial "mon:stdio" \
  -d "guest_errors" \
  -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
  -trace "pci_nvme*"
Klaus Jensen July 26, 2022, 10:09 a.m. UTC | #15
On Jul 26 11:19, Klaus Jensen wrote:
> On Jul 26 15:55, Jinhao Fan wrote:
> > at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> > 
> > > On Jul 26 15:35, Jinhao Fan wrote:
> > >> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> > >> 
> > >>> We have a regression following this patch that we need to address.
> > >>> 
> > >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> > >>> 
> > >>> I'm still not sure what causes this. The trace output is a bit
> > >>> inconclusive still.
> > >>> 
> > >>> I'll keep looking into it.
> > >> 
> > >> I cannot reproduce this bug. I just start the VM and used `nvme reset
> > >> /dev/nvme0`. Did you do anything before the reset?
> > > 
> > > Interesting and thanks for checking! Looks like a kernel issue then!
> > > 
> > > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > > reverting to a stock OS kernel did not produce the bug.
> > 
> > I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> > my machine.
> 
> Interesting. I can reproduce on 5.19-rc4 from torvalds tree. Can you
> drop your qemu command line here?
> 
> This is mine.
> 
> /home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
>   -nodefaults \
>   -display "none" \
>   -machine "q35,accel=kvm,kernel-irqchip=split" \
>   -cpu "host" \
>   -smp "4" \
>   -m "8G" \
>   -device "intel-iommu" \
>   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
>   -device "virtio-net-pci,netdev=net0" \
>   -device "virtio-rng-pci" \
>   -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
>   -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
>   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
>   -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
>   -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
>   -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
>   -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
>   -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
>   -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
>   -serial "mon:stdio" \
>   -d "guest_errors" \
>   -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
>   -trace "pci_nvme*"

Alright. It was *some* config issue with my kernel. Reverted to a
defconfig + requirements and the issue went away.

I'll try to track down what happened, but it doesn't look like QEMU is at
fault here.
Klaus Jensen July 26, 2022, 11:24 a.m. UTC | #16
On Jul 26 12:09, Klaus Jensen wrote:
> On Jul 26 11:19, Klaus Jensen wrote:
> > On Jul 26 15:55, Jinhao Fan wrote:
> > > at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > 
> > > > On Jul 26 15:35, Jinhao Fan wrote:
> > > >> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > >> 
> > > >>> We have a regression following this patch that we need to address.
> > > >>> 
> > > >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > > >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> > > >>> 
> > > >>> I'm still not sure what causes this. The trace output is a bit
> > > >>> inconclusive still.
> > > >>> 
> > > >>> I'll keep looking into it.
> > > >> 
> > > >> I cannot reproduce this bug. I just start the VM and used `nvme reset
> > > >> /dev/nvme0`. Did you do anything before the reset?
> > > > 
> > > > Interesting and thanks for checking! Looks like a kernel issue then!
> > > > 
> > > > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > > > reverting to a stock OS kernel did not produce the bug.
> > > 
> > > I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> > > my machine.
> > 
> > Interesting. I can reproduce on 5.19-rc4 from torvalds tree. Can you
> > drop your qemu command line here?
> > 
> > This is mine.
> > 
> > /home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
> >   -nodefaults \
> >   -display "none" \
> >   -machine "q35,accel=kvm,kernel-irqchip=split" \
> >   -cpu "host" \
> >   -smp "4" \
> >   -m "8G" \
> >   -device "intel-iommu" \
> >   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
> >   -device "virtio-net-pci,netdev=net0" \
> >   -device "virtio-rng-pci" \
> >   -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
> >   -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
> >   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
> >   -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
> >   -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
> >   -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
> >   -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
> >   -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
> >   -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
> >   -serial "mon:stdio" \
> >   -d "guest_errors" \
> >   -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
> >   -trace "pci_nvme*"
> 
> Alright. It was *some* config issue with my kernel. Reverted to a
> defconfig + requirements and the issue went away.
> 

And it went away because I didn't include iommu support in that kernel (and it's
not enabled by default on the stock OS kernel).

> I'll try to track down what happended, but doesnt look like qemu is at
> fault here.

OK. So.

I can continue to reproduce this if the machine has a virtual intel iommu
enabled. And it only happens when this commit is applied.

I even backported this patch (and the shadow doorbell patch) to v7.0 and v6.2
(i.e. no SRIOV or CC logic changes that could be buggy) and it still exhibits
this behavior. Sometimes QEMU coredumps on poweroff and I managed to grab one:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
5720   NvmeCQueue *cq = n->cq[sq->cqid];
[Current thread is 1 (Thread 0x7f7363553cc0 (LWP 2554896))]
(gdb) bt
#0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
#1  0x0000556326e82e28 in nvme_sq_notifier (e=0x556329708148) at ../hw/nvme/ctrl.c:3993
#2  0x000055632738396a in aio_dispatch_handler (ctx=0x5563291c3160, node=0x55632a228b60) at ../util/aio-posix.c:329
#3  0x0000556327383b22 in aio_dispatch_handlers (ctx=0x5563291c3160) at ../util/aio-posix.c:372
#4  0x0000556327383b78 in aio_dispatch (ctx=0x5563291c3160) at ../util/aio-posix.c:382
#5  0x000055632739d748 in aio_ctx_dispatch (source=0x5563291c3160, callback=0x0, user_data=0x0) at ../util/async.c:311
#6  0x00007f7369398163 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
#7  0x00005563273af279 in glib_pollfds_poll () at ../util/main-loop.c:232
#8  0x00005563273af2f6 in os_host_main_loop_wait (timeout=0x1dbe22c0) at ../util/main-loop.c:255
#9  0x00005563273af404 in main_loop_wait (nonblocking=0x0) at ../util/main-loop.c:531
#10 0x00005563270714d9 in qemu_main_loop () at ../softmmu/runstate.c:726
#11 0x0000556326c7ea46 in main (argc=0x2e, argv=0x7ffc6977f198, envp=0x7ffc6977f310) at ../softmmu/main.c:50

At this point, there should not be any CQ/SQs (I detached the device from the
kernel driver which deletes all queues and bound it to vfio-pci instead), but
somehow a stale notifier is called on poweroff and the queue is bogus, causing
the segfault.

(gdb) p cq->cqid
$2 = 0x7880

My guess would be that we are not cleaning up the notifier properly. Currently
we do this

    if (cq->ioeventfd_enabled) {
        memory_region_del_eventfd(&n->iomem,
                                  0x1000 + offset, 4, false, 0, &cq->notifier);
        event_notifier_cleanup(&cq->notifier);
    }


Any ioeventfd experts who have some insight into what we are doing
wrong here? Is there something we need to flush? I tried with a test_and_clear on
the eventfd, but that didn't do the trick.

I think we'd need to revert this until we can track down what is going wrong.
Klaus Jensen July 26, 2022, 11:32 a.m. UTC | #17
On Jul 26 13:24, Klaus Jensen wrote:
> On Jul 26 12:09, Klaus Jensen wrote:
> > On Jul 26 11:19, Klaus Jensen wrote:
> > > On Jul 26 15:55, Jinhao Fan wrote:
> > > > at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > > 
> > > > > On Jul 26 15:35, Jinhao Fan wrote:
> > > > >> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > > >> 
> > > > >>> We have a regression following this patch that we need to address.
> > > > >>> 
> > > > >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > > > >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> > > > >>> 
> > > > >>> I'm still not sure what causes this. The trace output is a bit
> > > > >>> inconclusive still.
> > > > >>> 
> > > > >>> I'll keep looking into it.
> > > > >> 
> > > > >> I cannot reproduce this bug. I just start the VM and used `nvme reset
> > > > >> /dev/nvme0`. Did you do anything before the reset?
> > > > > 
> > > > > Interesting and thanks for checking! Looks like a kernel issue then!
> > > > > 
> > > > > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > > > > reverting to a stock OS kernel did not produce the bug.
> > > > 
> > > > I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> > > > my machine.
> > > 
> > > Interesting. I can reproduce on 5.19-rc4 from torvalds tree. Can you
> > > drop your qemu command line here?
> > > 
> > > This is mine.
> > > 
> > > /home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
> > >   -nodefaults \
> > >   -display "none" \
> > >   -machine "q35,accel=kvm,kernel-irqchip=split" \
> > >   -cpu "host" \
> > >   -smp "4" \
> > >   -m "8G" \
> > >   -device "intel-iommu" \
> > >   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
> > >   -device "virtio-net-pci,netdev=net0" \
> > >   -device "virtio-rng-pci" \
> > >   -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
> > >   -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
> > >   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
> > >   -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
> > >   -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
> > >   -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
> > >   -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
> > >   -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
> > >   -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
> > >   -serial "mon:stdio" \
> > >   -d "guest_errors" \
> > >   -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
> > >   -trace "pci_nvme*"
> > 
> > Alright. It was *some* config issue with my kernel. Reverted to a
> > defconfig + requirements and the issue went away.
> > 
> 
> And it went away because I didn't include iommu support in that kernel (and its
> not enabled by default on the stock OS kernel).
> 
> > I'll try to track down what happended, but doesnt look like qemu is at
> > fault here.
> 
> OK. So.
> 
> I can continue to reproduce this if the machine has a virtual intel iommu
> enabled. And it only happens when this commit is applied.
> 
> I even backported this patch (and the shadow doorbell patch) to v7.0 and v6.2
> (i.e. no SRIOV or CC logic changes that could be buggy) and it still exhibits
> this behavior. Sometimes QEMU coredumps on poweroff and I managed to grab one:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
> 5720   NvmeCQueue *cq = n->cq[sq->cqid];
> [Current thread is 1 (Thread 0x7f7363553cc0 (LWP 2554896))]
> (gdb) bt
> #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
> #1  0x0000556326e82e28 in nvme_sq_notifier (e=0x556329708148) at ../hw/nvme/ctrl.c:3993
> #2  0x000055632738396a in aio_dispatch_handler (ctx=0x5563291c3160, node=0x55632a228b60) at ../util/aio-posix.c:329
> #3  0x0000556327383b22 in aio_dispatch_handlers (ctx=0x5563291c3160) at ../util/aio-posix.c:372
> #4  0x0000556327383b78 in aio_dispatch (ctx=0x5563291c3160) at ../util/aio-posix.c:382
> #5  0x000055632739d748 in aio_ctx_dispatch (source=0x5563291c3160, callback=0x0, user_data=0x0) at ../util/async.c:311
> #6  0x00007f7369398163 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> #7  0x00005563273af279 in glib_pollfds_poll () at ../util/main-loop.c:232
> #8  0x00005563273af2f6 in os_host_main_loop_wait (timeout=0x1dbe22c0) at ../util/main-loop.c:255
> #9  0x00005563273af404 in main_loop_wait (nonblocking=0x0) at ../util/main-loop.c:531
> #10 0x00005563270714d9 in qemu_main_loop () at ../softmmu/runstate.c:726
> #11 0x0000556326c7ea46 in main (argc=0x2e, argv=0x7ffc6977f198, envp=0x7ffc6977f310) at ../softmmu/main.c:50
> 
> At this point, there should not be any CQ/SQs (I detached the device from the
> kernel driver which deletes all queues and bound it to vfio-pci instead), but
> somehow a stale notifier is called on poweroff and the queue is bogus, causing
> the segfault.
> 
> (gdb) p cq->cqid
> $2 = 0x7880
> 
> My guess would be that we are not cleaning up the notifier properly. Currently
> we do this
> 
>     if (cq->ioeventfd_enabled) {
>         memory_region_del_eventfd(&n->iomem,
>                                   0x1000 + offset, 4, false, 0, &cq->notifier);
>         event_notifier_cleanup(&cq->notifier);
>     }
> 
> 
> Any ioeventfd experts that has some insights into what we are doing
> wrong here? Something we need to flush? I tried with a test_and_clear on
> the eventfd but that didnt do the trick.
> 
> I think we'd need to revert this until we can track down what is going wrong.

One more thing - I now also triggered the coredump with just a `modprobe
vfio-pci` following a `nvme reset /dev/nvme0`.

Similar backtrace.
Klaus Jensen July 26, 2022, 12:08 p.m. UTC | #18
On Jul 26 13:32, Klaus Jensen wrote:
> On Jul 26 13:24, Klaus Jensen wrote:
> > On Jul 26 12:09, Klaus Jensen wrote:
> > > On Jul 26 11:19, Klaus Jensen wrote:
> > > > On Jul 26 15:55, Jinhao Fan wrote:
> > > > > at 3:41 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > > > 
> > > > > > On Jul 26 15:35, Jinhao Fan wrote:
> > > > > >> at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> > > > > >> 
> > > > > >>> We have a regression following this patch that we need to address.
> > > > > >>> 
> > > > > >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > > > > >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> > > > > >>> 
> > > > > >>> I'm still not sure what causes this. The trace output is a bit
> > > > > >>> inconclusive still.
> > > > > >>> 
> > > > > >>> I'll keep looking into it.
> > > > > >> 
> > > > > >> I cannot reproduce this bug. I just start the VM and used `nvme reset
> > > > > >> /dev/nvme0`. Did you do anything before the reset?
> > > > > > 
> > > > > > Interesting and thanks for checking! Looks like a kernel issue then!
> > > > > > 
> > > > > > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > > > > > reverting to a stock OS kernel did not produce the bug.
> > > > > 
> > > > > I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> > > > > my machine.
> > > > 
> > > > Interesting. I can reproduce on 5.19-rc4 from torvalds tree. Can you
> > > > drop your qemu command line here?
> > > > 
> > > > This is mine.
> > > > 
> > > > /home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
> > > >   -nodefaults \
> > > >   -display "none" \
> > > >   -machine "q35,accel=kvm,kernel-irqchip=split" \
> > > >   -cpu "host" \
> > > >   -smp "4" \
> > > >   -m "8G" \
> > > >   -device "intel-iommu" \
> > > >   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
> > > >   -device "virtio-net-pci,netdev=net0" \
> > > >   -device "virtio-rng-pci" \
> > > >   -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
> > > >   -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
> > > >   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
> > > >   -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
> > > >   -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
> > > >   -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
> > > >   -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
> > > >   -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
> > > >   -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
> > > >   -serial "mon:stdio" \
> > > >   -d "guest_errors" \
> > > >   -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
> > > >   -trace "pci_nvme*"
> > > 
> > > Alright. It was *some* config issue with my kernel. Reverted to a
> > > defconfig + requirements and the issue went away.
> > > 
> > 
> > And it went away because I didn't include iommu support in that kernel (and its
> > not enabled by default on the stock OS kernel).
> > 
> > > I'll try to track down what happended, but doesnt look like qemu is at
> > > fault here.
> > 
> > OK. So.
> > 
> > I can continue to reproduce this if the machine has a virtual intel iommu
> > enabled. And it only happens when this commit is applied.
> > 
> > I even backported this patch (and the shadow doorbell patch) to v7.0 and v6.2
> > (i.e. no SRIOV or CC logic changes that could be buggy) and it still exhibits
> > this behavior. Sometimes QEMU coredumps on poweroff and I managed to grab one:
> > 
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
> > 5720   NvmeCQueue *cq = n->cq[sq->cqid];
> > [Current thread is 1 (Thread 0x7f7363553cc0 (LWP 2554896))]
> > (gdb) bt
> > #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
> > #1  0x0000556326e82e28 in nvme_sq_notifier (e=0x556329708148) at ../hw/nvme/ctrl.c:3993
> > #2  0x000055632738396a in aio_dispatch_handler (ctx=0x5563291c3160, node=0x55632a228b60) at ../util/aio-posix.c:329
> > #3  0x0000556327383b22 in aio_dispatch_handlers (ctx=0x5563291c3160) at ../util/aio-posix.c:372
> > #4  0x0000556327383b78 in aio_dispatch (ctx=0x5563291c3160) at ../util/aio-posix.c:382
> > #5  0x000055632739d748 in aio_ctx_dispatch (source=0x5563291c3160, callback=0x0, user_data=0x0) at ../util/async.c:311
> > #6  0x00007f7369398163 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> > #7  0x00005563273af279 in glib_pollfds_poll () at ../util/main-loop.c:232
> > #8  0x00005563273af2f6 in os_host_main_loop_wait (timeout=0x1dbe22c0) at ../util/main-loop.c:255
> > #9  0x00005563273af404 in main_loop_wait (nonblocking=0x0) at ../util/main-loop.c:531
> > #10 0x00005563270714d9 in qemu_main_loop () at ../softmmu/runstate.c:726
> > #11 0x0000556326c7ea46 in main (argc=0x2e, argv=0x7ffc6977f198, envp=0x7ffc6977f310) at ../softmmu/main.c:50
> > 
> > At this point, there should not be any CQ/SQs (I detached the device from the
> > kernel driver which deletes all queues and bound it to vfio-pci instead), but
> > somehow a stale notifier is called on poweroff and the queue is bogus, causing
> > the segfault.
> > 
> > (gdb) p cq->cqid
> > $2 = 0x7880
> > 
> > My guess would be that we are not cleaning up the notifier properly. Currently
> > we do this
> > 
> >     if (cq->ioeventfd_enabled) {
> >         memory_region_del_eventfd(&n->iomem,
> >                                   0x1000 + offset, 4, false, 0, &cq->notifier);
> >         event_notifier_cleanup(&cq->notifier);
> >     }
> > 
> > 
> > Any ioeventfd experts that has some insights into what we are doing
> > wrong here? Something we need to flush? I tried with a test_and_clear on
> > the eventfd but that didnt do the trick.
> > 
> > I think we'd need to revert this until we can track down what is going wrong.
> 
> One more thing - I now also triggered the coredump with just a `modprobe
> vfio-pci` following a `nvme reset /dev/nvme0`.
> 
> Similar backtrace.

Alright. Forget about the iommu, that was just a coincidence.

This patch seems to fix it. I guess it is the
event_notifier_set_handler(..., NULL) that does the trick, but I'd like
to understand why ;)


diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 533ad14e7a61..3bc3c6bfbe78 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -4238,7 +4238,9 @@ static void nvme_cq_notifier(EventNotifier *e)
     NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
     NvmeCtrl *n = cq->ctrl;
 
-    event_notifier_test_and_clear(&cq->notifier);
+    if (!event_notifier_test_and_clear(e)) {
+        return;
+    }
 
     nvme_update_cq_head(cq);
 
@@ -4275,7 +4277,9 @@ static void nvme_sq_notifier(EventNotifier *e)
 {
     NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
 
-    event_notifier_test_and_clear(&sq->notifier);
+    if (!event_notifier_test_and_clear(e)) {
+        return;
+    }
 
     nvme_process_sq(sq);
 }
@@ -4307,6 +4311,8 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
     if (sq->ioeventfd_enabled) {
         memory_region_del_eventfd(&n->iomem,
                                   0x1000 + offset, 4, false, 0, &sq->notifier);
+        event_notifier_set_handler(&sq->notifier, NULL);
+        nvme_sq_notifier(&sq->notifier);
         event_notifier_cleanup(&sq->notifier);
     }
     g_free(sq->io_req);
@@ -4697,6 +4703,8 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
     if (cq->ioeventfd_enabled) {
         memory_region_del_eventfd(&n->iomem,
                                   0x1000 + offset, 4, false, 0, &cq->notifier);
+        event_notifier_set_handler(&cq->notifier, NULL);
+        nvme_cq_notifier(&cq->notifier);
         event_notifier_cleanup(&cq->notifier);
     }
     if (msix_enabled(&n->parent_obj)) {
Stefan Hajnoczi July 26, 2022, 12:35 p.m. UTC | #19
On Tue, 5 Jul 2022 at 10:25, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
>
> Add property "ioeventfd" which is enabled by default. When this is
> enabled, updates on the doorbell registers will cause KVM to signal
> an event to the QEMU main loop to handle the doorbell updates.
> Therefore, instead of letting the vcpu thread run both guest VM and
> IO emulation, we now use the main loop thread to do IO emulation and
> thus the vcpu thread has more cycles for the guest VM.
>
> Since ioeventfd does not tell us the exact value that is written, it is
> only useful when shadow doorbell buffer is enabled, where we check
> for the value in the shadow doorbell buffer when we get the doorbell
> update event.
>
> IOPS comparison on Linux 5.19-rc2: (Unit: KIOPS)
>
> qd           1   4  16  64
> qemu        35 121 176 153
> ioeventfd   41 133 258 313

Nice

>
> Changes since v3:
>  - Do not deregister ioeventfd when it was not enabled on a SQ/CQ
>
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/nvme/nvme.h |   5 +++
>  2 files changed, 118 insertions(+), 1 deletion(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index c952c34f94..4b75c5f549 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>
>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +
> +    if (req->sq->ioeventfd_enabled) {
> +        /* Post CQE directly since we are in main loop thread */
> +        nvme_post_cqes(cq);
> +    } else {
> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +    }

This change is somewhat independent of ioeventfd. I wonder how much of
the performance improvement is due to the sq ioeventfd and how much is
due to bypassing the completion timer.

In the qd=1 case I expect the completion timer to increase latency and
hurt performance. In the qd=64 case the timer increases batching,
reduces CPU consumption, and probably improves performance.

It would be nice to measure this in isolation, independently of
ioeventfd/IOThreads.
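
One way to get that measurement (sketch; always_post_cqes is a hypothetical
knob, not in the patch) would be to force the direct path independently of
ioeventfd and compare:

    if (cq->ctrl->params.always_post_cqes || req->sq->ioeventfd_enabled) {
        nvme_post_cqes(cq);   /* bypass the 500ns completion timer */
    } else {
        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
    }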

>  }
>
>  static void nvme_process_aers(void *opaque)
> @@ -4195,10 +4202,82 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_INVALID_OPCODE | NVME_DNR;
>  }
>
> +static void nvme_cq_notifier(EventNotifier *e)
> +{
> +    NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
> +    NvmeCtrl *n = cq->ctrl;
> +
> +    event_notifier_test_and_clear(&cq->notifier);
> +
> +    nvme_update_cq_head(cq);
> +
> +    if (cq->tail == cq->head) {
> +        if (cq->irq_enabled) {
> +            n->cq_pending--;
> +        }
> +
> +        nvme_irq_deassert(n, cq);
> +    }
> +
> +    nvme_post_cqes(cq);
> +}
> +
> +static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
> +{
> +    NvmeCtrl *n = cq->ctrl;
> +    uint16_t offset = (cq->cqid << 3) + (1 << 2);

Too many...

> +    int ret;
> +
> +    ret = event_notifier_init(&cq->notifier, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
> +    memory_region_add_eventfd(&n->iomem,
> +                              0x1000 + offset, 4, false, 0, &cq->notifier);

...magic numbers. Please use constants so it's clear what << 3, (1 <<
2), 0x1000, etc. mean.
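
Something along these lines, for example (illustrative names only, not from
the patch):

    #define NVME_DB_BASE    0x1000  /* start of the doorbell registers (SQ0TDBL) */
    #define NVME_DB_STRIDE  4       /* 4 << CAP.DSTRD, with DSTRD = 0 */

    static inline hwaddr nvme_sq_db_offset(uint16_t sqid)
    {
        return NVME_DB_BASE + 2 * sqid * NVME_DB_STRIDE;       /* SQyTDBL */
    }

    static inline hwaddr nvme_cq_db_offset(uint16_t cqid)
    {
        return NVME_DB_BASE + (2 * cqid + 1) * NVME_DB_STRIDE; /* CQyHDBL */
    }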

> +
> +    return 0;
> +}
> +
> +static void nvme_sq_notifier(EventNotifier *e)
> +{
> +    NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
> +
> +    event_notifier_test_and_clear(&sq->notifier);
> +
> +    nvme_process_sq(sq);
> +}
> +
> +static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
> +{
> +    NvmeCtrl *n = sq->ctrl;
> +    uint16_t offset = sq->sqid << 3;
> +    int ret;
> +
> +    ret = event_notifier_init(&sq->notifier, 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
> +    memory_region_add_eventfd(&n->iomem,
> +                              0x1000 + offset, 4, false, 0, &sq->notifier);
> +
> +    return 0;
> +}
> +
>  static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
>  {
> +    uint16_t offset = sq->sqid << 3;
> +
>      n->sq[sq->sqid] = NULL;
>      timer_free(sq->timer);
> +    if (sq->ioeventfd_enabled) {
> +        memory_region_del_eventfd(&n->iomem,
> +                                  0x1000 + offset, 4, false, 0, &sq->notifier);
> +        event_notifier_cleanup(&sq->notifier);
> +    }
>      g_free(sq->io_req);
>      if (sq->sqid) {
>          g_free(sq);
> @@ -4271,6 +4350,12 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      if (n->dbbuf_enabled) {
>          sq->db_addr = n->dbbuf_dbs + (sqid << 3);
>          sq->ei_addr = n->dbbuf_eis + (sqid << 3);
> +
> +        if (n->params.ioeventfd && sq->sqid != 0) {
> +            if (!nvme_init_sq_ioeventfd(sq)) {
> +                sq->ioeventfd_enabled = true;
> +            }
> +        }
>      }
>
>      assert(n->cq[cqid]);
> @@ -4575,8 +4660,15 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>
>  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>  {
> +    uint16_t offset = (cq->cqid << 3) + (1 << 2);
> +
>      n->cq[cq->cqid] = NULL;
>      timer_free(cq->timer);
> +    if (cq->ioeventfd_enabled) {
> +        memory_region_del_eventfd(&n->iomem,
> +                                  0x1000 + offset, 4, false, 0, &cq->notifier);
> +        event_notifier_cleanup(&cq->notifier);
> +    }
>      if (msix_enabled(&n->parent_obj)) {
>          msix_vector_unuse(&n->parent_obj, cq->vector);
>      }
> @@ -4635,6 +4727,12 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>      if (n->dbbuf_enabled) {
>          cq->db_addr = n->dbbuf_dbs + (cqid << 3) + (1 << 2);
>          cq->ei_addr = n->dbbuf_eis + (cqid << 3) + (1 << 2);
> +
> +        if (n->params.ioeventfd && cqid != 0) {
> +            if (!nvme_init_cq_ioeventfd(cq)) {
> +                cq->ioeventfd_enabled = true;
> +            }
> +        }
>      }
>      n->cq[cqid] = cq;
>      cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> @@ -5793,6 +5891,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>      uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
>      uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
>      int i;
> +    int ret;
>
>      /* Address should be page aligned */
>      if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
> @@ -5818,6 +5917,12 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>              sq->ei_addr = eis_addr + (i << 3);
>              pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
>                      sizeof(sq->tail));
> +
> +            if (n->params.ioeventfd && sq->sqid != 0) {
> +                if (!nvme_init_sq_ioeventfd(sq)) {
> +                    sq->ioeventfd_enabled = true;
> +                }
> +            }
>          }
>
>          if (cq) {
> @@ -5826,6 +5931,12 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>              cq->ei_addr = eis_addr + (i << 3) + (1 << 2);
>              pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
>                      sizeof(cq->head));
> +
> +            if (n->params.ioeventfd && cq->cqid != 0) {
> +                if (!nvme_init_cq_ioeventfd(cq)) {
> +                    cq->ioeventfd_enabled = true;
> +                }
> +            }
>          }
>      }
>
> @@ -7040,6 +7151,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
>      DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
>                       params.auto_transition_zones, true),
> +    DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 4452e4b1bf..2a9beea0c8 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -369,6 +369,8 @@ typedef struct NvmeSQueue {
>      uint64_t    db_addr;
>      uint64_t    ei_addr;
>      QEMUTimer   *timer;
> +    EventNotifier notifier;
> +    bool        ioeventfd_enabled;
>      NvmeRequest *io_req;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>      QTAILQ_HEAD(, NvmeRequest) out_req_list;
> @@ -388,6 +390,8 @@ typedef struct NvmeCQueue {
>      uint64_t    db_addr;
>      uint64_t    ei_addr;
>      QEMUTimer   *timer;
> +    EventNotifier notifier;
> +    bool        ioeventfd_enabled;
>      QTAILQ_HEAD(, NvmeSQueue) sq_list;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>  } NvmeCQueue;
> @@ -410,6 +414,7 @@ typedef struct NvmeParams {
>      uint8_t  zasl;
>      bool     auto_transition_zones;
>      bool     legacy_cmb;
> +    bool     ioeventfd;
>  } NvmeParams;
>
>  typedef struct NvmeCtrl {
> --
> 2.25.1
>
>
Klaus Jensen July 27, 2022, 7:06 a.m. UTC | #20
On Jul 26 14:08, Klaus Jensen wrote:
> 
> Alright. Forget about the iommu, that was just a coincidence.
> 
> This patch seems to fix it. I guess it is the
> event_notifier_set_handler(..., NULL) that does the trick, but I'd like
> to understand why ;)
> 
> 
> diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> index 533ad14e7a61..3bc3c6bfbe78 100644
> --- i/hw/nvme/ctrl.c
> +++ w/hw/nvme/ctrl.c
> @@ -4238,7 +4238,9 @@ static void nvme_cq_notifier(EventNotifier *e)
>      NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
>      NvmeCtrl *n = cq->ctrl;
>  
> -    event_notifier_test_and_clear(&cq->notifier);
> +    if (!event_notifier_test_and_clear(e)) {
> +        return;
> +    }
>  
>      nvme_update_cq_head(cq);
>  
> @@ -4275,7 +4277,9 @@ static void nvme_sq_notifier(EventNotifier *e)
>  {
>      NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
>  
> -    event_notifier_test_and_clear(&sq->notifier);
> +    if (!event_notifier_test_and_clear(e)) {
> +        return;
> +    }
>  
>      nvme_process_sq(sq);
>  }
> @@ -4307,6 +4311,8 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
>      if (sq->ioeventfd_enabled) {
>          memory_region_del_eventfd(&n->iomem,
>                                    0x1000 + offset, 4, false, 0, &sq->notifier);
> +        event_notifier_set_handler(&sq->notifier, NULL);
> +        nvme_sq_notifier(&sq->notifier);
>          event_notifier_cleanup(&sq->notifier);
>      }
>      g_free(sq->io_req);
> @@ -4697,6 +4703,8 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>      if (cq->ioeventfd_enabled) {
>          memory_region_del_eventfd(&n->iomem,
>                                    0x1000 + offset, 4, false, 0, &cq->notifier);
> +        event_notifier_set_handler(&cq->notifier, NULL);
> +        nvme_cq_notifier(&cq->notifier);
>          event_notifier_cleanup(&cq->notifier);
>      }
>      if (msix_enabled(&n->parent_obj)) {

Jinhao,

Do you have any comments on the above patch - does it make sense to you,
considering the effort you've put into researching how virtio does
this?
Jinhao Fan July 27, 2022, 8:16 a.m. UTC | #21
at 3:06 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Jul 26 14:08, Klaus Jensen wrote:
>> Alright. Forget about the iommu, that was just a coincidence.
>> 
>> This patch seems to fix it. I guess it is the
>> event_notifier_set_handler(..., NULL) that does the trick, but I'd like
>> to understand why ;)
>> 
>> 
>> diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>> index 533ad14e7a61..3bc3c6bfbe78 100644
>> --- i/hw/nvme/ctrl.c
>> +++ w/hw/nvme/ctrl.c
>> @@ -4238,7 +4238,9 @@ static void nvme_cq_notifier(EventNotifier *e)
>>     NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
>>     NvmeCtrl *n = cq->ctrl;
>> 
>> -    event_notifier_test_and_clear(&cq->notifier);
>> +    if (!event_notifier_test_and_clear(e)) {
>> +        return;
>> +    }
>> 
>>     nvme_update_cq_head(cq);
>> 
>> @@ -4275,7 +4277,9 @@ static void nvme_sq_notifier(EventNotifier *e)
>> {
>>     NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
>> 
>> -    event_notifier_test_and_clear(&sq->notifier);
>> +    if (!event_notifier_test_and_clear(e)) {
>> +        return;
>> +    }
>> 

Yes, virtio also checks the return value of event_notifier_test_and_clear().

>>     nvme_process_sq(sq);
>> }
>> @@ -4307,6 +4311,8 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
>>     if (sq->ioeventfd_enabled) {
>>         memory_region_del_eventfd(&n->iomem,
>>                                   0x1000 + offset, 4, false, 0, &sq->notifier);
>> +        event_notifier_set_handler(&sq->notifier, NULL);
>> +        nvme_sq_notifier(&sq->notifier);
>>         event_notifier_cleanup(&sq->notifier);
>>     }
>>     g_free(sq->io_req);
>> @@ -4697,6 +4703,8 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>>     if (cq->ioeventfd_enabled) {
>>         memory_region_del_eventfd(&n->iomem,
>>                                   0x1000 + offset, 4, false, 0, &cq->notifier);
>> +        event_notifier_set_handler(&cq->notifier, NULL);
>> +        nvme_cq_notifier(&cq->notifier);
>>         event_notifier_cleanup(&cq->notifier);
>>     }
>>     if (msix_enabled(&n->parent_obj)) {
> 

I'm a bit confused here. I will look further into how virtio handles queue deletion today.
Klaus Jensen July 27, 2022, 8:21 a.m. UTC | #22
On Jul 27 16:16, Jinhao Fan wrote:
> at 3:06 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> > On Jul 26 14:08, Klaus Jensen wrote:
> >> Alright. Forget about the iommu, that was just a coincidence.
> >> 
> >> This patch seems to fix it. I guess it is the
> >> event_notifier_set_handler(..., NULL) that does the trick, but I'd like
> >> to understand why ;)
> >> 
> >> 
> >> diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> >> index 533ad14e7a61..3bc3c6bfbe78 100644
> >> --- i/hw/nvme/ctrl.c
> >> +++ w/hw/nvme/ctrl.c
> >> @@ -4238,7 +4238,9 @@ static void nvme_cq_notifier(EventNotifier *e)
> >>     NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
> >>     NvmeCtrl *n = cq->ctrl;
> >> 
> >> -    event_notifier_test_and_clear(&cq->notifier);
> >> +    if (!event_notifier_test_and_clear(e)) {
> >> +        return;
> >> +    }
> >> 
> >>     nvme_update_cq_head(cq);
> >> 
> >> @@ -4275,7 +4277,9 @@ static void nvme_sq_notifier(EventNotifier *e)
> >> {
> >>     NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
> >> 
> >> -    event_notifier_test_and_clear(&sq->notifier);
> >> +    if (!event_notifier_test_and_clear(e)) {
> >> +        return;
> >> +    }
> >> 
> 
> Yes, virtio also checks the return value of event_notifier_test_and_clear().
> 
> >>     nvme_process_sq(sq);
> >> }
> >> @@ -4307,6 +4311,8 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
> >>     if (sq->ioeventfd_enabled) {
> >>         memory_region_del_eventfd(&n->iomem,
> >>                                   0x1000 + offset, 4, false, 0, &sq->notifier);
> >> +        event_notifier_set_handler(&sq->notifier, NULL);
> >> +        nvme_sq_notifier(&sq->notifier);
> >>         event_notifier_cleanup(&sq->notifier);
> >>     }
> >>     g_free(sq->io_req);
> >> @@ -4697,6 +4703,8 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> >>     if (cq->ioeventfd_enabled) {
> >>         memory_region_del_eventfd(&n->iomem,
> >>                                   0x1000 + offset, 4, false, 0, &cq->notifier);
> >> +        event_notifier_set_handler(&cq->notifier, NULL);
> >> +        nvme_cq_notifier(&cq->notifier);
> >>         event_notifier_cleanup(&cq->notifier);
> >>     }
> >>     if (msix_enabled(&n->parent_obj)) {
> > 
> 
> I'm a bit confused here. I will look further into how virtio handles queue deletion today.

Yeah, the API documentation is not exactly exhaustive on the specifics
here, so I kinda inferred this from the virtio code.

If this is the way to do it, then I think I will follow up with a patch
for the event notifier API with a note on how teardown should be done.

Paolo - get_maintainer.pl spits you out for main-loop.h. Any chance you
could shed some light on this?
Jinhao Fan July 27, 2022, 3:09 p.m. UTC | #23
After digging through the source code, I found that event_notifier_cleanup()
only closes the eventfd; it does not deregister the event from QEMU's event
loop. It is event_notifier_set_handler() that actually interacts with the
event loop: it is a wrapper around aio_set_event_notifier(), which in turn
wraps aio_set_fd_handler(). Passing a NULL handler removes the fd handler
from the event loop.

Therefore, originally in nvme_free_sq/cq(), we closed the eventfd but did
not remove its handler. My guess is that some time later the fd number is
reused and the corresponding file is written to (a POLLIN event), which makes
the event loop call the stale handler and causes this bug.

So the correct order is:

Init:
1. event_notifier_init: Open the eventfd
2. event_notifier_set_handler: Register on the event loop
3. memory_region_add_eventfd: Associate with IO region

Cleanup:
1. memory_region_del_eventfd: De-associate with IO region
2. event_notifier_set_handler(NULL): Remove from the event loop
3. event_notifier_cleanup: Close the eventfd
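
To make the ordering concrete, here is a minimal sketch of nvme_free_cq()
with that teardown order, assembled from the snippets earlier in this thread
(illustrative only, not necessarily the final committed code):

static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
{
    /* CQ head doorbell: 0x1000 + (cqid << 3) + 4 */
    uint16_t offset = (cq->cqid << 3) + (1 << 2);

    n->cq[cq->cqid] = NULL;
    timer_free(cq->timer);

    if (cq->ioeventfd_enabled) {
        /* 1. stop KVM from signalling this eventfd for the doorbell range */
        memory_region_del_eventfd(&n->iomem,
                                  0x1000 + offset, 4, false, 0, &cq->notifier);
        /* 2. remove the read handler from the main loop *before* closing
         *    the fd, so a later reuse of the fd number cannot fire a stale
         *    handler (the fix posted earlier also calls nvme_cq_notifier()
         *    here to drain any pending event) */
        event_notifier_set_handler(&cq->notifier, NULL);
        /* 3. only now close the eventfd */
        event_notifier_cleanup(&cq->notifier);
    }

    if (msix_enabled(&n->parent_obj)) {
        msix_vector_unuse(&n->parent_obj, cq->vector);
    }

    /* ... remaining teardown unchanged ... */
}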

Patch

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index c952c34f94..4b75c5f549 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1374,7 +1374,14 @@  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
-    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+
+    if (req->sq->ioeventfd_enabled) {
+        /* Post CQE directly since we are in main loop thread */
+        nvme_post_cqes(cq);
+    } else {
+        /* Schedule the timer to post CQE later since we are in vcpu thread */
+        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+    }
 }
 
 static void nvme_process_aers(void *opaque)
@@ -4195,10 +4202,82 @@  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
+static void nvme_cq_notifier(EventNotifier *e)
+{
+    NvmeCQueue *cq = container_of(e, NvmeCQueue, notifier);
+    NvmeCtrl *n = cq->ctrl;
+
+    event_notifier_test_and_clear(&cq->notifier);
+
+    nvme_update_cq_head(cq);
+
+    if (cq->tail == cq->head) {
+        if (cq->irq_enabled) {
+            n->cq_pending--;
+        }
+
+        nvme_irq_deassert(n, cq);
+    }
+
+    nvme_post_cqes(cq);
+}
+
+static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
+{
+    NvmeCtrl *n = cq->ctrl;
+    uint16_t offset = (cq->cqid << 3) + (1 << 2);
+    int ret;
+
+    ret = event_notifier_init(&cq->notifier, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
+    memory_region_add_eventfd(&n->iomem,
+                              0x1000 + offset, 4, false, 0, &cq->notifier);
+    
+    return 0;
+}
+
+static void nvme_sq_notifier(EventNotifier *e)
+{
+    NvmeSQueue *sq = container_of(e, NvmeSQueue, notifier);
+
+    event_notifier_test_and_clear(&sq->notifier);
+
+    nvme_process_sq(sq);
+}
+
+static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
+{
+    NvmeCtrl *n = sq->ctrl;
+    uint16_t offset = sq->sqid << 3;
+    int ret;
+
+    ret = event_notifier_init(&sq->notifier, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
+    memory_region_add_eventfd(&n->iomem,
+                              0x1000 + offset, 4, false, 0, &sq->notifier);
+
+    return 0;
+}
+
 static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 {
+    uint16_t offset = sq->sqid << 3;
+
     n->sq[sq->sqid] = NULL;
     timer_free(sq->timer);
+    if (sq->ioeventfd_enabled) {
+        memory_region_del_eventfd(&n->iomem,
+                                  0x1000 + offset, 4, false, 0, &sq->notifier);
+        event_notifier_cleanup(&sq->notifier);
+    }
     g_free(sq->io_req);
     if (sq->sqid) {
         g_free(sq);
@@ -4271,6 +4350,12 @@  static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     if (n->dbbuf_enabled) {
         sq->db_addr = n->dbbuf_dbs + (sqid << 3);
         sq->ei_addr = n->dbbuf_eis + (sqid << 3);
+            
+        if (n->params.ioeventfd && sq->sqid != 0) {
+            if (!nvme_init_sq_ioeventfd(sq)) {
+                sq->ioeventfd_enabled = true;
+            }
+        }
     }
 
     assert(n->cq[cqid]);
@@ -4575,8 +4660,15 @@  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 
 static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 {
+    uint16_t offset = (cq->cqid << 3) + (1 << 2);
+
     n->cq[cq->cqid] = NULL;
     timer_free(cq->timer);
+    if (cq->ioeventfd_enabled) {
+        memory_region_del_eventfd(&n->iomem,
+                                  0x1000 + offset, 4, false, 0, &cq->notifier);
+        event_notifier_cleanup(&cq->notifier);
+    }
     if (msix_enabled(&n->parent_obj)) {
         msix_vector_unuse(&n->parent_obj, cq->vector);
     }
@@ -4635,6 +4727,12 @@  static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     if (n->dbbuf_enabled) {
         cq->db_addr = n->dbbuf_dbs + (cqid << 3) + (1 << 2);
         cq->ei_addr = n->dbbuf_eis + (cqid << 3) + (1 << 2);
+
+        if (n->params.ioeventfd && cqid != 0) {
+            if (!nvme_init_cq_ioeventfd(cq)) {
+                cq->ioeventfd_enabled = true;
+            }
+        }
     }
     n->cq[cqid] = cq;
     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
@@ -5793,6 +5891,7 @@  static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
     uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
     uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
     int i;
+    int ret;
 
     /* Address should be page aligned */
     if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
@@ -5818,6 +5917,12 @@  static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
             sq->ei_addr = eis_addr + (i << 3);
             pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
                     sizeof(sq->tail));
+            
+            if (n->params.ioeventfd && sq->sqid != 0) {
+                if (!nvme_init_sq_ioeventfd(sq)) {
+                    sq->ioeventfd_enabled = true;
+                }
+            }
         }
 
         if (cq) {
@@ -5826,6 +5931,12 @@  static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
             cq->ei_addr = eis_addr + (i << 3) + (1 << 2);
             pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
                     sizeof(cq->head));
+            
+            if (n->params.ioeventfd && cq->cqid != 0) {
+                if (!nvme_init_cq_ioeventfd(cq)) {
+                    cq->ioeventfd_enabled = true;
+                }
+            }
         }
     }
 
@@ -7040,6 +7151,7 @@  static Property nvme_props[] = {
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
+    DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 4452e4b1bf..2a9beea0c8 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -369,6 +369,8 @@  typedef struct NvmeSQueue {
     uint64_t    db_addr;
     uint64_t    ei_addr;
     QEMUTimer   *timer;
+    EventNotifier notifier;
+    bool        ioeventfd_enabled;
     NvmeRequest *io_req;
     QTAILQ_HEAD(, NvmeRequest) req_list;
     QTAILQ_HEAD(, NvmeRequest) out_req_list;
@@ -388,6 +390,8 @@  typedef struct NvmeCQueue {
     uint64_t    db_addr;
     uint64_t    ei_addr;
     QEMUTimer   *timer;
+    EventNotifier notifier;
+    bool        ioeventfd_enabled;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
     QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
@@ -410,6 +414,7 @@  typedef struct NvmeParams {
     uint8_t  zasl;
     bool     auto_transition_zones;
     bool     legacy_cmb;
+    bool     ioeventfd;
 } NvmeParams;
 
 typedef struct NvmeCtrl {