Message ID: 20241025-issue-2388-v1-1-16707e0d3342@samsung.com (mailing list archive)
State: New
Series: hw/nvme: fix handling of over-committed queues
Hi,

I have applied this patch to the same QEMU source tree where I reproduced
this issue. I changed the queue size to 3 on the OSv side to trigger this
bug, but unfortunately, I still see the same behavior of the OSv guest
hanging.

Regards,
Waldemar Kozaczuk

On Fri, Oct 25, 2024 at 6:51 AM Klaus Jensen <its@irrelevant.dk> wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands completes first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not posted CQEs for all the previous commands yet, the device
> will have fewer than N resources available to process the commands, so
> queue processing is suspended.
>
> And here lies an 11-year latent bug. In the absence of any additional
> tail updates on the submission queue, we never schedule the processing
> bottom half again unless we observe a head update on an associated full
> completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
> (in the absence of over-commit, of course). Incidentally, that "kick all
> associated SQs" mechanism can now be killed, since we now just schedule
> queue processing when we return a processing resource to a non-empty
> submission queue, which happens to cover both edge cases.
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver nor SPDK uses it). But then
> OSv shows up with a driver that actually does. I salute you.
>
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/nvme/ctrl.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          /* Completion queue doorbell write */
>
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
>          cq->head = new_head;
>          if (!qid && n->dbbuf_enabled) {
>              stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
>          }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> -            qemu_bh_schedule(cq->bh);
> -        }
>
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>
> ---
> base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
> change-id: 20241025-issue-2388-bd047487f74c
>
> Best regards,
> --
> Klaus Jensen <k.jensen@samsung.com>
On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }

Shouldn't we schedule the bottom half after the req has been added to the
list? I think everything the callback needs should be written prior to
calling qemu_bh_schedule().
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */

         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;

         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)

         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);

-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
         cq->head = new_head;
         if (!qid && n->dbbuf_enabled) {
             stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
         }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
-            qemu_bh_schedule(cq->bh);
-        }

         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {