| Message ID | 1446830423-25027-5-git-send-email-axboe@fb.com (mailing list archive) |
| --- | --- |
| State | Awaiting Upstream |
| Delegated to | Jens Axboe |
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Jens Axboe
> Sent: Friday, November 6, 2015 11:20 AM
...
> Subject: [PATCH 4/5] NVMe: add blk polling support
>
> Add nvme_poll(), which will check a specific completion queue for
> command completions. Wire that up to the new block layer poll
> mechanism.
>
> Later on we'll setup specific sq/cq pairs that don't have interrups
> enabled, so we can do more efficient polling. As of this patch, an
> IRQ will still trigger on command completion.
...
> -static int nvme_process_cq(struct nvme_queue *nvmeq)
> +static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
> {
>         u16 head, phase;
>
> @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
>                         head = 0;
>                         phase = !phase;
>                 }
> +               if (tag && *tag == cqe.command_id)
> +                       *tag = -1;
>                 ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
>                 fn(nvmeq, ctx, &cqe);
>         }

The NVMe completion queue entries are 16 bytes long. Although it's
most likely to write them from 0..15 in one PCIe Memory Write
transaction, the NVMe device could write those bytes in any order.
It could thus update the command identifier before the other bytes,
causing this code to process invalid stale values in the other
fields.

When using interrupts, the MSI-X interrupt ensures the whole entry is
updated first, since the interrupt itself is delivered with a PCIe
Memory Write transaction and upstream writes do not pass upstream
writes.

The existing interrupt handler loops looking for additional
completions, so it is susceptible to the same problem - it's just
less likely. The only concern there is completions that arrive while
the CPU is in the interrupt handler.

---
Robert Elliott, HPE Persistent Memory
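For illustration, a minimal standalone sketch of the hazard described above (single threaded and simplified; the field layout is assumed to mirror the driver's struct nvme_completion, and the struct and function names here are made up, not driver code). A consumer that matches only on command_id can pick up Dword 0-2 values left over from the previous pass around the queue if the device makes Dword 3 visible first:

```c
/*
 * Sketch only: simulate a CQE whose Dword 3 (command_id/status) has been
 * updated for a new completion while Dwords 0-2 still hold data from an
 * older completion, and a consumer that keys on command_id alone.
 */
#include <stdint.h>
#include <stdio.h>

struct cqe {                    /* assumed to mirror struct nvme_completion */
        uint32_t result;        /* Dword 0: command-specific result */
        uint32_t rsvd;          /* Dword 1: reserved */
        uint16_t sq_head;       /* Dword 2: submission queue head pointer */
        uint16_t sq_id;         /* Dword 2: submission queue identifier */
        uint16_t command_id;    /* Dword 3: identifier of the completed command */
        uint16_t status;        /* Dword 3: status, bit 0 is the Phase Tag */
};

/* Consumer that treats a matching command_id as "this entry is ready". */
static void consume_if_match(const struct cqe *e, unsigned int tag)
{
        if (e->command_id == tag)
                printf("tag %u: result=0x%x sq_head=%u\n", tag,
                       (unsigned int)e->result, (unsigned int)e->sq_head);
}

int main(void)
{
        /* Dwords 0-2 still hold values from an older completion. */
        struct cqe e = { .result = 0xdeadbeef, .sq_head = 7,
                         .command_id = 3, .status = 0 };

        /* Device has written only Dword 3 of the new completion so far. */
        e.command_id = 42;
        e.status = 1;
        consume_if_match(&e, 42);       /* reports the stale result/sq_head */

        /* The remaining Dwords land afterwards. */
        e.result = 0;
        e.sq_head = 9;
        consume_if_match(&e, 42);       /* now reports the real values */
        return 0;
}
```

In the real driver the window is between the device's DMA writes and the CPU's reads rather than two steps of one thread, but the stale values a poller could observe are the same.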
On Fri, Nov 06, 2015 at 03:46:07PM -0800, Elliott, Robert (Persistent Memory) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Jens Axboe
> > Sent: Friday, November 6, 2015 11:20 AM
> ...
> > Subject: [PATCH 4/5] NVMe: add blk polling support
> >
> > Add nvme_poll(), which will check a specific completion queue for
> > command completions. Wire that up to the new block layer poll
> > mechanism.
> >
> > Later on we'll setup specific sq/cq pairs that don't have interrups
> > enabled, so we can do more efficient polling. As of this patch, an
> > IRQ will still trigger on command completion.
> ...
> > -static int nvme_process_cq(struct nvme_queue *nvmeq)
> > +static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
> > {
> >         u16 head, phase;
> >
> > @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
> >                         head = 0;
> >                         phase = !phase;
> >                 }
> > +               if (tag && *tag == cqe.command_id)
> > +                       *tag = -1;
> >                 ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
> >                 fn(nvmeq, ctx, &cqe);
> >         }
>
> The NVMe completion queue entries are 16 bytes long. Although it's
> most likely to write them from 0..15 in one PCIe Memory Write
> transaction, the NVMe device could write those bytes in any order.
> It could thus update the command identifier before the other bytes,
> causing this code to process invalid stale values in the other
> fields.

That's a very interesting point. We are okay if we can rely on the phase
bit, which we can by my reading of the spec. Coalescing would not work
if the driver can observe a new phase in a partially written CQE.
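For reference, a standalone sketch of the Phase Tag convention being relied on here (the queue depth, layout, and names are assumptions for the example, not driver code). An entry is treated as new only while bit 0 of its status word matches the phase the consumer expects, and the expected phase flips each time the head wraps, so entries left over from the previous lap fail the check:

```c
/* Sketch of phase-based completion queue consumption. */
#include <stdint.h>
#include <stdio.h>

#define Q_DEPTH 4

struct cqe {
        uint32_t result;
        uint32_t rsvd;
        uint16_t sq_head;
        uint16_t sq_id;
        uint16_t command_id;
        uint16_t status;        /* bit 0 is the Phase Tag */
};

static struct cqe cq[Q_DEPTH];  /* zeroed, like a freshly created queue */
static uint16_t head;
static uint16_t phase = 1;      /* consumer starts out expecting phase 1 */

/* "Device" side: post a completion for cid with the given phase value. */
static void post_completion(uint16_t tail, uint16_t cid, uint16_t dev_phase)
{
        cq[tail].command_id = cid;
        cq[tail].status = dev_phase;    /* status code 0, Phase Tag set */
}

/* "Host" side: reap every entry whose Phase Tag matches the expected phase. */
static void process_cq(void)
{
        while ((cq[head].status & 1) == phase) {
                printf("completed command_id %u\n",
                       (unsigned int)cq[head].command_id);
                if (++head == Q_DEPTH) {
                        head = 0;
                        phase = !phase; /* next lap expects the other phase */
                }
        }
}

int main(void)
{
        post_completion(0, 10, 1);
        post_completion(1, 11, 1);
        process_cq();   /* reaps 10 and 11, stops at entry 2 (still phase 0) */
        return 0;
}
```

As the replies note, this only helps if a new phase value is never visible in a partially written entry.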
> -----Original Message-----
> From: Keith Busch [mailto:keith.busch@intel.com]
> Sent: Friday, November 6, 2015 6:58 PM
> Subject: Re: [PATCH 4/5] NVMe: add blk polling support
>
> On Fri, Nov 06, 2015 at 03:46:07PM -0800, Elliott, Robert wrote:
> > > -----Original Message-----
> > > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > > owner@vger.kernel.org] On Behalf Of Jens Axboe
> > > Sent: Friday, November 6, 2015 11:20 AM
> > ...
> > > Subject: [PATCH 4/5] NVMe: add blk polling support
> > >
> > >
> > > @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
> > >                         head = 0;
> > >                         phase = !phase;
> > >                 }
> > > +               if (tag && *tag == cqe.command_id)
> > > +                       *tag = -1;
> > >                 ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
> > >                 fn(nvmeq, ctx, &cqe);
> > >         }
> >
> > The NVMe completion queue entries are 16 bytes long. Although it's
> > most likely to write them from 0..15 in one PCIe Memory Write
> > transaction, the NVMe device could write those bytes in any order.
> > It could thus update the command identifier before the other bytes,
> > causing this code to process invalid stale values in the other
> > fields.
>
> That's a very interesting point. We are okay if we can rely on the phase
> bit, which we can by my reading of the spec. Coalescing would not work
> if the driver can observe a new phase in a partially written CQE.

The Phase bit is in the same Dword as the Command Identifier, so it
doesn't help. If that Dword were written first, then the preceding
three Dwords might not be valid yet.

I'll ask our NVMe representative to propose wording like this:

    4.6 Completion Queue Entry
    ...
    The controller shall write the Dwords in a CQE from lowest to
    highest (e.g., if the CQE is 16 bytes, write Dword 0, then Dword 1,
    then Dword 2, then Dword 3). This ensures that the first three
    Dwords are valid when the Phase Tag and Command Identifier are
    valid. Additional Dwords, if any, are not valid at that time.

and refer to that rule in 7.1 where host processing of completions is
discussed.

---
Robert Elliott, HPE Persistent Memory
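To make the Dword argument concrete, a small standalone sketch (layout assumed to mirror the driver's struct nvme_completion) that prints which Dword each CQE field occupies. The Phase Tag (status bit 0) and the Command Identifier share Dword 3, while result, sq_head, and sq_id sit in Dwords 0-2, which is what the proposed lowest-to-highest write rule would make trustworthy once the phase flips:

```c
/* Print the Dword index of each field in a 16-byte CQE (assumed layout). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cqe {
        uint32_t result;        /* command-specific result */
        uint32_t rsvd;
        uint16_t sq_head;
        uint16_t sq_id;
        uint16_t command_id;
        uint16_t status;        /* bit 0 is the Phase Tag */
};

#define DW(field) (offsetof(struct cqe, field) / 4)

int main(void)
{
        printf("CQE size: %zu bytes (%zu Dwords)\n",
               sizeof(struct cqe), sizeof(struct cqe) / 4);
        printf("result     -> Dword %zu\n", DW(result));
        printf("sq_head    -> Dword %zu\n", DW(sq_head));
        printf("sq_id      -> Dword %zu\n", DW(sq_id));
        printf("command_id -> Dword %zu\n", DW(command_id));
        printf("status     -> Dword %zu\n", DW(status));
        return 0;
}
```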
```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..4a715f49f5db 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -90,7 +90,7 @@ static struct class *nvme_class;
 
 static int __nvme_reset(struct nvme_dev *dev);
 static int nvme_reset(struct nvme_dev *dev);
-static int nvme_process_cq(struct nvme_queue *nvmeq);
+static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dead_ctrl(struct nvme_dev *dev);
 
 struct async_cmd_info {
@@ -935,7 +935,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
         return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-static int nvme_process_cq(struct nvme_queue *nvmeq)
+static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 {
         u16 head, phase;
 
@@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
                         head = 0;
                         phase = !phase;
                 }
+                if (tag && *tag == cqe.command_id)
+                        *tag = -1;
                 ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
                 fn(nvmeq, ctx, &cqe);
         }
@@ -964,14 +966,18 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
          * a big problem.
          */
         if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
-                return 0;
+                return;
 
         writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
         nvmeq->cq_head = head;
         nvmeq->cq_phase = phase;
 
         nvmeq->cqe_seen = 1;
-        return 1;
+}
+
+static void nvme_process_cq(struct nvme_queue *nvmeq)
+{
+        __nvme_process_cq(nvmeq, NULL);
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
@@ -995,6 +1001,23 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
         return IRQ_WAKE_THREAD;
 }
 
+static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+{
+        struct nvme_queue *nvmeq = hctx->driver_data;
+
+        if ((le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
+            nvmeq->cq_phase) {
+                spin_lock_irq(&nvmeq->q_lock);
+                __nvme_process_cq(nvmeq, &tag);
+                spin_unlock_irq(&nvmeq->q_lock);
+
+                if (tag == -1)
+                        return 1;
+        }
+
+        return 0;
+}
+
 /*
  * Returns 0 on success. If the result is negative, it's a Linux error code;
  * if the result is positive, it's an NVM Express status code
@@ -1654,6 +1677,7 @@ static struct blk_mq_ops nvme_mq_ops = {
         .init_hctx      = nvme_init_hctx,
         .init_request   = nvme_init_request,
         .timeout        = nvme_timeout,
+        .poll           = nvme_poll,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
```
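For context, a hypothetical caller loop (an illustration only, not the block layer code added elsewhere in this series) showing the contract the new .poll hook implements: it is handed the hardware context and the tag being waited on, and nvme_poll() returns nonzero once __nvme_process_cq() has seen that tag and cleared it to -1.

```c
/*
 * Hypothetical illustration only; the real caller lives in the block layer
 * patch of this series.  It busy-waits on the driver's ->poll hook, which
 * peeks at the phase bit of the current CQ head without the lock and then
 * reaps completions under q_lock.
 */
static void poll_for_completion(struct blk_mq_hw_ctx *hctx, unsigned int tag)
{
        while (!hctx->queue->mq_ops->poll(hctx, tag))
                cpu_relax();    /* spin until the tag is reported complete */
}
```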