
[4/5] NVMe: add blk polling support

Message ID 1446830423-25027-5-git-send-email-axboe@fb.com (mailing list archive)
State Awaiting Upstream
Delegated to: Jens Axboe

Commit Message

Jens Axboe Nov. 6, 2015, 5:20 p.m. UTC
Add nvme_poll(), which will check a specific completion queue for
command completions. Wire that up to the new block layer poll
mechanism.

Later on we'll set up specific sq/cq pairs that don't have interrupts
enabled, so we can do more efficient polling. As of this patch, an
IRQ will still trigger on command completion.

Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)
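
For context, the ->poll hook added by this patch is driven by the
block layer's new polling loop, which a companion patch in this series
adds. Below is a minimal sketch of that caller side, with simplified,
illustrative names; only the ->poll signature itself comes from this
patch.

/*
 * Illustrative sketch of the block-layer caller; names and structure
 * are simplified from the companion patch in this series.  The loop
 * spins on the driver's ->poll() hook until the command identified
 * by @tag completes, instead of sleeping until the IRQ fires.
 */
static bool blk_poll_sketch(struct blk_mq_hw_ctx *hctx, unsigned int tag)
{
    while (!need_resched()) {
        if (hctx->queue->mq_ops->poll(hctx, tag))
            return true;    /* our command completed */
        cpu_relax();        /* not yet; keep spinning */
    }
    return false;           /* give up; the IRQ path will finish it */
}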

Comments

Elliott, Robert (Servers) Nov. 6, 2015, 11:46 p.m. UTC | #1
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Jens Axboe
> Sent: Friday, November 6, 2015 11:20 AM
...
> Subject: [PATCH 4/5] NVMe: add blk polling support
> 
> Add nvme_poll(), which will check a specific completion queue for
> command completions. Wire that up to the new block layer poll
> mechanism.
> 
> Later on we'll set up specific sq/cq pairs that don't have interrupts
> enabled, so we can do more efficient polling. As of this patch, an
> IRQ will still trigger on command completion.
...
> -static int nvme_process_cq(struct nvme_queue *nvmeq)
> +static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int
> *tag)
>  {
>  	u16 head, phase;
> 
> @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
>  			head = 0;
>  			phase = !phase;
>  		}
> +		if (tag && *tag == cqe.command_id)
> +			*tag = -1;
>  		ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
>  		fn(nvmeq, ctx, &cqe);
>  	}

NVMe completion queue entries are 16 bytes long.  Although the
device is most likely to write bytes 0..15 in a single PCIe Memory
Write transaction, it is permitted to write those bytes in any order.
It could thus update the command identifier before the other bytes,
causing this code to process stale, invalid values in the other
fields.

When using interrupts, the MSI-X interrupt ensures the whole entry
is updated first, since the interrupt is itself delivered as a PCIe
Memory Write transaction and upstream writes do not pass earlier
upstream writes.

The existing interrupt handler loops looking for additional
completions, so it is susceptible to this same problem - it's
just less likely.  The only exposure is completions that arrive
while the CPU is already in the interrupt handler.

---
Robert Elliott, HPE Persistent Memory
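
To make the hazard concrete, here is the polled-completion path from
the patch, annotated with the worst-case interleaving described above.
The field names follow struct nvme_completion; the interleaving is
hypothetical, not observed behavior.

/*
 * Hypothetical worst-case ordering if the device may write the
 * 16-byte CQE's Dwords in any order:
 *
 *   1. device writes Dword 3 (command_id + status/phase)
 *   2. host copies the CQE and matches command_id against the tag
 *   3. host consumes cqe.result / cqe.sq_head -- still STALE
 *   4. device writes Dwords 0..2 -- too late
 */
struct nvme_completion cqe = nvmeq->cqes[head]; /* copy may tear */

if (tag && *tag == cqe.command_id)  /* Dword 3 already valid */
    *tag = -1;
ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
fn(nvmeq, ctx, &cqe);               /* may see stale Dwords 0..2 */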



Keith Busch Nov. 7, 2015, 12:58 a.m. UTC | #2
On Fri, Nov 06, 2015 at 03:46:07PM -0800, Elliott, Robert (Persistent Memory) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Jens Axboe
> > Sent: Friday, November 6, 2015 11:20 AM
> ...
> > Subject: [PATCH 4/5] NVMe: add blk polling support
> >
> > Add nvme_poll(), which will check a specific completion queue for
> > command completions. Wire that up to the new block layer poll
> > mechanism.
> >
> > Later on we'll set up specific sq/cq pairs that don't have interrupts
> > enabled, so we can do more efficient polling. As of this patch, an
> > IRQ will still trigger on command completion.
> ...
> > -static int nvme_process_cq(struct nvme_queue *nvmeq)
> > +static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int
> > *tag)
> >  {
> >       u16 head, phase;
> >
> > @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
> >                       head = 0;
> >                       phase = !phase;
> >               }
> > +             if (tag && *tag == cqe.command_id)
> > +                     *tag = -1;
> >               ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
> >               fn(nvmeq, ctx, &cqe);
> >       }
> 
> NVMe completion queue entries are 16 bytes long.  Although the
> device is most likely to write bytes 0..15 in a single PCIe Memory
> Write transaction, it is permitted to write those bytes in any order.
> It could thus update the command identifier before the other bytes,
> causing this code to process stale, invalid values in the other
> fields.

That's a very interesting point. We are okay if we can rely on the
phase bit, which, by my reading of the spec, we can: interrupt
coalescing would not work if the driver could observe a new phase in
a partially written CQE.
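
The phase-bit check Keith refers to appears in nvme_poll() in the
patch below. Note that even with a device-side ordering guarantee,
the host still needs a read barrier so the CPU cannot speculatively
load the other CQE fields before the phase check; dma_rmb() is the
kernel primitive for this. A sketch of the combined pattern (the
helper name is illustrative):

/*
 * Sketch of a phase-gated CQE read.  The phase-bit check tells us
 * Dword 3 has been written; dma_rmb() keeps the CPU from having
 * speculatively read Dwords 0..2 beforehand.  This only helps if
 * the device guarantees it writes Dword 3 last -- the device-side
 * ordering question raised in this thread.
 */
static inline bool nvme_cqe_seen_sketch(struct nvme_queue *nvmeq, u16 head)
{
    if ((le16_to_cpu(nvmeq->cqes[head].status) & 1) != nvmeq->cq_phase)
        return false;   /* entry not (fully) posted yet */
    dma_rmb();          /* order the phase read before field reads */
    return true;
}
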
Elliott, Robert (Servers) Nov. 12, 2015, 1:40 a.m. UTC | #3
> -----Original Message-----
> From: Keith Busch [mailto:keith.busch@intel.com]
> Sent: Friday, November 6, 2015 6:58 PM
> Subject: Re: [PATCH 4/5] NVMe: add blk polling support
> 
> On Fri, Nov 06, 2015 at 03:46:07PM -0800, Elliott, Robert wrote:
> > > -----Original Message-----
> > > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > > owner@vger.kernel.org] On Behalf Of Jens Axboe
> > > Sent: Friday, November 6, 2015 11:20 AM
> > ...
> > > Subject: [PATCH 4/5] NVMe: add blk polling support
> > >
> > > @@ -953,6 +953,8 @@ static int nvme_process_cq(struct nvme_queue
> *nvmeq)
> > >                       head = 0;
> > >                       phase = !phase;
> > >               }
> > > +             if (tag && *tag == cqe.command_id)
> > > +                     *tag = -1;
> > >               ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
> > >               fn(nvmeq, ctx, &cqe);
> > >       }
> >
> > NVMe completion queue entries are 16 bytes long.  Although the
> > device is most likely to write bytes 0..15 in a single PCIe Memory
> > Write transaction, it is permitted to write those bytes in any order.
> > It could thus update the command identifier before the other bytes,
> > causing this code to process stale, invalid values in the other
> > fields.
> 
> That's a very interesting point. We are okay if we can rely on the
> phase bit, which, by my reading of the spec, we can: interrupt
> coalescing would not work if the driver could observe a new phase in
> a partially written CQE.

The Phase bit is in the same Dword as the Command Identifier, so
it doesn't help.  If that Dword were written first, then the
preceding three Dwords might not be valid yet.
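
For reference, the driver's view of the 16-byte CQE, as defined in
the kernel headers of this era (Dword annotations added here), shows
why the Phase Tag cannot cover the earlier Dwords: it lives in bit 0
of status, in the same Dword as command_id.

struct nvme_completion {
    __le32  result;     /* Dword 0: command-specific result */
    __u32   rsvd;       /* Dword 1: reserved */
    __le16  sq_head;    /* Dword 2: SQ head pointer */
    __le16  sq_id;      /* Dword 2: submission queue id */
    __u16   command_id; /* Dword 3: id of completed command */
    __le16  status;     /* Dword 3: status; bit 0 = Phase Tag */
};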

I'll ask our NVMe representative to propose wording like this:

  4.6 Completion Queue Entry
  ...
  The controller shall write the Dwords in a CQE from lowest to highest 
  (e.g., if the CQE is 16 bytes, write Dword 0, then Dword 1, then
  Dword 2, then Dword 3).  This ensures that the first three Dwords are
  valid when the Phase Tag and Command Identifier are valid. Additional
  Dwords, if any, are not valid at that time.

and refer to that rule in 7.1 where host processing of completions
is discussed.

---
Robert Elliott, HPE Persistent Memory



Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..4a715f49f5db 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -90,7 +90,7 @@  static struct class *nvme_class;
 
 static int __nvme_reset(struct nvme_dev *dev);
 static int nvme_reset(struct nvme_dev *dev);
-static int nvme_process_cq(struct nvme_queue *nvmeq);
+static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dead_ctrl(struct nvme_dev *dev);
 
 struct async_cmd_info {
@@ -935,7 +935,7 @@  static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_MQ_RQ_QUEUE_BUSY;
 }
 
-static int nvme_process_cq(struct nvme_queue *nvmeq)
+static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
 {
 	u16 head, phase;
 
@@ -953,6 +953,8 @@  static int nvme_process_cq(struct nvme_queue *nvmeq)
 			head = 0;
 			phase = !phase;
 		}
+		if (tag && *tag == cqe.command_id)
+			*tag = -1;
 		ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
 		fn(nvmeq, ctx, &cqe);
 	}
@@ -964,14 +966,18 @@  static int nvme_process_cq(struct nvme_queue *nvmeq)
 	 * a big problem.
 	 */
 	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
-		return 0;
+		return;
 
 	writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
 	nvmeq->cqe_seen = 1;
-	return 1;
+}
+
+static void nvme_process_cq(struct nvme_queue *nvmeq)
+{
+	__nvme_process_cq(nvmeq, NULL);
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
@@ -995,6 +1001,23 @@  static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_WAKE_THREAD;
 }
 
+static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+{
+	struct nvme_queue *nvmeq = hctx->driver_data;
+
+	if ((le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status) & 1) ==
+	    nvmeq->cq_phase) {
+		spin_lock_irq(&nvmeq->q_lock);
+		__nvme_process_cq(nvmeq, &tag);
+		spin_unlock_irq(&nvmeq->q_lock);
+
+		if (tag == -1)
+			return 1;
+	}
+
+	return 0;
+}
+
 /*
  * Returns 0 on success.  If the result is negative, it's a Linux error code;
  * if the result is positive, it's an NVM Express status code
@@ -1654,6 +1677,7 @@  static struct blk_mq_ops nvme_mq_ops = {
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
+	.poll		= nvme_poll,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)