
[1/5,v2] blk-mq: Add prep/unprep support

Message ID 1429101284-19490-2-git-send-email-m@bjorling.me (mailing list archive)
State New, archived

Commit Message

Matias Bjørling April 15, 2015, 12:34 p.m. UTC
Allow users to hook into prep/unprep functions just before an IO is
dispatched to the device driver. This is necessary for request-based
logic to take place at upper layers.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  1 +
 2 files changed, 27 insertions(+), 2 deletions(-)
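
For illustration, a minimal sketch of how a consumer might wire up the
hooks this patch honours. It assumes the existing
blk_queue_prep_rq()/blk_queue_unprep_rq() setters are reused for
q->prep_rq_fn/q->unprep_rq_fn; BLK_MQ_RQ_QUEUE_DONE is added by this
patch, and the my_* names and remapping step are hypothetical:

    static int my_prep_rq(struct request_queue *q, struct request *rq)
    {
            if (my_remap(rq))                       /* hypothetical per-request setup */
                    return BLK_MQ_RQ_QUEUE_ERROR;   /* end the IO with an error */
            rq->cmd_flags |= REQ_DONTPREP;          /* don't prep again on requeue */
            return BLK_MQ_RQ_QUEUE_OK;              /* continue to ->queue_rq() */
    }

    static void my_unprep_rq(struct request_queue *q, struct request *rq)
    {
            my_unmap(rq);                           /* undo whatever prep set up */
    }

    /* at queue initialization */
    blk_queue_prep_rq(q, my_prep_rq);
    blk_queue_unprep_rq(q, my_unprep_rq);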

Comments

Christoph Hellwig April 17, 2015, 6:34 a.m. UTC | #1
On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
> Allow users to hook into prep/unprep functions just before an IO is
> dispatched to the device driver. This is necessary for request-based
> logic to take place at upper layers.

I don't think any of this logic belongs into the block layer.  All this
should be library functions called by the drivers.

Matias Bjørling April 17, 2015, 8:15 a.m. UTC | #2
On 04/17/2015 08:34 AM, Christoph Hellwig wrote:
> On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
>> Allow users to hook into prep/unprep functions just before an IO is
>> dispatched to the device driver. This is necessary for request-based
>> logic to take place at upper layers.
>
> I don't think any of this logic belongs into the block layer.  All this
> should be library functions called by the drivers.
>

Just the prep/unprep, or other pieces as well?

I like that struct request_queue has a ref to struct nvm_dev, and that 
the variables in struct request and struct bio needed to get to it live 
in the block layer.

In the future, applications can have an API to get/put flash blocks 
directly (using the blk_nvm_[get/put]_blk interface).
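
(The exact signatures aren't settled; roughly, the shape would be
something like the following, where nvm_dev/nvm_block and the lun
argument are illustrative only:)

    struct nvm_block *blk_nvm_get_blk(struct nvm_dev *dev, int lun);
    void blk_nvm_put_blk(struct nvm_dev *dev, struct nvm_block *blk);
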
Christoph Hellwig April 17, 2015, 5:46 p.m. UTC | #3
On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
> Just the prep/unprep, or other pieces as well?

All of it - it's functionality that lies logically below the block
layer, so that's where it should be handled.

In fact it should probably work similarly to the mtd subsystem - that 
is, have its own API for low-level drivers, and just export a block 
driver as one consumer on the top side.

> In the future, applications can have an API to get/put flash blocks directly
> (using the blk_nvm_[get/put]_blk interface).

s/application/filesystem/?
Matias Bjørling April 18, 2015, 6:45 a.m. UTC | #4
On 17-04-2015 19:46, Christoph Hellwig wrote:
> On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
>> Just the prep/unprep, or other pieces as well?
>
> All of it - it's functionality that lies logically below the block
> layer, so that's where it should be handled.
>
> In fact it should probably work similarly to the mtd subsystem - that
> is, have its own API for low-level drivers, and just export a block
> driver as one consumer on the top side.

The low-level drivers will be NVMe and vendors' own PCI-e drivers. They 
are very generic in nature, so each driver would duplicate the same 
work, and both could have normal and open-channel drives attached.

I'd like to keep blk-mq in the loop. I don't think it will be pretty to 
have two data paths in the drivers. For blk-mq, bios are split/merged 
on the way down, so the actual physical addresses aren't known before 
the IO is diced to the right size.
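
(Sketch only - nvm_map() is made up, and writing rq->__sector directly
is just for illustration - but with the prep hook the mapping can
happen once the request has its final size:)

    static int nvm_prep_rq(struct request_queue *q, struct request *rq)
    {
            /* rq has already been split/merged here, so its size is final */
            sector_t phys = nvm_map(q, blk_rq_pos(rq), blk_rq_sectors(rq));

            if (phys == (sector_t)-1)
                    return BLK_MQ_RQ_QUEUE_BUSY;    /* no mapping yet; requeue */

            rq->__sector = phys;                    /* dispatch to the physical address */
            return BLK_MQ_RQ_QUEUE_OK;
    }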

The reason it shouldn't be under a single block device is that a target 
should be able to provide a global address space. That allows the 
address space to grow/shrink dynamically with the disks: a continuously 
growing address space where disks can be added/removed as requirements 
grow or flash ages - not on a sector level, but on a flash block level.

>
>> In the future, applications can have an API to get/put flash blocks directly
>> (using the blk_nvm_[get/put]_blk interface).
>
> s/application/filesystem/?
>

Applications. The goal is that key-value stores, e.g. RocksDB, 
Aerospike, Ceph and similar, have direct access to flash storage. There 
won't be a kernel file system in between.

The get/put interface can be seen as a space reservation interface for 
where a given process is allowed to access the storage media.

It can also be seen this way: we provide a block allocator in the 
kernel, while applications implement the rest of the "file system" in 
user-space, specially optimized for their data structures. This makes a 
lot of sense for a small subset (LSM, Fractal trees, etc.) of database 
applications.
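
(Purely illustrative - the device node, ioctl numbers and struct below
are all made up - but the application-facing side could look roughly
like:)

    /* hypothetical user-space view of the get/put reservation interface */
    struct nvm_blk_req req = { .lun = 0 };
    int fd = open("/dev/nvm0", O_RDWR);

    if (ioctl(fd, NVM_GET_BLK, &req) == 0) {        /* reserve a flash block */
            /* application lays out its own structures at req.blk_addr ... */
            ioctl(fd, NVM_PUT_BLK, &req);           /* return it to the allocator */
    }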

Christoph Hellwig April 18, 2015, 8:16 p.m. UTC | #5
On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
> The low-level drivers will be NVMe and vendors' own PCI-e drivers. They are
> very generic in nature, so each driver would duplicate the same work, and
> both could have normal and open-channel drives attached.

I didn't say the work should move into the driver, but rather that the
driver should talk to the open-channel SSD code directly instead of
hooking into the core block code.

> I'd like to keep blk-mq in the loop. I don't think it will be pretty to
> have two data paths in the drivers. For blk-mq, bios are split/merged on
> the way down, so the actual physical addresses aren't known before the IO
> is diced to the right size.

But you _do_ have two different data paths already.  Nothing says you
can't use blk-mq for your data path, but it should be a separate entry
point - similar to, say, how a SCSI disk and an MMC device both use the
block layer but still use different entry points.

> The reason it shouldn't be under a single block device is that a target
> should be able to provide a global address space. That allows the address
> space to grow/shrink dynamically with the disks: a continuously growing
> address space where disks can be added/removed as requirements grow or
> flash ages - not on a sector level, but on a flash block level.

I don't understand what you mean with a single block device here, but I
suspect we're talking past each other somehow.

> >>In the future, applications can have an API to get/put flash blocks directly
> >>(using the blk_nvm_[get/put]_blk interface).
> >
> >s/application/filesystem/?
> >
> 
> Applications. The goal is that key-value stores, e.g. RocksDB, Aerospike,
> Ceph and similar, have direct access to flash storage. There won't be a
> kernel file system in between.
> 
> The get/put interface can be seen as a space reservation interface for where
> a given process is allowed to access the storage media.
> 
> It can also be seen this way: we provide a block allocator in the kernel,
> while applications implement the rest of the "file system" in user-space,
> specially optimized for their data structures. This makes a lot of sense
> for a small subset (LSM, Fractal trees, etc.) of database applications.

While we'll need a proper API for that first, it's just another reason
why we shouldn't shoehorn the open-channel SSD support into the block
layer.
Matias Bjørling April 19, 2015, 6:12 p.m. UTC | #6
> On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
<snip>
>> The reason it shouldn't be under a single block device is that a target
>> should be able to provide a global address space. That allows the address
>> space to grow/shrink dynamically with the disks: a continuously growing
>> address space where disks can be added/removed as requirements grow or
>> flash ages - not on a sector level, but on a flash block level.
>
> I don't understand what you mean with a single block device here, but I
> suspect we're talking past each other somehow.

Sorry. I meant that several block devices should form a single address 
space (exposed as a single block device), consisting of all the flash 
blocks. Applications could then get/put from that.

Thanks for your feedback. I'll push the pieces around and make the 
integration self-contained outside of the block layer.


Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33c4285..f3dd028 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -338,6 +338,11 @@  EXPORT_SYMBOL(__blk_mq_end_request);
 
 void blk_mq_end_request(struct request *rq, int error)
 {
+	struct request_queue *q = rq->q;
+
+	if (q->unprep_rq_fn)
+		q->unprep_rq_fn(q, rq);
+
 	if (blk_update_request(rq, error, blk_rq_bytes(rq)))
 		BUG();
 	__blk_mq_end_request(rq, error);
@@ -753,6 +758,17 @@  static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	}
 }
 
+static int blk_mq_prep_rq(struct request_queue *q, struct request *rq)
+{
+	if (!q->prep_rq_fn)
+		return 0;
+
+	if (rq->cmd_flags & REQ_DONTPREP)
+		return 0;
+
+	return q->prep_rq_fn(q, rq);
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -812,11 +828,15 @@  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		bd.list = dptr;
 		bd.last = list_empty(&rq_list);
 
-		ret = q->mq_ops->queue_rq(hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(hctx, &bd);
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
 			continue;
+		case BLK_MQ_RQ_QUEUE_DONE:
+			continue;
 		case BLK_MQ_RQ_QUEUE_BUSY:
 			list_add(&rq->queuelist, &rq_list);
 			__blk_mq_requeue_request(rq);
@@ -1270,10 +1290,14 @@  static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * error (busy), just add it to our list as we previously
 		 * would have done
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(data.hctx, &bd);
 		if (ret == BLK_MQ_RQ_QUEUE_OK)
 			goto done;
 		else {
+			if (ret == BLK_MQ_RQ_QUEUE_DONE)
+				goto done;
 			__blk_mq_requeue_request(rq);
 
 			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7aec861..d7b39af 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@  enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,