[1/5,v2] blk-mq: Add prep/unprep support

Message ID	1429101284-19490-2-git-send-email-m@bjorling.me (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: =?UTF-8?q?Matias=20Bj=C3=B8rling?= <m@bjorling.me> To: hch@infradead.org, axboe@fb.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org Cc: javier@paletta.io, keith.busch@intel.com, =?UTF-8?q?Matias=20Bj=C3=B8rling?= <m@bjorling.me> Subject: [PATCH 1/5 v2] blk-mq: Add prep/unprep support Date: Wed, 15 Apr 2015 14:34:40 +0200 Message-Id: <1429101284-19490-2-git-send-email-m@bjorling.me> In-Reply-To: <1429101284-19490-1-git-send-email-m@bjorling.me> References: <1429101284-19490-1-git-send-email-m@bjorling.me> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Matias Bjørling April 15, 2015, 12:34 p.m. UTC

Allow users to hook into prep/unprep functions just before an IO is
dispatched to the device driver. This is necessary for request-based
logic to take place at upper layers.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  1 +
 2 files changed, 27 insertions(+), 2 deletions(-)

Christoph Hellwig April 17, 2015, 6:34 a.m. UTC | #1

On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bj??rling wrote:
> Allow users to hook into prep/unprep functions just before an IO is
> dispatched to the device driver. This is necessary for request-based
> logic to take place at upper layers.

I don't think any of this logic belongs into the block layer.  All this
should be library functions called by the drivers.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Matias Bjørling April 17, 2015, 8:15 a.m. UTC | #2

On 04/17/2015 08:34 AM, Christoph Hellwig wrote:
> On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bj??rling wrote:
>> Allow users to hook into prep/unprep functions just before an IO is
>> dispatched to the device driver. This is necessary for request-based
>> logic to take place at upper layers.
>
> I don't think any of this logic belongs into the block layer.  All this
> should be library functions called by the drivers.
>

Just the prep/unprep, or other pieces as well?

I like that struct request_queue has a ref to struct nvm_dev, and the 
variables in request and bio to get to the struct is in the block layer.

In the future, applications can have an API to get/put flash block 
directly. (using the blk_nvm_[get/put]_blk interface).
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig April 17, 2015, 5:46 p.m. UTC | #3

On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bj?rling wrote:
> Just the prep/unprep, or other pieces as well?

All of it - it's functionality that lies logically below the block
layer, so that's where it should be handled.

In fact it should probably work similar to the mtd subsystem - that is
have it's own API for low level drivers, and just export a block driver
as one consumer on the top side.

> In the future, applications can have an API to get/put flash block directly.
> (using the blk_nvm_[get/put]_blk interface).

s/application/filesystem/?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Matias Bjørling April 18, 2015, 6:45 a.m. UTC | #4

Den 17-04-2015 kl. 19:46 skrev Christoph Hellwig:
> On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bj?rling wrote:
>> Just the prep/unprep, or other pieces as well?
>
> All of it - it's functionality that lies logically below the block
> layer, so that's where it should be handled.
>
> In fact it should probably work similar to the mtd subsystem - that is
> have it's own API for low level drivers, and just export a block driver
> as one consumer on the top side.

The low level drivers will be NVMe and vendor's own PCI-e drivers. It's 
very generic in their nature. Each driver would duplicate the same work. 
Both could have normal and open-channel drives attached.

I'll like to keep blk-mq in the loop. I don't think it will be pretty to 
have two data paths in the drivers. For blk-mq, bios are splitted/merged 
on the way down. Thus, the actual physical addresses needs aren't known 
before the IO is diced to the right size.

The reason it shouldn't be under the a single block device, is that a 
target should be able to provide a global address space. That allows the 
address space to grow/shrink dynamically with the disks. Allowing a 
continuously growing address space, where disks can be added/removed as 
requirements grow or flash ages. Not on a sector level, but on a flash 
block level.

>
>> In the future, applications can have an API to get/put flash block directly.
>> (using the blk_nvm_[get/put]_blk interface).
>
> s/application/filesystem/?
>

Applications. The goal is that key value stores, e.g. RocksDB, 
Aerospike, Ceph and similar have direct access to flash storage. There 
won't be a kernel file-system between.

The get/put interface can be seen as a space reservation interface for 
where a given process is allowed to access the storage media.

It can also be seen in the way that we provide a block allocator in the 
kernel, while applications implement the rest of "file-system" in 
user-space, specially optimized for their data structures. This makes a 
lot of sense for a small subset (LSM, Fractal trees, etc.) of database 
applications.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig April 18, 2015, 8:16 p.m. UTC | #5

On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
> The low level drivers will be NVMe and vendor's own PCI-e drivers. It's very
> generic in their nature. Each driver would duplicate the same work. Both
> could have normal and open-channel drives attached.

I didn't say the work should move into the driver, bur rather that
driver should talk to the open channel ssd code directly instead of
hooking into the core block code.

> I'll like to keep blk-mq in the loop. I don't think it will be pretty to
> have two data paths in the drivers. For blk-mq, bios are splitted/merged on
> the way down. Thus, the actual physical addresses needs aren't known before
> the IO is diced to the right size.

But you _do_ have two different data path already.  Nothing says you
can't use blk-mq for your data path, ut it should be a separate entry
point.  Similar to say how a SCSI disk and MMC device both use the block
layer but still use different entry points.

> The reason it shouldn't be under the a single block device, is that a target
> should be able to provide a global address space.
> That allows the address
> space to grow/shrink dynamically with the disks. Allowing a continuously
> growing address space, where disks can be added/removed as requirements grow
> or flash ages. Not on a sector level, but on a flash block level.

I don't understand what you mean with a single block device here, but I
suspect we're talking past each other somehow.

> >>In the future, applications can have an API to get/put flash block directly.
> >>(using the blk_nvm_[get/put]_blk interface).
> >
> >s/application/filesystem/?
> >
> 
> Applications. The goal is that key value stores, e.g. RocksDB, Aerospike,
> Ceph and similar have direct access to flash storage. There won't be a
> kernel file-system between.
> 
> The get/put interface can be seen as a space reservation interface for where
> a given process is allowed to access the storage media.
> 
> It can also be seen in the way that we provide a block allocator in the
> kernel, while applications implement the rest of "file-system" in
> user-space, specially optimized for their data structures. This makes a lot
> of sense for a small subset (LSM, Fractal trees, etc.) of database
> applications.

While we'll need a proper API for that first it's just another reason of
why we shouldnt shoe horn the open channel ssd support into the block
layer.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Matias Bjørling April 19, 2015, 6:12 p.m. UTC | #6

> On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
<snip>
>> The reason it shouldn't be under the a single block device, is that a target
>> should be able to provide a global address space.
>> That allows the address
>> space to grow/shrink dynamically with the disks. Allowing a continuously
>> growing address space, where disks can be added/removed as requirements grow
>> or flash ages. Not on a sector level, but on a flash block level.
>
> I don't understand what you mean with a single block device here, but I
> suspect we're talking past each other somehow.

Sorry. I meant that several block devices should form a single address 
space (exposed as a single block device), consisting of all the flash 
blocks. Applications could then get/put from that.

Thanks for your feedback. I'll push the pieces around and make the 
integration self-contained outside of the block layer.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/5,v2] blk-mq: Add prep/unprep support

Commit Message

Comments

Patch