From patchwork Wed Sep 18 10:18:14 2024
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 13806774
From: Dongsheng Yang
To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com,
    John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com,
    chaitanyak@nvidia.com, rdunlap@infradead.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org,
    Dongsheng Yang
Subject: [PATCH v2 1/8] cbd: introduce cbd_transport
Date: Wed, 18 Sep 2024 10:18:14 +0000
Message-Id: <20240918101821.681118-2-dongsheng.yang@linux.dev>
In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev>
References: <20240918101821.681118-1-dongsheng.yang@linux.dev>

cbd_transport represents the layout of the entire shared memory, as shown
below.

+-------------------------------------------------------------------------------------------------------------------------------+
|                                                         cbd transport                                                          |
+--------------------+-----------------------+-----------------------+----------------------+-----------------------------------+
|                    |         hosts         |       backends        |       blkdevs        |             channels              |
| cbd transport info +----+----+----+--------+----+----+----+--------+----+----+----+-------+-------+-------+-------+-----------+
|                    |    |    |    |  ...   |    |    |    |  ...   |    |    |    |  ...  |       |       |       |    ...    |
+--------------------+----+----+----+--------+----+----+----+--------+----+----+----+-------+---+---+---+---+-------+-----------+

Each entry in the channels area above is one segment of the transport. A
channel segment is laid out as follows:

+-----------------------------------------------------------+
|                      channel segment                      |
+--------------------+--------------------------------------+
|    channel meta    |             channel data             |
+---------+----------+--------------------------------------+

and its channel meta region is further divided into:

+----------------------------------------------------------+
|                       channel meta                       |
+-----------+--------------+-------------------------------+
| meta ctrl |  comp ring   |           cmd ring            |
+-----------+--------------+-------------------------------+

A cache segment, the other type of segment, is laid out as follows:

+----------------------------------------------------------+
|                       cache segment                      |
+-----------+----------------------------------------------+
|   info    |                     data                     |
+-----------+----------------------------------------------+

The shared memory is divided into five regions:

a) Transport_info:
   Information about the overall transport, including the layout of the
   transport itself.

b) Hosts:
   Each host wishing to use this transport needs to register its own
   information in a host entry in this region.

c) Backends:
   Starting a backend on a host requires filling in information in a
   backend entry within this region.

d) Blkdevs:
   Once a backend is established, it can be mapped to a CBD device on any
   associated host. The information about these blkdevs is then filled
   into the blkdevs region.

e) Segments:
   This is the actual data communication area, where communication between
   blkdev and backend occurs. Each queue of a block device uses a channel,
   and each backend has a corresponding handler interacting with this
   queue.

f) Channel segment:
   A channel is one type of segment and is further divided into meta and
   data regions. The meta region contains the subm (submission) ring and
   the comp (completion) ring. The blkdev converts upper-layer requests
   into cbd_se entries and fills them into the subm ring. The handler
   takes the cbd_se entries from the subm ring and sends them to the
   backend's local block device (e.g., sda). After completion, the results
   are packed into cbd_ce entries and filled into the comp ring. The
   blkdev then receives the cbd_ce entries and returns the results to the
   upper-layer IO sender.

g) Cache segment:
   A cache segment is another type of segment. When caching is enabled for
   a backend, the transport allocates cache segments to that backend.
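
To make the ring handling in f) concrete, here is a small user-space sketch
(for illustration only, not part of this patch) of the head/tail arithmetic
used on the submission and completion rings. It mirrors the
CBDC_UPDATE_SUBMR_HEAD()/CBDC_UPDATE_SUBMR_TAIL() macros in cbd_internal.h;
the ring size, the 64-byte entry size and the ring_advance() helper below are
made-up illustrative values and names, not the driver's real constants:

  #include <stdio.h>
  #include <stdint.h>

  #define RING_SIZE  4096   /* illustrative only, not CBDC_SUBMR_SIZE */
  #define ENTRY_SIZE 64     /* pretend each cbd_se occupies 64 bytes */

  /* Same arithmetic as CBDC_UPDATE_SUBMR_HEAD()/_TAIL(): advance and wrap. */
  static uint32_t ring_advance(uint32_t idx, uint32_t used, uint32_t size)
  {
          return ((idx % size) + used) % size;
  }

  int main(void)
  {
          uint32_t head = 0, tail = 0;
          int i;

          /* blkdev side: queue three cbd_se entries, advancing the head */
          for (i = 0; i < 3; i++)
                  head = ring_advance(head, ENTRY_SIZE, RING_SIZE);

          /* handler side: consume one entry, advancing the tail */
          tail = ring_advance(tail, ENTRY_SIZE, RING_SIZE);

          printf("head=%u tail=%u\n", head, tail);
          return 0;
  }

In the driver, the same update is applied to submr_head/submr_tail (and
compr_head/compr_tail) in the shared channel meta, so the blkdev and the
handler, possibly running on different hosts, observe consistent ring
positions.
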
Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_internal.h | 1193 +++++++++++++++++++++++++++++ drivers/block/cbd/cbd_transport.c | 957 +++++++++++++++++++++++ 2 files changed, 2150 insertions(+) create mode 100644 drivers/block/cbd/cbd_internal.h create mode 100644 drivers/block/cbd/cbd_transport.c diff --git a/drivers/block/cbd/cbd_internal.h b/drivers/block/cbd/cbd_internal.h new file mode 100644 index 000000000000..236dbcb62906 --- /dev/null +++ b/drivers/block/cbd/cbd_internal.h @@ -0,0 +1,1193 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _CBD_INTERNAL_H +#define _CBD_INTERNAL_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * CBD (CXL Block Device) provides two usage scenarios: single-host and multi-hosts. + * + * (1) Single-host scenario, CBD can use a pmem device as a cache for block devices, + * providing a caching mechanism specifically designed for persistent memory. + * + * +-----------------------------------------------------------------+ + * | single-host | + * +-----------------------------------------------------------------+ + * | | + * | | + * | | + * | | + * | | + * | +-----------+ +------------+ | + * | | /dev/cbd0 | | /dev/cbd1 | | + * | | | | | | + * | +---------------------|-----------|-----|------------|-------+ | + * | | | | | | | | + * | | /dev/pmem0 | cbd0 cache| | cbd1 cache | | | + * | | | | | | | | + * | +---------------------|-----------|-----|------------|-------+ | + * | |+---------+| |+----------+| | + * | ||/dev/sda || || /dev/sdb || | + * | |+---------+| |+----------+| | + * | +-----------+ +------------+ | + * +-----------------------------------------------------------------+ + * + * (2) Multi-hosts scenario, CBD also provides a cache while taking advantage of + * shared memory features, allowing users to access block devices on other nodes across + * different hosts. + * + * As shared memory is supported in CXL3.0 spec, we can transfer data via CXL shared memory. + * CBD use CXL shared memory to transfer data between node-1 and node-2. + * + * +--------------------------------------------------------------------------------------------------------+ + * | multi-hosts | + * +--------------------------------------------------------------------------------------------------------+ + * | | + * | | + * | +-------------------------------+ +------------------------------------+ | + * | | node-1 | | node-2 | | + * | +-------------------------------+ +------------------------------------+ | + * | | | | | | + * | | +-------+ +---------+ | | + * | | | cbd0 | | backend0+------------------+ | | + * | | +-------+ +---------+ | | | + * | | | pmem0 | | pmem0 | v | | + * | | +-------+-------+ +---------+----+ +---------------+ | + * | | | cxl driver | | cxl driver | | /dev/sda | | + * | +---------------+--------+------+ +-----+--------+-----+---------------+ | + * | | | | + * | | | | + * | | CXL CXL | | + * | +----------------+ +-----------+ | + * | | | | + * | | | | + * | | | | + * | +-------------------------+---------------+--------------------------+ | + * | | +---------------+ | | + * | | shared memory device | cbd0 cache | | | + * | | +---------------+ | | + * | +--------------------------------------------------------------------+ | + * | | + * +--------------------------------------------------------------------------------------------------------+ + */ + +#define cbd_err(fmt, ...) 
\ + pr_err("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) +#define cbd_info(fmt, ...) \ + pr_info("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) +#define cbd_debug(fmt, ...) \ + pr_debug("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) + +#define cbdt_err(transport, fmt, ...) \ + cbd_err("cbd_transport%u: " fmt, \ + transport->id, ##__VA_ARGS__) +#define cbdt_info(transport, fmt, ...) \ + cbd_info("cbd_transport%u: " fmt, \ + transport->id, ##__VA_ARGS__) +#define cbdt_debug(transport, fmt, ...) \ + cbd_debug("cbd_transport%u: " fmt, \ + transport->id, ##__VA_ARGS__) + +#define cbdb_err(backend, fmt, ...) \ + cbdt_err(backend->cbdt, "backend%d: " fmt, \ + backend->backend_id, ##__VA_ARGS__) +#define cbdb_info(backend, fmt, ...) \ + cbdt_info(backend->cbdt, "backend%d: " fmt, \ + backend->backend_id, ##__VA_ARGS__) +#define cbdbdebug(backend, fmt, ...) \ + cbdt_debug(backend->cbdt, "backend%d: " fmt, \ + backend->backend_id, ##__VA_ARGS__) + +#define cbd_handler_err(handler, fmt, ...) \ + cbdb_err(handler->cbdb, "handler%d: " fmt, \ + handler->channel.seg_id, ##__VA_ARGS__) +#define cbd_handler_info(handler, fmt, ...) \ + cbdb_info(handler->cbdb, "handler%d: " fmt, \ + handler->channel.seg_id, ##__VA_ARGS__) +#define cbd_handler_debug(handler, fmt, ...) \ + cbdb_debug(handler->cbdb, "handler%d: " fmt, \ + handler->channel.seg_id, ##__VA_ARGS__) + +#define cbd_blk_err(dev, fmt, ...) \ + cbdt_err(dev->cbdt, "cbd%d: " fmt, \ + dev->mapped_id, ##__VA_ARGS__) +#define cbd_blk_info(dev, fmt, ...) \ + cbdt_info(dev->cbdt, "cbd%d: " fmt, \ + dev->mapped_id, ##__VA_ARGS__) +#define cbd_blk_debug(dev, fmt, ...) \ + cbdt_debug(dev->cbdt, "cbd%d: " fmt, \ + dev->mapped_id, ##__VA_ARGS__) + +#define cbd_queue_err(queue, fmt, ...) \ + cbd_blk_err(queue->cbd_blkdev, "queue%d: " fmt, \ + queue->channel.seg_id, ##__VA_ARGS__) +#define cbd_queue_info(queue, fmt, ...) \ + cbd_blk_info(queue->cbd_blkdev, "queue%d: " fmt, \ + queue->channel.seg_id, ##__VA_ARGS__) +#define cbd_queue_debug(queue, fmt, ...) \ + cbd_blk_debug(queue->cbd_blkdev, "queue%d: " fmt, \ + queue->channel.seg_id, ##__VA_ARGS__) + +#define cbd_channel_err(channel, fmt, ...) \ + cbdt_err(channel->cbdt, "channel%d: " fmt, \ + channel->seg_id, ##__VA_ARGS__) +#define cbd_channel_info(channel, fmt, ...) \ + cbdt_info(channel->cbdt, "channel%d: " fmt, \ + channel->seg_id, ##__VA_ARGS__) +#define cbd_channel_debug(channel, fmt, ...) \ + cbdt_debug(channel->cbdt, "channel%d: " fmt, \ + channel->seg_id, ##__VA_ARGS__) + +#define cbd_cache_err(cache, fmt, ...) \ + cbdt_err(cache->cbdt, "cache%d: " fmt, \ + cache->cache_id, ##__VA_ARGS__) +#define cbd_cache_info(cache, fmt, ...) \ + cbdt_info(cache->cbdt, "cache%d: " fmt, \ + cache->cache_id, ##__VA_ARGS__) +#define cbd_cache_debug(cache, fmt, ...) 
\ + cbdt_debug(cache->cbdt, "cache%d: " fmt, \ + cache->cache_id, ##__VA_ARGS__) + +#define CBD_KB (1024) +#define CBD_MB (CBD_KB * CBD_KB) + +#define CBD_TRANSPORT_MAX 1024 +#define CBD_PATH_LEN 512 +#define CBD_NAME_LEN 32 + +#define CBD_QUEUES_MAX 128 + +#define CBD_PART_SHIFT 4 +#define CBD_DRV_NAME "cbd" +#define CBD_DEV_NAME_LEN 32 + +#define CBD_HB_INTERVAL msecs_to_jiffies(5000) /* 5s */ +#define CBD_HB_TIMEOUT (30 * 1000) /* 30s */ + +/* + * CBD transport layout: + * + * +-------------------------------------------------------------------------------------------------------------------------------+ + * | cbd transport | + * +--------------------+-----------------------+-----------------------+----------------------+-----------------------------------+ + * | | hosts | backends | blkdevs | channels | + * | cbd transport info +----+----+----+--------+----+----+----+--------+----+----+----+-------+-------+-------+-------+-----------+ + * | | | | | ... | | | | ... | | | | ... | | | | ... | + * +--------------------+----+----+----+--------+----+----+----+--------+----+----+----+-------+---+---+---+---+-------+-----------+ + * | | + * | | + * | | + * | | + * +-------------------------------------------------------------------------------------+ | + * | | + * | | + * v | + * +-----------------------------------------------------------+ | + * | channel segment | | + * +--------------------+--------------------------------------+ | + * | channel meta | channel data | | + * +---------+----------+--------------------------------------+ | + * | | + * | | + * | | + * v | + * +----------------------------------------------------------+ | + * | channel meta | | + * +-----------+--------------+-------------------------------+ | + * | meta ctrl | comp ring | cmd ring | | + * +-----------+--------------+-------------------------------+ | + * | + * | + * | + * +--------------------------------------------------------------------------------------------+ + * | + * | + * | + * v + * +----------------------------------------------------------+ + * | cache segment | + * +-----------+----------------------------------------------+ + * | info | data | + * +-----------+----------------------------------------------+ + */ + +/* cbd segment */ +#define CBDT_SEG_SIZE (16 * 1024 * 1024) + +/* cbd channel seg */ +#define CBDC_META_SIZE (4 * 1024 * 1024) +#define CBDC_SUBMR_RESERVED sizeof(struct cbd_se) +#define CBDC_CMPR_RESERVED sizeof(struct cbd_ce) + +#define CBDC_DATA_ALIGH 4096 +#define CBDC_DATA_RESERVED CBDC_DATA_ALIGH + +#define CBDC_CTRL_OFF 0 +#define CBDC_CTRL_SIZE PAGE_SIZE +#define CBDC_COMPR_OFF (CBDC_CTRL_OFF + CBDC_CTRL_SIZE) +#define CBDC_COMPR_SIZE (sizeof(struct cbd_ce) * 1024) +#define CBDC_SUBMR_OFF (CBDC_COMPR_OFF + CBDC_COMPR_SIZE) +#define CBDC_SUBMR_SIZE (CBDC_META_SIZE - CBDC_SUBMR_OFF) + +#define CBDC_DATA_OFF CBDC_META_SIZE +#define CBDC_DATA_SIZE (CBDT_SEG_SIZE - CBDC_META_SIZE) + +#define CBDC_UPDATE_SUBMR_HEAD(head, used, size) (head = ((head % size) + used) % size) +#define CBDC_UPDATE_SUBMR_TAIL(tail, used, size) (tail = ((tail % size) + used) % size) + +#define CBDC_UPDATE_COMPR_HEAD(head, used, size) (head = ((head % size) + used) % size) +#define CBDC_UPDATE_COMPR_TAIL(tail, used, size) (tail = ((tail % size) + used) % size) + +/* cbd transport */ +#define CBD_TRANSPORT_MAGIC 0x65B05EFA96C596EFULL +#define CBD_TRANSPORT_VERSION 1 + +#define CBDT_INFO_OFF 0 +#define CBDT_INFO_SIZE PAGE_SIZE + +#define CBDT_HOST_INFO_SIZE round_up(sizeof(struct cbd_host_info), PAGE_SIZE) 
+#define CBDT_BACKEND_INFO_SIZE round_up(sizeof(struct cbd_backend_info), PAGE_SIZE) +#define CBDT_BLKDEV_INFO_SIZE round_up(sizeof(struct cbd_blkdev_info), PAGE_SIZE) + +#define CBD_TRASNPORT_SIZE_MIN (512 * 1024 * 1024) + +/* + * CBD structure diagram: + * + * +--------------+ + * | cbd_transport| +----------+ + * +--------------+ | cbd_host | + * | | +----------+ + * | host +---------------------------------------------->| | + * +--------------------+ backends | | hostname | + * | | devices +------------------------------------------+ | | + * | | | | +----------+ + * | +--------------+ | + * | | + * | | + * | | + * | | + * | | + * v v + * +------------+ +-----------+ +------+ +-----------+ +-----------+ +------+ + * | cbd_backend+---->|cbd_backend+---->| NULL | | cbd_blkdev+----->| cbd_blkdev+---->| NULL | + * +------------+ +-----------+ +------+ +-----------+ +-----------+ +------+ + * +------------+ cbd_cache | | handlers | +------+ queues | | queues | + * | | | +-----------+ | | | +-----------+ + * | +------+ handlers | | | | + * | | +------------+ | | cbd_cache +-------------------------------------+ + * | | | +-----------+ | + * | | | | + * | | +-------------+ +-------------+ +------+ | +-----------+ +-----------+ +------+ | + * | +----->| cbd_handler +------>| cbd_handler +---------->| NULL | +----->| cbd_queue +----->| cbd_queue +---->| NULL | | + * | +-------------+ +-------------+ +------+ +-----------+ +-----------+ +------+ | + * | +------+ channel | | channel | +------+ channel | | channel | | + * | | +-------------+ +-------------+ | +-----------+ +-----------+ | + * | | | | + * | | | | + * | | | | + * | | v | + * | | +-----------------------+ | + * | +------------------------------------------------------->| cbd_channel | | + * | +-----------------------+ | + * | | channel_id | | + * | | cmdr (cmd ring) | | + * | | compr (complete ring) | | + * | | data (data area) | | + * | | | | + * | +-----------------------+ | + * | | + * | +-----------------------------+ | + * +------------------------------------------------>| cbd_cache |<------------------------------------------------------------+ + * +-----------------------------+ + * | cache_wq | + * | cache_tree | + * | segments[] | + * +-----------------------------+ + */ + +#define CBD_DEVICE(OBJ) \ +struct cbd_## OBJ ##_device { \ + struct device dev; \ + struct cbd_transport *cbdt; \ + struct cbd_## OBJ ##_info *OBJ##_info; \ +}; \ + \ +struct cbd_## OBJ ##s_device { \ + struct device OBJ ##s_dev; \ + struct cbd_## OBJ ##_device OBJ ##_devs[]; \ +} + +/* cbd_worker_cfg*/ +struct cbd_worker_cfg { + u32 busy_retry_cur; + u32 busy_retry_count; + u32 busy_retry_max; + u32 busy_retry_min; + u64 busy_retry_interval; +}; + +static inline void cbdwc_init(struct cbd_worker_cfg *cfg) +{ + /* init cbd_worker_cfg with default values */ + cfg->busy_retry_cur = 0; + cfg->busy_retry_count = 100; + cfg->busy_retry_max = cfg->busy_retry_count * 2; + cfg->busy_retry_min = 0; + cfg->busy_retry_interval = 1; /* 1us */ +} + +/* reset retry_cur and increase busy_retry_count */ +static inline void cbdwc_hit(struct cbd_worker_cfg *cfg) +{ + u32 delta; + + cfg->busy_retry_cur = 0; + + if (cfg->busy_retry_count == cfg->busy_retry_max) + return; + + /* retry_count increase by 1/16 */ + delta = cfg->busy_retry_count >> 4; + if (!delta) + delta = (cfg->busy_retry_max + cfg->busy_retry_min) >> 1; + + cfg->busy_retry_count += delta; + + if (cfg->busy_retry_count > cfg->busy_retry_max) + cfg->busy_retry_count = cfg->busy_retry_max; +} + +/* reset 
retry_cur and decrease busy_retry_count */ +static inline void cbdwc_miss(struct cbd_worker_cfg *cfg) +{ + u32 delta; + + cfg->busy_retry_cur = 0; + + if (cfg->busy_retry_count == cfg->busy_retry_min) + return; + + /* retry_count decrease by 1/16 */ + delta = cfg->busy_retry_count >> 4; + if (!delta) + delta = cfg->busy_retry_count; + + cfg->busy_retry_count -= delta; +} + +static inline bool cbdwc_need_retry(struct cbd_worker_cfg *cfg) +{ + if (++cfg->busy_retry_cur < cfg->busy_retry_count) { + cpu_relax(); + fsleep(cfg->busy_retry_interval); + return true; + } + + return false; +} + +/* cbd_transport */ +#define CBDT_INFO_F_BIGENDIAN (1 << 0) +#define CBDT_INFO_F_CRC (1 << 1) + +#ifdef CONFIG_CBD_MULTIHOST +#define CBDT_HOSTS_MAX 16 +#else +#define CBDT_HOSTS_MAX 1 +#endif /*CONFIG_CBD_MULTIHOST*/ + +struct cbd_transport_info { + __le64 magic; + __le16 version; + __le16 flags; + + u32 hosts_registered; + + u64 host_area_off; + u32 host_info_size; + u32 host_num; + + u64 backend_area_off; + u32 backend_info_size; + u32 backend_num; + + u64 blkdev_area_off; + u32 blkdev_info_size; + u32 blkdev_num; + + u64 segment_area_off; + u32 segment_size; + u32 segment_num; +}; + +struct cbd_transport { + u16 id; + struct device device; + struct mutex lock; + struct mutex adm_lock; + + struct cbd_transport_info *transport_info; + + struct cbd_host *host; + struct list_head backends; + struct list_head devices; + + struct cbd_hosts_device *cbd_hosts_dev; + struct cbd_segments_device *cbd_segments_dev; + struct cbd_backends_device *cbd_backends_dev; + struct cbd_blkdevs_device *cbd_blkdevs_dev; + + struct dax_device *dax_dev; + struct file *bdev_file; +}; + +struct cbdt_register_options { + char hostname[CBD_NAME_LEN]; + char path[CBD_PATH_LEN]; + u16 format:1; + u16 force:1; + u16 unused:15; +}; + +struct cbd_blkdev; +struct cbd_backend; +struct cbd_backend_io; +struct cbd_cache; + +int cbdt_register(struct cbdt_register_options *opts); +int cbdt_unregister(u32 transport_id); + +struct cbd_host_info *cbdt_get_host_info(struct cbd_transport *cbdt, u32 id); +struct cbd_backend_info *cbdt_get_backend_info(struct cbd_transport *cbdt, u32 id); +struct cbd_blkdev_info *cbdt_get_blkdev_info(struct cbd_transport *cbdt, u32 id); +struct cbd_segment_info *cbdt_get_segment_info(struct cbd_transport *cbdt, u32 id); +static inline struct cbd_channel_info *cbdt_get_channel_info(struct cbd_transport *cbdt, u32 id) +{ + return (struct cbd_channel_info *)cbdt_get_segment_info(cbdt, id); +} + +int cbdt_get_empty_host_id(struct cbd_transport *cbdt, u32 *id); +int cbdt_get_empty_backend_id(struct cbd_transport *cbdt, u32 *id); +int cbdt_get_empty_blkdev_id(struct cbd_transport *cbdt, u32 *id); +int cbdt_get_empty_segment_id(struct cbd_transport *cbdt, u32 *id); + +void cbdt_add_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb); +void cbdt_del_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb); +struct cbd_backend *cbdt_get_backend(struct cbd_transport *cbdt, u32 id); +void cbdt_add_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev); +void cbdt_del_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev); +struct cbd_blkdev *cbdt_get_blkdev(struct cbd_transport *cbdt, u32 id); + +struct page *cbdt_page(struct cbd_transport *cbdt, u64 transport_off, u32 *page_off); +void cbdt_zero_range(struct cbd_transport *cbdt, void *pos, u32 size); + +/* cbd_host */ +CBD_DEVICE(host); + +enum cbd_host_state { + cbd_host_state_none = 0, + cbd_host_state_running, + cbd_host_state_removing +}; + 
+struct cbd_host_info { + u8 state; + u64 alive_ts; + char hostname[CBD_NAME_LEN]; +}; + +struct cbd_host { + u32 host_id; + struct cbd_transport *cbdt; + + struct cbd_host_device *dev; + struct cbd_host_info *host_info; + struct delayed_work hb_work; /* heartbeat work */ +}; + +int cbd_host_register(struct cbd_transport *cbdt, char *hostname); +int cbd_host_unregister(struct cbd_transport *cbdt); +int cbd_host_clear(struct cbd_transport *cbdt, u32 host_id); +bool cbd_host_info_is_alive(struct cbd_host_info *info); + +/* cbd_segment */ +CBD_DEVICE(segment); + +enum cbd_segment_state { + cbd_segment_state_none = 0, + cbd_segment_state_running, + cbd_segment_state_removing +}; + +enum cbd_seg_type { + cbds_type_none = 0, + cbds_type_channel, + cbds_type_cache +}; + +static inline const char *cbds_type_str(enum cbd_seg_type type) +{ + if (type == cbds_type_channel) + return "channel"; + else if (type == cbds_type_cache) + return "cache"; + + return "Unknown"; +} + +struct cbd_segment_info { + u8 state; + u8 type; + u8 ref; + u32 next_seg; + u64 alive_ts; +}; + +struct cbd_seg_pos { + struct cbd_segment *segment; + u32 off; +}; + +struct cbd_seg_ops { + void (*sanitize_pos)(struct cbd_seg_pos *pos); +}; + +struct cbds_init_options { + u32 seg_id; + enum cbd_seg_type type; + u32 data_off; + struct cbd_seg_ops *seg_ops; + void *priv_data; +}; + +struct cbd_segment { + struct cbd_transport *cbdt; + struct cbd_segment *next; + + u32 seg_id; + struct cbd_segment_info *segment_info; + struct cbd_seg_ops *seg_ops; + + void *data; + u32 data_size; + + void *priv_data; + + struct delayed_work hb_work; /* heartbeat work */ +}; + +int cbd_segment_clear(struct cbd_transport *cbdt, u32 segment_id); +void cbd_segment_init(struct cbd_transport *cbdt, struct cbd_segment *segment, + struct cbds_init_options *options); +void cbd_segment_exit(struct cbd_segment *segment); +bool cbd_segment_info_is_alive(struct cbd_segment_info *info); +void cbds_copy_to_bio(struct cbd_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off); +void cbds_copy_from_bio(struct cbd_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off); +u32 cbd_seg_crc(struct cbd_segment *segment, u32 data_off, u32 data_len); +int cbds_map_pages(struct cbd_segment *segment, struct cbd_backend_io *io); +int cbds_pos_advance(struct cbd_seg_pos *seg_pos, u32 len); +void cbds_copy_data(struct cbd_seg_pos *dst_pos, + struct cbd_seg_pos *src_pos, u32 len); + +/* cbd_channel */ + +enum cbdc_blkdev_state { + cbdc_blkdev_state_none = 0, + cbdc_blkdev_state_running, + cbdc_blkdev_state_stopped, +}; + +enum cbdc_backend_state { + cbdc_backend_state_none = 0, + cbdc_backend_state_running, + cbdc_backend_state_stopped, +}; + +struct cbd_channel_info { + struct cbd_segment_info seg_info; /* must be the first member */ + u8 blkdev_state; + u32 blkdev_id; + + u8 backend_state; + u32 backend_id; + + u32 polling:1; + + u32 submr_head; + u32 submr_tail; + + u32 compr_head; + u32 compr_tail; +}; + +struct cbd_channel { + u32 seg_id; + struct cbd_segment segment; + + struct cbd_channel_info *channel_info; + + struct cbd_transport *cbdt; + + void *submr; + void *compr; + + u32 submr_size; + u32 compr_size; + + u32 data_size; + u32 data_head; + u32 data_tail; + + spinlock_t submr_lock; + spinlock_t compr_lock; +}; + +void cbd_channel_init(struct cbd_channel *channel, struct cbd_transport *cbdt, u32 seg_id); +void cbd_channel_exit(struct cbd_channel *channel); +void cbdc_copy_from_bio(struct cbd_channel *channel, + u32 data_off, 
u32 data_len, struct bio *bio, u32 bio_off); +void cbdc_copy_to_bio(struct cbd_channel *channel, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off); +u32 cbd_channel_crc(struct cbd_channel *channel, u32 data_off, u32 data_len); +int cbdc_map_pages(struct cbd_channel *channel, struct cbd_backend_io *io); +int cbd_get_empty_channel_id(struct cbd_transport *cbdt, u32 *id); +ssize_t cbd_channel_seg_detail_show(struct cbd_channel_info *channel_info, char *buf); + +/* cbd cache */ +struct cbd_cache_seg_info { + struct cbd_segment_info segment_info; /* first member */ + u32 flags; + u32 next_cache_seg_id; /* index in cache->segments */ + u64 gen; +}; + +#define CBD_CACHE_SEG_FLAGS_HAS_NEXT (1 << 0) +#define CBD_CACHE_SEG_FLAGS_WB_DONE (1 << 1) +#define CBD_CACHE_SEG_FLAGS_GC_DONE (1 << 2) + +enum cbd_cache_blkdev_state { + cbd_cache_blkdev_state_none = 0, + cbd_cache_blkdev_state_running, + cbd_cache_blkdev_state_removing +}; + +struct cbd_cache_segment { + struct cbd_cache *cache; + u32 cache_seg_id; /* index in cache->segments */ + u32 used; + spinlock_t gen_lock; + struct cbd_cache_seg_info *cache_seg_info; + struct cbd_segment segment; + atomic_t refs; +}; + +struct cbd_cache_pos { + struct cbd_cache_segment *cache_seg; + u32 seg_off; +}; + +struct cbd_cache_pos_onmedia { + u32 cache_seg_id; + u32 seg_off; +}; + +struct cbd_cache_info { + u8 blkdev_state; + u32 blkdev_id; + + u32 seg_id; + u32 n_segs; + + u32 used_segs; + u16 gc_percent; + + struct cbd_cache_pos_onmedia key_tail_pos; + struct cbd_cache_pos_onmedia dirty_tail_pos; +}; + +struct cbd_cache_tree { + struct rb_root root; + spinlock_t tree_lock; +}; + +struct cbd_cache_data_head { + spinlock_t data_head_lock; + struct cbd_cache_pos head_pos; +}; + +struct cbd_cache_key { + struct cbd_cache *cache; + struct cbd_cache_tree *cache_tree; + struct kref ref; + + struct rb_node rb_node; + struct list_head list_node; + + u64 off; + u32 len; + u64 flags; + + struct cbd_cache_pos cache_pos; + + u64 seg_gen; +#ifdef CONFIG_CBD_CRC + u32 data_crc; +#endif +}; + +#define CBD_CACHE_KEY_FLAGS_EMPTY (1 << 0) +#define CBD_CACHE_KEY_FLAGS_CLEAN (1 << 1) + +struct cbd_cache_key_onmedia { + u64 off; + u32 len; + + u32 flags; + + u32 cache_seg_id; + u32 cache_seg_off; + + u64 seg_gen; +#ifdef CONFIG_CBD_CRC + u32 data_crc; +#endif +}; + +struct cbd_cache_kset_onmedia { + u32 crc; + u64 magic; + u64 flags; + u32 key_num; + struct cbd_cache_key_onmedia data[]; +}; + +#define CBD_KSET_FLAGS_LAST (1 << 0) + +#define CBD_KSET_MAGIC 0x676894a64e164f1aULL + +struct cbd_cache_kset { + struct cbd_cache *cache; + spinlock_t kset_lock; + struct delayed_work flush_work; + struct cbd_cache_kset_onmedia kset_onmedia; +}; + +enum cbd_cache_state { + cbd_cache_state_none = 0, + cbd_cache_state_running, + cbd_cache_state_stopping +}; + +struct cbd_cache { + struct cbd_transport *cbdt; + struct cbd_cache_info *cache_info; + u32 cache_id; /* same with related backend->backend_id */ + + u32 n_heads; + struct cbd_cache_data_head *data_heads; + + spinlock_t key_head_lock; + struct cbd_cache_pos key_head; + u32 n_ksets; + struct cbd_cache_kset *ksets; + + struct cbd_cache_pos key_tail; + struct cbd_cache_pos dirty_tail; + + struct kmem_cache *key_cache; + u32 n_trees; + struct cbd_cache_tree *cache_trees; + struct work_struct clean_work; + + spinlock_t miss_read_reqs_lock; + struct list_head miss_read_reqs; + struct work_struct miss_read_end_work; + + struct workqueue_struct *cache_wq; + + struct file *bdev_file; + u64 dev_size; + struct delayed_work writeback_work; 
+ struct delayed_work gc_work; + struct bio_set *bioset; + + struct kmem_cache *req_cache; + + u32 state:8; + u32 init_keys:1; + u32 start_writeback:1; + u32 start_gc:1; + + u32 n_segs; + unsigned long *seg_map; + u32 last_cache_seg; + spinlock_t seg_map_lock; + struct cbd_cache_segment segments[]; /* should be the last member */ +}; + +struct cbd_request; +struct cbd_cache_opts { + struct cbd_cache_info *cache_info; + bool alloc_segs; + bool start_writeback; + bool start_gc; + bool init_keys; + u64 dev_size; + u32 n_paral; + struct file *bdev_file; /* needed for start_writeback is true */ +}; + +struct cbd_cache *cbd_cache_alloc(struct cbd_transport *cbdt, + struct cbd_cache_opts *opts); +void cbd_cache_destroy(struct cbd_cache *cache); +int cbd_cache_handle_req(struct cbd_cache *cache, struct cbd_request *cbd_req); + +/* cbd_handler */ +struct cbd_handler { + struct cbd_backend *cbdb; + struct cbd_channel_info *channel_info; + + struct cbd_channel channel; + spinlock_t compr_lock; + + u32 se_to_handle; + u64 req_tid_expected; + + u32 polling:1; + + struct delayed_work handle_work; + struct cbd_worker_cfg handle_worker_cfg; + + struct hlist_node hash_node; + struct bio_set bioset; +}; + +void cbd_handler_destroy(struct cbd_handler *handler); +int cbd_handler_create(struct cbd_backend *cbdb, u32 seg_id); +void cbd_handler_notify(struct cbd_handler *handler); + +/* cbd_backend */ +CBD_DEVICE(backend); + +enum cbd_backend_state { + cbd_backend_state_none = 0, + cbd_backend_state_running, + cbd_backend_state_removing +}; + +#define CBDB_BLKDEV_COUNT_MAX 1 + +struct cbd_backend_info { + u8 state; + u32 host_id; + u32 blkdev_count; + u64 alive_ts; + u64 dev_size; /* nr_sectors */ + struct cbd_cache_info cache_info; + + char path[CBD_PATH_LEN]; +}; + +struct cbd_backend_io { + struct cbd_se *se; + u64 off; + u32 len; + struct bio *bio; + struct cbd_handler *handler; +}; + +#define CBD_BACKENDS_HANDLER_BITS 7 + +struct cbd_backend { + u32 backend_id; + char path[CBD_PATH_LEN]; + struct cbd_transport *cbdt; + struct cbd_backend_info *backend_info; + spinlock_t lock; + + struct block_device *bdev; + struct file *bdev_file; + + struct workqueue_struct *task_wq; + struct delayed_work state_work; + struct delayed_work hb_work; /* heartbeat work */ + + struct list_head node; /* cbd_transport->backends */ + DECLARE_HASHTABLE(handlers_hash, CBD_BACKENDS_HANDLER_BITS); + + struct cbd_backend_device *backend_device; + + struct kmem_cache *backend_io_cache; + + struct cbd_cache *cbd_cache; + struct device cache_dev; + bool cache_dev_registered; +}; + +int cbd_backend_start(struct cbd_transport *cbdt, char *path, u32 backend_id, u32 cache_segs); +int cbd_backend_stop(struct cbd_transport *cbdt, u32 backend_id); +int cbd_backend_clear(struct cbd_transport *cbdt, u32 backend_id); +int cbdb_add_handler(struct cbd_backend *cbdb, struct cbd_handler *handler); +void cbdb_del_handler(struct cbd_backend *cbdb, struct cbd_handler *handler); +bool cbd_backend_info_is_alive(struct cbd_backend_info *info); +bool cbd_backend_cache_on(struct cbd_backend_info *backend_info); +void cbd_backend_notify(struct cbd_backend *cbdb, u32 seg_id); + +/* cbd_queue */ +enum cbd_op { + CBD_OP_WRITE = 0, + CBD_OP_READ, + CBD_OP_FLUSH, +}; + +struct cbd_se { +#ifdef CONFIG_CBD_CRC + u32 se_crc; /* should be the first member */ + u32 data_crc; +#endif + u32 op; + u32 flags; + u64 req_tid; + + u64 offset; + u32 len; + + u32 data_off; + u32 data_len; +}; + +struct cbd_ce { +#ifdef CONFIG_CBD_CRC + u32 ce_crc; /* should be the first member 
*/ + u32 data_crc; +#endif + u64 req_tid; + u32 result; + u32 flags; +}; + +#ifdef CONFIG_CBD_CRC +static inline u32 cbd_se_crc(struct cbd_se *se) +{ + return crc32(0, (void *)se + 4, sizeof(*se) - 4); +} + +static inline u32 cbd_ce_crc(struct cbd_ce *ce) +{ + return crc32(0, (void *)ce + 4, sizeof(*ce) - 4); +} +#endif + +struct cbd_request { + struct cbd_queue *cbdq; + + struct cbd_se *se; + struct cbd_ce *ce; + struct request *req; + + u64 off; + struct bio *bio; + u32 bio_off; + spinlock_t lock; /* race between cache and complete_work to access bio */ + + enum cbd_op op; + u64 req_tid; + struct list_head inflight_reqs_node; + + u32 data_off; + u32 data_len; + + struct work_struct work; + + struct kref ref; + int ret; + struct cbd_request *parent; + + void *priv_data; + void (*end_req)(struct cbd_request *cbd_req, void *priv_data); +}; + +struct cbd_cache_req { + struct cbd_cache *cache; + enum cbd_op op; + struct work_struct work; +}; + +#define CBD_SE_FLAGS_DONE 1 + +static inline bool cbd_se_flags_test(struct cbd_se *se, u32 bit) +{ + return (se->flags & bit); +} + +static inline void cbd_se_flags_set(struct cbd_se *se, u32 bit) +{ + se->flags |= bit; +} + +enum cbd_queue_state { + cbd_queue_state_none = 0, + cbd_queue_state_running, + cbd_queue_state_removing +}; + +struct cbd_queue { + struct cbd_blkdev *cbd_blkdev; + u32 index; + struct list_head inflight_reqs; + spinlock_t inflight_reqs_lock; + u64 req_tid; + + u64 *released_extents; + + struct cbd_channel_info *channel_info; + struct cbd_channel channel; + + atomic_t state; + + struct delayed_work complete_work; + struct cbd_worker_cfg complete_worker_cfg; +}; + +int cbd_queue_start(struct cbd_queue *cbdq); +void cbd_queue_stop(struct cbd_queue *cbdq); +extern const struct blk_mq_ops cbd_mq_ops; +int cbd_queue_req_to_backend(struct cbd_request *cbd_req); +void cbd_req_get(struct cbd_request *cbd_req); +void cbd_req_put(struct cbd_request *cbd_req, int ret); +void cbd_queue_advance(struct cbd_queue *cbdq, struct cbd_request *cbd_req); + +/* cbd_blkdev */ +CBD_DEVICE(blkdev); + +enum cbd_blkdev_state { + cbd_blkdev_state_none = 0, + cbd_blkdev_state_running, + cbd_blkdev_state_removing +}; + +struct cbd_blkdev_info { + u8 state; + u64 alive_ts; + u32 backend_id; + u32 host_id; + u32 mapped_id; +}; + +struct cbd_blkdev { + u32 blkdev_id; /* index in transport blkdev area */ + u32 backend_id; + int mapped_id; /* id in block device such as: /dev/cbd0 */ + + struct cbd_backend *backend; /* reference to backend if blkdev and backend on the same host */ + + int major; /* blkdev assigned major */ + int minor; + struct gendisk *disk; /* blkdev's gendisk and rq */ + + struct mutex lock; + unsigned long open_count; /* protected by lock */ + + struct list_head node; + struct delayed_work hb_work; /* heartbeat work */ + + /* Block layer tags. 
*/ + struct blk_mq_tag_set tag_set; + + uint32_t num_queues; + struct cbd_queue *queues; + + u64 dev_size; + + struct workqueue_struct *task_wq; + + struct cbd_blkdev_device *blkdev_dev; + struct cbd_blkdev_info *blkdev_info; + + struct cbd_transport *cbdt; + + struct cbd_cache *cbd_cache; +}; + +int cbd_blkdev_init(void); +void cbd_blkdev_exit(void); +int cbd_blkdev_start(struct cbd_transport *cbdt, u32 backend_id, u32 queues); +int cbd_blkdev_stop(struct cbd_transport *cbdt, u32 devid, bool force); +int cbd_blkdev_clear(struct cbd_transport *cbdt, u32 devid); +bool cbd_blkdev_info_is_alive(struct cbd_blkdev_info *info); + +extern struct workqueue_struct *cbd_wq; + +#define cbd_setup_device(DEV, PARENT, TYPE, fmt, ...) \ +do { \ + device_initialize(DEV); \ + device_set_pm_not_required(DEV); \ + dev_set_name(DEV, fmt, ##__VA_ARGS__); \ + DEV->parent = PARENT; \ + DEV->type = TYPE; \ + \ + ret = device_add(DEV); \ +} while (0) + +#define CBD_OBJ_HEARTBEAT(OBJ) \ +static void OBJ##_hb_workfn(struct work_struct *work) \ +{ \ + struct cbd_##OBJ *obj = container_of(work, struct cbd_##OBJ, hb_work.work); \ + struct cbd_##OBJ##_info *info = obj->OBJ##_info; \ + \ + info->alive_ts = ktime_get_real(); \ + \ + queue_delayed_work(cbd_wq, &obj->hb_work, CBD_HB_INTERVAL); \ +} \ + \ +bool cbd_##OBJ##_info_is_alive(struct cbd_##OBJ##_info *info) \ +{ \ + ktime_t oldest, ts; \ + \ + ts = info->alive_ts; \ + oldest = ktime_sub_ms(ktime_get_real(), CBD_HB_TIMEOUT); \ + \ + if (ktime_after(ts, oldest)) \ + return true; \ + \ + return false; \ +} \ + \ +static ssize_t cbd_##OBJ##_alive_show(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + struct cbd_##OBJ##_device *_dev; \ + \ + _dev = container_of(dev, struct cbd_##OBJ##_device, dev); \ + \ + if (cbd_##OBJ##_info_is_alive(_dev->OBJ##_info)) \ + return sprintf(buf, "true\n"); \ + \ + return sprintf(buf, "false\n"); \ +} \ + \ +static DEVICE_ATTR(alive, 0400, cbd_##OBJ##_alive_show, NULL) + +#endif /* _CBD_INTERNAL_H */ diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c new file mode 100644 index 000000000000..30dc4cd6c77a --- /dev/null +++ b/drivers/block/cbd/cbd_transport.c @@ -0,0 +1,957 @@ +#include +#include "cbd_internal.h" + +#define CBDT_OBJ(OBJ, OBJ_SIZE) \ +extern struct device_type cbd_##OBJ##_type; \ +extern struct device_type cbd_##OBJ##s_type; \ + \ +static int cbd_##OBJ##s_init(struct cbd_transport *cbdt) \ +{ \ + struct cbd_##OBJ##s_device *devs; \ + struct cbd_##OBJ##_device *cbd_dev; \ + struct device *dev; \ + int i; \ + int ret; \ + \ + u32 memsize = struct_size(devs, OBJ##_devs, \ + cbdt->transport_info->OBJ##_num); \ + devs = kzalloc(memsize, GFP_KERNEL); \ + if (!devs) { \ + return -ENOMEM; \ + } \ + \ + dev = &devs->OBJ##s_dev; \ + device_initialize(dev); \ + device_set_pm_not_required(dev); \ + dev_set_name(dev, "cbd_" #OBJ "s"); \ + dev->parent = &cbdt->device; \ + dev->type = &cbd_##OBJ##s_type; \ + ret = device_add(dev); \ + if (ret) { \ + goto devs_free; \ + } \ + \ + for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) { \ + cbd_dev = &devs->OBJ##_devs[i]; \ + dev = &cbd_dev->dev; \ + \ + cbd_dev->cbdt = cbdt; \ + cbd_dev->OBJ##_info = cbdt_get_##OBJ##_info(cbdt, i); \ + device_initialize(dev); \ + device_set_pm_not_required(dev); \ + dev_set_name(dev, #OBJ "%u", i); \ + dev->parent = &devs->OBJ##s_dev; \ + dev->type = &cbd_##OBJ##_type; \ + \ + ret = device_add(dev); \ + if (ret) { \ + i--; \ + goto del_device; \ + } \ + } \ + cbdt->cbd_##OBJ##s_dev = devs; \ + \ + 
return 0; \ +del_device: \ + for (; i >= 0; i--) { \ + cbd_dev = &devs->OBJ##_devs[i]; \ + dev = &cbd_dev->dev; \ + device_del(dev); \ + } \ +devs_free: \ + kfree(devs); \ + return ret; \ +} \ + \ +static void cbd_##OBJ##s_exit(struct cbd_transport *cbdt) \ +{ \ + struct cbd_##OBJ##s_device *devs = cbdt->cbd_##OBJ##s_dev; \ + struct device *dev; \ + int i; \ + \ + if (!devs) \ + return; \ + \ + for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) { \ + struct cbd_##OBJ##_device *cbd_dev = &devs->OBJ##_devs[i]; \ + dev = &cbd_dev->dev; \ + \ + device_del(dev); \ + } \ + \ + device_del(&devs->OBJ##s_dev); \ + \ + kfree(devs); \ + cbdt->cbd_##OBJ##s_dev = NULL; \ + \ + return; \ +} \ + \ +static inline struct cbd_##OBJ##_info \ +*__get_##OBJ##_info(struct cbd_transport *cbdt, u32 id) \ +{ \ + struct cbd_transport_info *info = cbdt->transport_info; \ + void *start = cbdt->transport_info; \ + \ + start += info->OBJ##_area_off; \ + \ + return start + ((u64)info->OBJ_SIZE * id); \ +} \ + \ +struct cbd_##OBJ##_info \ +*cbdt_get_##OBJ##_info(struct cbd_transport *cbdt, u32 id) \ +{ \ + struct cbd_##OBJ##_info *info; \ + \ + mutex_lock(&cbdt->lock); \ + info = __get_##OBJ##_info(cbdt, id); \ + mutex_unlock(&cbdt->lock); \ + \ + return info; \ +} \ + \ +int cbdt_get_empty_##OBJ##_id(struct cbd_transport *cbdt, u32 *id) \ +{ \ + struct cbd_transport_info *info = cbdt->transport_info; \ + struct cbd_##OBJ##_info *_info; \ + int ret = 0; \ + int i; \ + \ + mutex_lock(&cbdt->lock); \ + for (i = 0; i < info->OBJ##_num; i++) { \ + _info = __get_##OBJ##_info(cbdt, i); \ + if (_info->state == cbd_##OBJ##_state_none) { \ + cbdt_zero_range(cbdt, _info, info->OBJ_SIZE); \ + *id = i; \ + goto out; \ + } \ + } \ + \ + cbdt_err(cbdt, "No available " #OBJ "_id found."); \ + ret = -ENOENT; \ +out: \ + mutex_unlock(&cbdt->lock); \ + \ + return ret; \ +} + +CBDT_OBJ(host, host_info_size); +CBDT_OBJ(backend, backend_info_size); +CBDT_OBJ(blkdev, blkdev_info_size); +CBDT_OBJ(segment, segment_size); + +static struct cbd_transport *cbd_transports[CBD_TRANSPORT_MAX]; +static DEFINE_IDA(cbd_transport_id_ida); +static DEFINE_MUTEX(cbd_transport_mutex); + +extern struct bus_type cbd_bus_type; +extern struct device cbd_root_dev; + +static ssize_t cbd_myhost_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_transport *cbdt; + struct cbd_host *host; + + cbdt = container_of(dev, struct cbd_transport, device); + + host = cbdt->host; + if (!host) + return 0; + + return sprintf(buf, "%d\n", host->host_id); +} + +static DEVICE_ATTR(my_host_id, 0400, cbd_myhost_show, NULL); + +enum { + CBDT_ADM_OPT_ERR = 0, + CBDT_ADM_OPT_OP, + CBDT_ADM_OPT_FORCE, + CBDT_ADM_OPT_PATH, + CBDT_ADM_OPT_BID, + CBDT_ADM_OPT_DID, + CBDT_ADM_OPT_QUEUES, + CBDT_ADM_OPT_HID, + CBDT_ADM_OPT_SID, + CBDT_ADM_OPT_CACHE_SIZE, +}; + +enum { + CBDT_ADM_OP_B_START, + CBDT_ADM_OP_B_STOP, + CBDT_ADM_OP_B_CLEAR, + CBDT_ADM_OP_DEV_START, + CBDT_ADM_OP_DEV_STOP, + CBDT_ADM_OP_DEV_CLEAR, + CBDT_ADM_OP_H_CLEAR, + CBDT_ADM_OP_S_CLEAR, +}; + +static const char *const adm_op_names[] = { + [CBDT_ADM_OP_B_START] = "backend-start", + [CBDT_ADM_OP_B_STOP] = "backend-stop", + [CBDT_ADM_OP_B_CLEAR] = "backend-clear", + [CBDT_ADM_OP_DEV_START] = "dev-start", + [CBDT_ADM_OP_DEV_STOP] = "dev-stop", + [CBDT_ADM_OP_DEV_CLEAR] = "dev-clear", + [CBDT_ADM_OP_H_CLEAR] = "host-clear", + [CBDT_ADM_OP_S_CLEAR] = "segment-clear", +}; + +static const match_table_t adm_opt_tokens = { + { CBDT_ADM_OPT_OP, "op=%s" }, + { CBDT_ADM_OPT_FORCE, "force=%u" }, + { 
CBDT_ADM_OPT_PATH, "path=%s" }, + { CBDT_ADM_OPT_BID, "backend_id=%u" }, + { CBDT_ADM_OPT_DID, "dev_id=%u" }, + { CBDT_ADM_OPT_QUEUES, "queues=%u" }, + { CBDT_ADM_OPT_HID, "host_id=%u" }, + { CBDT_ADM_OPT_SID, "segment_id=%u" }, + { CBDT_ADM_OPT_CACHE_SIZE, "cache_size=%u" }, /* unit is MiB */ + { CBDT_ADM_OPT_ERR, NULL } +}; + + +struct cbd_adm_options { + u16 op; + u16 force:1; + u32 backend_id; + union { + struct host_options { + u32 hid; + } host; + struct backend_options { + char path[CBD_PATH_LEN]; + u64 cache_size_M; + } backend; + struct segment_options { + u32 sid; + } segment; + struct blkdev_options { + u32 devid; + u32 queues; + } blkdev; + }; +}; + +static int parse_adm_options(struct cbd_transport *cbdt, + char *buf, + struct cbd_adm_options *opts) +{ + substring_t args[MAX_OPT_ARGS]; + char *o, *p; + int token, ret = 0; + + o = buf; + + while ((p = strsep(&o, ",\n")) != NULL) { + if (!*p) + continue; + + token = match_token(p, adm_opt_tokens, args); + switch (token) { + case CBDT_ADM_OPT_OP: + ret = match_string(adm_op_names, ARRAY_SIZE(adm_op_names), args[0].from); + if (ret < 0) { + cbdt_err(cbdt, "unknown op: '%s'\n", args[0].from); + ret = -EINVAL; + break; + } + opts->op = ret; + break; + case CBDT_ADM_OPT_PATH: + if (match_strlcpy(opts->backend.path, &args[0], + CBD_PATH_LEN) == 0) { + ret = -EINVAL; + break; + } + break; + case CBDT_ADM_OPT_FORCE: + if (match_uint(args, &token) || token != 1) { + ret = -EINVAL; + goto out; + } + opts->force = 1; + break; + case CBDT_ADM_OPT_BID: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->backend_id = token; + break; + case CBDT_ADM_OPT_DID: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->blkdev.devid = token; + break; + case CBDT_ADM_OPT_QUEUES: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->blkdev.queues = token; + break; + case CBDT_ADM_OPT_HID: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->host.hid = token; + break; + case CBDT_ADM_OPT_SID: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->segment.sid = token; + break; + case CBDT_ADM_OPT_CACHE_SIZE: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->backend.cache_size_M = token; + break; + default: + cbdt_err(cbdt, "unknown parameter or missing value '%s'\n", p); + ret = -EINVAL; + goto out; + } + } + +out: + return ret; +} + +void cbdt_zero_range(struct cbd_transport *cbdt, void *pos, u32 size) +{ + memset(pos, 0, size); +} + +static void segments_format(struct cbd_transport *cbdt) +{ + u32 i; + struct cbd_segment_info *seg_info; + + for (i = 0; i < cbdt->transport_info->segment_num; i++) { + seg_info = cbdt_get_segment_info(cbdt, i); + cbdt_zero_range(cbdt, seg_info, sizeof(struct cbd_segment_info)); + } +} + +static int cbd_transport_format(struct cbd_transport *cbdt, bool force) +{ + struct cbd_transport_info *info = cbdt->transport_info; + u64 transport_dev_size; + u32 seg_size; + u32 nr_segs; + u64 magic; + u16 flags = 0; + + magic = le64_to_cpu(info->magic); + if (magic && !force) + return -EEXIST; + + transport_dev_size = bdev_nr_bytes(file_bdev(cbdt->bdev_file)); + if (transport_dev_size < CBD_TRASNPORT_SIZE_MIN) { + cbdt_err(cbdt, "dax device is too small, required at least %u", + CBD_TRASNPORT_SIZE_MIN); + return -ENOSPC; + } + + memset(info, 0, sizeof(*info)); + + info->magic = cpu_to_le64(CBD_TRANSPORT_MAGIC); + info->version = cpu_to_le16(CBD_TRANSPORT_VERSION); +#if defined(__BYTE_ORDER) ? 
__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) + flags |= CBDT_INFO_F_BIGENDIAN; +#endif +#ifdef CONFIG_CBD_CRC + flags |= CBDT_INFO_F_CRC; +#endif + info->flags = cpu_to_le16(flags); + + info->hosts_registered = 0; + /* + * Try to fully utilize all available space, + * assuming host:blkdev:backend:segment = 1:1:1:1 + */ + seg_size = (CBDT_HOST_INFO_SIZE + CBDT_BACKEND_INFO_SIZE + + CBDT_BLKDEV_INFO_SIZE + CBDT_SEG_SIZE); + nr_segs = (transport_dev_size - CBDT_INFO_SIZE) / seg_size; + + info->host_area_off = CBDT_INFO_OFF + CBDT_INFO_SIZE; + info->host_info_size = CBDT_HOST_INFO_SIZE; + info->host_num = nr_segs; + + info->backend_area_off = info->host_area_off + (info->host_info_size * info->host_num); + info->backend_info_size = CBDT_BACKEND_INFO_SIZE; + info->backend_num = nr_segs; + + info->blkdev_area_off = info->backend_area_off + (info->backend_info_size * info->backend_num); + info->blkdev_info_size = CBDT_BLKDEV_INFO_SIZE; + info->blkdev_num = nr_segs; + + info->segment_area_off = info->blkdev_area_off + (info->blkdev_info_size * info->blkdev_num); + info->segment_size = CBDT_SEG_SIZE; + info->segment_num = nr_segs; + + cbdt_zero_range(cbdt, (void *)info + info->host_area_off, + info->segment_area_off - info->host_area_off); + + segments_format(cbdt); + + return 0; +} + +/* + * Any transport metadata allocation or reclaim should happen in the + * control operation routine, i.e. within `adm_store()`, so that no other + * code path allocates or frees transport space. By making `adm_store()` + * exclusive, we can manage space safely: in a single-host scenario, + * `adm_lock` ensures mutual exclusion of `adm_store()`; in a multi-host + * scenario, however, a distributed lock would be needed to guarantee that + * all `adm_store()` calls across hosts are mutually exclusive. + * + * TODO: Is there a way to lock the CXL shared memory device?
+ */ +static ssize_t adm_store(struct device *dev, + struct device_attribute *attr, + const char *ubuf, + size_t size) +{ + int ret; + char *buf; + struct cbd_adm_options opts = { 0 }; + struct cbd_transport *cbdt; + + opts.backend_id = U32_MAX; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + cbdt = container_of(dev, struct cbd_transport, device); + + buf = kmemdup(ubuf, size + 1, GFP_KERNEL); + if (IS_ERR(buf)) { + cbdt_err(cbdt, "failed to dup buf for adm option: %d", (int)PTR_ERR(buf)); + return PTR_ERR(buf); + } + buf[size] = '\0'; + ret = parse_adm_options(cbdt, buf, &opts); + if (ret < 0) { + kfree(buf); + return ret; + } + kfree(buf); + + mutex_lock(&cbdt->adm_lock); + switch (opts.op) { + case CBDT_ADM_OP_B_START: + u32 cache_segs = 0; + + if (opts.backend.cache_size_M > 0) + cache_segs = DIV_ROUND_UP(opts.backend.cache_size_M, + cbdt->transport_info->segment_size / CBD_MB); + + ret = cbd_backend_start(cbdt, opts.backend.path, opts.backend_id, cache_segs); + break; + case CBDT_ADM_OP_B_STOP: + ret = cbd_backend_stop(cbdt, opts.backend_id); + break; + case CBDT_ADM_OP_B_CLEAR: + ret = cbd_backend_clear(cbdt, opts.backend_id); + break; + case CBDT_ADM_OP_DEV_START: + if (opts.blkdev.queues > CBD_QUEUES_MAX) { + mutex_unlock(&cbdt->adm_lock); + cbdt_err(cbdt, "invalid queues = %u, larger than max %u\n", + opts.blkdev.queues, CBD_QUEUES_MAX); + return -EINVAL; + } + ret = cbd_blkdev_start(cbdt, opts.backend_id, opts.blkdev.queues); + break; + case CBDT_ADM_OP_DEV_STOP: + ret = cbd_blkdev_stop(cbdt, opts.blkdev.devid, opts.force); + break; + case CBDT_ADM_OP_DEV_CLEAR: + ret = cbd_blkdev_clear(cbdt, opts.blkdev.devid); + break; + case CBDT_ADM_OP_H_CLEAR: + ret = cbd_host_clear(cbdt, opts.host.hid); + break; + case CBDT_ADM_OP_S_CLEAR: + ret = cbd_segment_clear(cbdt, opts.segment.sid); + break; + default: + mutex_unlock(&cbdt->adm_lock); + cbdt_err(cbdt, "invalid op: %d\n", opts.op); + return -EINVAL; + } + mutex_unlock(&cbdt->adm_lock); + + if (ret < 0) + return ret; + + return size; +} + +static DEVICE_ATTR_WO(adm); + +static ssize_t cbd_transport_info(struct cbd_transport *cbdt, char *buf) +{ + struct cbd_transport_info *info = cbdt->transport_info; + ssize_t ret; + + mutex_lock(&cbdt->lock); + info = cbdt->transport_info; + mutex_unlock(&cbdt->lock); + + ret = sprintf(buf, "magic: 0x%llx\n" + "version: %u\n" + "flags: %x\n\n" + "hosts_registered: %u\n" + "host_area_off: %llu\n" + "bytes_per_host_info: %u\n" + "host_num: %u\n\n" + "backend_area_off: %llu\n" + "bytes_per_backend_info: %u\n" + "backend_num: %u\n\n" + "blkdev_area_off: %llu\n" + "bytes_per_blkdev_info: %u\n" + "blkdev_num: %u\n\n" + "segment_area_off: %llu\n" + "bytes_per_segment: %u\n" + "segment_num: %u\n", + le64_to_cpu(info->magic), + le16_to_cpu(info->version), + le16_to_cpu(info->flags), + info->hosts_registered, + info->host_area_off, + info->host_info_size, + info->host_num, + info->backend_area_off, + info->backend_info_size, + info->backend_num, + info->blkdev_area_off, + info->blkdev_info_size, + info->blkdev_num, + info->segment_area_off, + info->segment_size, + info->segment_num); + + return ret; +} + +static ssize_t cbd_info_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_transport *cbdt; + + cbdt = container_of(dev, struct cbd_transport, device); + + return cbd_transport_info(cbdt, buf); +} +static DEVICE_ATTR(info, 0400, cbd_info_show, NULL); + +static struct attribute *cbd_transport_attrs[] = { + &dev_attr_adm.attr, + &dev_attr_info.attr, + 
&dev_attr_my_host_id.attr, + NULL +}; + +static struct attribute_group cbd_transport_attr_group = { + .attrs = cbd_transport_attrs, +}; + +static const struct attribute_group *cbd_transport_attr_groups[] = { + &cbd_transport_attr_group, + NULL +}; + +static void cbd_transport_release(struct device *dev) +{ +} + +const struct device_type cbd_transport_type = { + .name = "cbd_transport", + .groups = cbd_transport_attr_groups, + .release = cbd_transport_release, +}; + +static int +cbd_dax_notify_failure( + struct dax_device *dax_devp, + u64 offset, + u64 len, + int mf_flags) +{ + + pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n", + __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags); + return -EOPNOTSUPP; +} + +const struct dax_holder_operations cbd_dax_holder_ops = { + .notify_failure = cbd_dax_notify_failure, +}; + +static struct cbd_transport *cbdt_alloc(void) +{ + struct cbd_transport *cbdt; + int ret; + + cbdt = kzalloc(sizeof(struct cbd_transport), GFP_KERNEL); + if (!cbdt) + return NULL; + + mutex_init(&cbdt->lock); + mutex_init(&cbdt->adm_lock); + INIT_LIST_HEAD(&cbdt->backends); + INIT_LIST_HEAD(&cbdt->devices); + + ret = ida_simple_get(&cbd_transport_id_ida, 0, CBD_TRANSPORT_MAX, + GFP_KERNEL); + if (ret < 0) + goto cbdt_free; + + cbdt->id = ret; + cbd_transports[cbdt->id] = cbdt; + + return cbdt; + +cbdt_free: + kfree(cbdt); + return NULL; +} + +static void cbdt_destroy(struct cbd_transport *cbdt) +{ + cbd_transports[cbdt->id] = NULL; + ida_simple_remove(&cbd_transport_id_ida, cbdt->id); + kfree(cbdt); +} + +static int cbdt_dax_init(struct cbd_transport *cbdt, char *path) +{ + struct dax_device *dax_dev = NULL; + struct file *bdev_file = NULL; + long access_size; + void *kaddr; + u64 start_off = 0; + int ret; + int id; + + bdev_file = bdev_file_open_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE, cbdt, NULL); + if (IS_ERR(bdev_file)) { + cbdt_err(cbdt, "%s: failed blkdev_get_by_path(%s)\n", __func__, path); + ret = PTR_ERR(bdev_file); + goto err; + } + + dax_dev = fs_dax_get_by_bdev(file_bdev(bdev_file), &start_off, + cbdt, + &cbd_dax_holder_ops); + if (IS_ERR(dax_dev)) { + cbdt_err(cbdt, "%s: unable to get daxdev from bdev_file\n", __func__); + ret = -ENODEV; + goto fput; + } + + id = dax_read_lock(); + access_size = dax_direct_access(dax_dev, 0, 1, DAX_ACCESS, &kaddr, NULL); + if (access_size != 1) { + dax_read_unlock(id); + ret = -EINVAL; + goto dax_put; + } + + cbdt->bdev_file = bdev_file; + cbdt->dax_dev = dax_dev; + cbdt->transport_info = (struct cbd_transport_info *)kaddr; + dax_read_unlock(id); + + return 0; + +dax_put: + fs_put_dax(dax_dev, cbdt); +fput: + fput(bdev_file); +err: + return ret; +} + +static void cbdt_dax_release(struct cbd_transport *cbdt) +{ + if (cbdt->dax_dev) + fs_put_dax(cbdt->dax_dev, cbdt); + + if (cbdt->bdev_file) + fput(cbdt->bdev_file); +} + +static int cbd_transport_init(struct cbd_transport *cbdt) +{ + struct device *dev; + + dev = &cbdt->device; + device_initialize(dev); + device_set_pm_not_required(dev); + dev->bus = &cbd_bus_type; + dev->type = &cbd_transport_type; + dev->parent = &cbd_root_dev; + + dev_set_name(&cbdt->device, "transport%d", cbdt->id); + + return device_add(&cbdt->device); +} + + +static int cbdt_validate(struct cbd_transport *cbdt) +{ + u16 flags; + + if (le64_to_cpu(cbdt->transport_info->magic) != CBD_TRANSPORT_MAGIC) { + cbdt_err(cbdt, "unexpected magic: %llx\n", + le64_to_cpu(cbdt->transport_info->magic)); + return -EINVAL; + } + + flags = le16_to_cpu(cbdt->transport_info->flags); +#if 
defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) + if (!(flags & CBDT_INFO_F_BIGENDIAN)) { + cbdt_err(cbdt, "transport is not big endian\n"); + return -EINVAL; + } +#else + if (flags & CBDT_INFO_F_BIGENDIAN) { + cbdt_err(cbdt, "transport is big endian\n"); + return -EINVAL; + } +#endif + +#ifndef CONFIG_CBD_CRC + if (flags & CBDT_INFO_F_CRC) { + cbdt_err(cbdt, "transport expects CBD_CRC enabled.\n"); + return -ENOTSUPP; + } +#endif + + return 0; +} + +int cbdt_unregister(u32 tid) +{ + struct cbd_transport *cbdt; + + cbdt = cbd_transports[tid]; + if (!cbdt) { + pr_err("tid: %u, is not registered\n", tid); + return -EINVAL; + } + + mutex_lock(&cbdt->lock); + if (!list_empty(&cbdt->backends) || !list_empty(&cbdt->devices)) { + mutex_unlock(&cbdt->lock); + return -EBUSY; + } + mutex_unlock(&cbdt->lock); + + cbd_blkdevs_exit(cbdt); + cbd_segments_exit(cbdt); + cbd_backends_exit(cbdt); + cbd_hosts_exit(cbdt); + + cbd_host_unregister(cbdt); + cbdt->transport_info->hosts_registered--; + + device_unregister(&cbdt->device); + cbdt_dax_release(cbdt); + cbdt_destroy(cbdt); + module_put(THIS_MODULE); + + return 0; +} + +static bool cbdt_register_allowed(struct cbd_transport *cbdt) +{ + struct cbd_transport_info *transport_info; + + transport_info = cbdt->transport_info; + + if (transport_info->hosts_registered >= CBDT_HOSTS_MAX) { + cbdt_err(cbdt, "too many hosts registered: %u (max %u).", + transport_info->hosts_registered, + CBDT_HOSTS_MAX); + return false; + } + + return true; +} + +int cbdt_register(struct cbdt_register_options *opts) +{ + struct cbd_transport *cbdt; + int ret; + + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + if (!strstr(opts->path, "/dev/pmem")) { + pr_err("%s: path (%s) is not pmem\n", + __func__, opts->path); + ret = -EINVAL; + goto module_put; + } + + cbdt = cbdt_alloc(); + if (!cbdt) { + ret = -ENOMEM; + goto module_put; + } + + ret = cbdt_dax_init(cbdt, opts->path); + if (ret) + goto cbdt_destroy; + + if (!cbdt_register_allowed(cbdt)) { + ret = -EINVAL; + goto dax_release; + } + + if (opts->format) { + ret = cbd_transport_format(cbdt, opts->force); + if (ret < 0) + goto dax_release; + } + + ret = cbdt_validate(cbdt); + if (ret) + goto dax_release; + + ret = cbd_transport_init(cbdt); + if (ret) + goto dax_release; + + ret = cbd_host_register(cbdt, opts->hostname); + if (ret) + goto dev_unregister; + + if (cbd_hosts_init(cbdt) || cbd_backends_init(cbdt) || + cbd_segments_init(cbdt) || cbd_blkdevs_init(cbdt)) { + ret = -ENOMEM; + goto devs_exit; + } + + cbdt->transport_info->hosts_registered++; + + return 0; + +devs_exit: + cbd_blkdevs_exit(cbdt); + cbd_segments_exit(cbdt); + cbd_backends_exit(cbdt); + cbd_hosts_exit(cbdt); + + cbd_host_unregister(cbdt); +dev_unregister: + device_unregister(&cbdt->device); +dax_release: + cbdt_dax_release(cbdt); +cbdt_destroy: + cbdt_destroy(cbdt); +module_put: + module_put(THIS_MODULE); + + return ret; +} + +void cbdt_add_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb) +{ + mutex_lock(&cbdt->lock); + list_add(&cbdb->node, &cbdt->backends); + mutex_unlock(&cbdt->lock); +} + +void cbdt_del_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb) +{ + if (list_empty(&cbdb->node)) + return; + + mutex_lock(&cbdt->lock); + list_del_init(&cbdb->node); + mutex_unlock(&cbdt->lock); +} + +struct cbd_backend *cbdt_get_backend(struct cbd_transport *cbdt, u32 id) +{ + struct cbd_backend *backend; + + mutex_lock(&cbdt->lock); + list_for_each_entry(backend, &cbdt->backends, node) { + if 
(backend->backend_id == id) + goto out; + } + backend = NULL; +out: + mutex_unlock(&cbdt->lock); + return backend; +} + +void cbdt_add_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev) +{ + mutex_lock(&cbdt->lock); + list_add(&blkdev->node, &cbdt->devices); + mutex_unlock(&cbdt->lock); +} + +void cbdt_del_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev) +{ + if (list_empty(&blkdev->node)) + return; + + mutex_lock(&cbdt->lock); + list_del_init(&blkdev->node); + mutex_unlock(&cbdt->lock); +} + +struct cbd_blkdev *cbdt_get_blkdev(struct cbd_transport *cbdt, u32 id) +{ + struct cbd_blkdev *dev; + + mutex_lock(&cbdt->lock); + list_for_each_entry(dev, &cbdt->devices, node) { + if (dev->blkdev_id == id) + goto out; + } + dev = NULL; +out: + mutex_unlock(&cbdt->lock); + return dev; +} + +struct page *cbdt_page(struct cbd_transport *cbdt, u64 transport_off, u32 *page_off) +{ + long access_size; + pfn_t pfn; + + access_size = dax_direct_access(cbdt->dax_dev, transport_off >> PAGE_SHIFT, + 1, DAX_ACCESS, NULL, &pfn); + if (access_size < 0) + return NULL; + + if (page_off) + *page_off = transport_off & ~PAGE_MASK; + + return pfn_t_to_page(pfn); +}
From patchwork Wed Sep 18 10:18:15 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806775 Received: from out-181.mta1.migadu.com (out-181.mta1.migadu.com [95.215.58.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 17CDC17E01D for ; Wed, 18 Sep 2024 10:18:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654736; cv=none; b=FD0l1R5DmIuGRHW8i4P3k3WYPcMO8bWwv4SmY/r6EbCUDU9lY9mfh5Z6fQcL3eUWsBeiwlcAjhaTTI2oYIM/wQTJrhy3fDuQ6gWVhB7rzgKBd6K3WrTeoLNGdqTjJlyroZwQ7FYOqwk2aU5Gka4aSnG0jB2yxh/moh0B06zmJg4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654736; c=relaxed/simple; bh=3Z9cqqwVNBgZGNzCsATLdcwnVncJcipLmol4c5UwyR4=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=dzyXyiRfPJjyAuG+yiA79Zo800rI4v4mSvV6rLVqIuopnvV+Awx9HS35xdScUXm3QANg18SwdtdBPNd4C9MXweyuARXoLFiyRlYch4jfUqy0wP5uR3w/JX5fjFh676ddcvwyCgy6qexuiY/YQx1KhYoILVgOJvBjOUcmDfYtNKg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=fmEPI8GT; arc=none smtp.client-ip=95.215.58.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="fmEPI8GT" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1726654732; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8NF/WSkahAQpAxr8dlQx+eZwv0YPW32YvfGYGq2Rnzo=; b=fmEPI8GThz3nRX5Pnsv9KNNaxqFiVfEBV3CT0S+DYTdOjSRJlxoFT0tJpYwF9IZPu5txT4 7bIvOq9xuvD2H8VrmSnB0jP/ZHXCYB7QzMeQfKXxVabyOAUU2Cv3f3cJzMk0fVAXsu20Ni 7SpxbVEvYQlTgYaCo9FaAyMqj5IX6Xo= From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 2/8] cbd: introduce cbd_host Date: Wed, 18 Sep 2024 10:18:15 +0000 Message-Id: <20240918101821.681118-3-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT The "cbd_host" represents a host node. Each node needs to be registered before it can use the "cbd_transport". After registration, the node's information, such as its hostname, will be recorded in the "hosts" area of this transport. Through this mechanism, we can know which nodes are currently using each transport. If a host dies without unregistering, we allow the user to clear this host entry in the metadata. Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_host.c | 129 +++++++++++++++++++++++++++++++++++ 1 file changed, 129 insertions(+) create mode 100644 drivers/block/cbd/cbd_host.c diff --git a/drivers/block/cbd/cbd_host.c b/drivers/block/cbd/cbd_host.c new file mode 100644 index 000000000000..02f7ef52150d --- /dev/null +++ b/drivers/block/cbd/cbd_host.c @@ -0,0 +1,129 @@ +#include "cbd_internal.h" + +static ssize_t cbd_host_name_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_host_device *host; + struct cbd_host_info *host_info; + + host = container_of(dev, struct cbd_host_device, dev); + host_info = host->host_info; + + if (host_info->state == cbd_host_state_none) + return 0; + + return sprintf(buf, "%s\n", host_info->hostname); +} + +static DEVICE_ATTR(hostname, 0400, cbd_host_name_show, NULL); + +CBD_OBJ_HEARTBEAT(host); + +static struct attribute *cbd_host_attrs[] = { + &dev_attr_hostname.attr, + &dev_attr_alive.attr, + NULL +}; + +static struct attribute_group cbd_host_attr_group = { + .attrs = cbd_host_attrs, +}; + +static const struct attribute_group *cbd_host_attr_groups[] = { + &cbd_host_attr_group, + NULL +}; + +static void cbd_host_release(struct device *dev) +{ +} + +const struct device_type cbd_host_type = { + .name = "cbd_host", + .groups = cbd_host_attr_groups, + .release = cbd_host_release, +}; + +const struct device_type cbd_hosts_type = { + .name = "cbd_hosts", + .release = cbd_host_release, +}; + +int cbd_host_register(struct cbd_transport *cbdt, char *hostname) +{ + struct cbd_host *host; + struct cbd_host_info *host_info; + u32 host_id; + int ret; + + if (cbdt->host) + return -EEXIST; + + if (strlen(hostname) == 0) + return -EINVAL; + + ret = cbdt_get_empty_host_id(cbdt, &host_id); + if (ret < 0) + return ret; + + host = kzalloc(sizeof(struct cbd_host), 
GFP_KERNEL); + if (!host) + return -ENOMEM; + + host->host_id = host_id; + host->cbdt = cbdt; + INIT_DELAYED_WORK(&host->hb_work, host_hb_workfn); + + host_info = cbdt_get_host_info(cbdt, host_id); + host_info->state = cbd_host_state_running; + memcpy(host_info->hostname, hostname, CBD_NAME_LEN); + + host->host_info = host_info; + cbdt->host = host; + + queue_delayed_work(cbd_wq, &host->hb_work, 0); + + return 0; +} + +int cbd_host_unregister(struct cbd_transport *cbdt) +{ + struct cbd_host *host = cbdt->host; + struct cbd_host_info *host_info; + + if (!host) { + cbd_err("This host is not registered."); + return 0; + } + + host->host_info->state = cbd_host_state_removing; + cancel_delayed_work_sync(&host->hb_work); + host_info = host->host_info; + memset(host_info->hostname, 0, CBD_NAME_LEN); + host_info->alive_ts = 0; + host_info->state = cbd_host_state_none; + + cbdt->host = NULL; + kfree(cbdt->host); + + return 0; +} + +int cbd_host_clear(struct cbd_transport *cbdt, u32 host_id) +{ + struct cbd_host_info *host_info; + + host_info = cbdt_get_host_info(cbdt, host_id); + if (cbd_host_info_is_alive(host_info)) { + cbdt_err(cbdt, "host %u is still alive\n", host_id); + return -EBUSY; + } + + if (host_info->state == cbd_host_state_none) + return 0; + + host_info->state = cbd_host_state_none; + + return 0; +} From patchwork Wed Sep 18 10:18:16 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806776 Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 59FD917C991 for ; Wed, 18 Sep 2024 10:18:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654741; cv=none; b=NzF1Nv8efyLV8AAC/7FtjGOfy52EDisNPHBT60FozowBKFth1fAIGHgKoEIHpw1bSIqB/sEcP9NEWDYc53nEK3a55BBqR+4Ks6wP9yFYU6ju9EW2WiqamI2Mov6Lxuahw+KcHOkfUNjUsv+VHyUjeB8ozb69ua0eOliiYeb14lA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654741; c=relaxed/simple; bh=nKYBwq3DNPvYPSU91alzoWyvAqMifp7STvMO8nRDivg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=iNGhZXIUNnNkhWRW4+s5njszy39nF8XjeodjfsRKAhuQzvGKMwfrNCm1qM4iRZFpoduSNo4xadFr7XZuqp5AM7WgPCj7zWBZuS1++HUK03tckutiJ8RVSXkwec+sJmiWZ02DTf1JGkaGwT7AP1fhhD3HibdgS7kCKIHjjKIDfm4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=D4z8/D7i; arc=none smtp.client-ip=95.215.58.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="D4z8/D7i" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1726654736; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=X6k89tmJJFaVmsE+g86sdavkc0csK8GGCHyRyKeFHVs=; b=D4z8/D7ioQoY3OaTl9TGLPJY1dyeZC3HtLBPmtSXJJDIDpP/wqlTFlAKtpwurUDAdOTzRo hF46d+s437SBrtElxd1Dz/QdU9tvqO3Hm9dDzkgDp0k1wUMYPBRKahG6f05IgJ5RB2asBk qMzkSIz0o0t+zbJtwp9va2tJb1i2woM= From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 3/8] cbd: introduce cbd_segment Date: Wed, 18 Sep 2024 10:18:16 +0000 Message-Id: <20240918101821.681118-4-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT The `cbd_segments` is an abstraction of the data area in transport. The data area in transport is divided into segments. The specific use of this area is determined by `cbd_seg_type`. For example, `cbd_blkdev` and `cbd_backend` data transfers need to access a segment of the type `cbds_type_channel`. The segment also allows for more scenarios and more segment types to be expanded. Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_segment.c | 349 ++++++++++++++++++++++++++++++++ 1 file changed, 349 insertions(+) create mode 100644 drivers/block/cbd/cbd_segment.c diff --git a/drivers/block/cbd/cbd_segment.c b/drivers/block/cbd/cbd_segment.c new file mode 100644 index 000000000000..d7fbfee64059 --- /dev/null +++ b/drivers/block/cbd/cbd_segment.c @@ -0,0 +1,349 @@ +#include "cbd_internal.h" + +static ssize_t cbd_seg_detail_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_segment_device *segment; + struct cbd_segment_info *segment_info; + + segment = container_of(dev, struct cbd_segment_device, dev); + segment_info = segment->segment_info; + + if (segment_info->state == cbd_segment_state_none) + return 0; + + if (segment_info->type == cbds_type_channel) + return cbd_channel_seg_detail_show((struct cbd_channel_info *)segment_info, buf); + + return 0; +} + +static ssize_t cbd_seg_type_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_segment_device *segment; + struct cbd_segment_info *segment_info; + + segment = container_of(dev, struct cbd_segment_device, dev); + segment_info = segment->segment_info; + + if (segment_info->state == cbd_segment_state_none) + return 0; + + return sprintf(buf, "%s\n", cbds_type_str(segment_info->type)); +} + +static DEVICE_ATTR(detail, 0400, cbd_seg_detail_show, NULL); +static DEVICE_ATTR(type, 0400, cbd_seg_type_show, NULL); + +CBD_OBJ_HEARTBEAT(segment); + +static struct attribute *cbd_segment_attrs[] = { + &dev_attr_detail.attr, + &dev_attr_type.attr, + &dev_attr_alive.attr, + NULL +}; + +static struct attribute_group cbd_segment_attr_group = { + .attrs = cbd_segment_attrs, +}; + +static const struct attribute_group *cbd_segment_attr_groups[] = { + &cbd_segment_attr_group, + NULL +}; + +static void 
cbd_segment_release(struct device *dev) +{ +} + +const struct device_type cbd_segment_type = { + .name = "cbd_segment", + .groups = cbd_segment_attr_groups, + .release = cbd_segment_release, +}; + +const struct device_type cbd_segments_type = { + .name = "cbd_segments", + .release = cbd_segment_release, +}; + +void cbd_segment_init(struct cbd_transport *cbdt, struct cbd_segment *segment, + struct cbds_init_options *options) +{ + struct cbd_segment_info *segment_info = cbdt_get_segment_info(cbdt, options->seg_id); + + segment->cbdt = cbdt; + segment->segment_info = segment_info; + segment->seg_id = options->seg_id; + segment_info->type = options->type; + segment->seg_ops = options->seg_ops; + segment->data_size = CBDT_SEG_SIZE - options->data_off; + segment->data = (void *)(segment->segment_info) + options->data_off; + segment->priv_data = options->priv_data; + + segment_info->ref++; + segment_info->state = cbd_segment_state_running; + + INIT_DELAYED_WORK(&segment->hb_work, segment_hb_workfn); + queue_delayed_work(cbd_wq, &segment->hb_work, 0); +} + +void cbd_segment_exit(struct cbd_segment *segment) +{ + if (!segment->segment_info || + segment->segment_info->state != cbd_segment_state_running) + return; + + cancel_delayed_work_sync(&segment->hb_work); + + if (--segment->segment_info->ref > 0) + return; + + segment->segment_info->state = cbd_segment_state_none; + segment->segment_info->alive_ts = 0; +} + +int cbd_segment_clear(struct cbd_transport *cbdt, u32 seg_id) +{ + struct cbd_segment_info *segment_info; + + segment_info = cbdt_get_segment_info(cbdt, seg_id); + if (cbd_segment_info_is_alive(segment_info)) { + cbdt_err(cbdt, "segment %u is still alive\n", seg_id); + return -EBUSY; + } + + cbdt_zero_range(cbdt, segment_info, CBDT_SEG_SIZE); + + return 0; +} + +void cbds_copy_data(struct cbd_seg_pos *dst_pos, + struct cbd_seg_pos *src_pos, u32 len) +{ + u32 copied = 0; + u32 to_copy; + + while (copied < len) { + if (dst_pos->off >= dst_pos->segment->data_size) + dst_pos->segment->seg_ops->sanitize_pos(dst_pos); + + if (src_pos->off >= src_pos->segment->data_size) + src_pos->segment->seg_ops->sanitize_pos(src_pos); + + to_copy = len - copied; + + if (to_copy > dst_pos->segment->data_size - dst_pos->off) + to_copy = dst_pos->segment->data_size - dst_pos->off; + + if (to_copy > src_pos->segment->data_size - src_pos->off) + to_copy = src_pos->segment->data_size - src_pos->off; + + memcpy(dst_pos->segment->data + dst_pos->off, src_pos->segment->data + src_pos->off, to_copy); + + copied += to_copy; + + cbds_pos_advance(dst_pos, to_copy); + cbds_pos_advance(src_pos, to_copy); + } +} + +void cbds_copy_to_bio(struct cbd_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + struct bio_vec bv; + struct bvec_iter iter; + void *dst; + u32 to_copy, page_off = 0; + struct cbd_seg_pos pos = { .segment = segment, + .off = data_off }; + +next: + bio_for_each_segment(bv, bio, iter) { + if (bio_off > bv.bv_len) { + bio_off -= bv.bv_len; + continue; + } + page_off = bv.bv_offset; + page_off += bio_off; + bio_off = 0; + + dst = kmap_local_page(bv.bv_page); +again: + if (pos.off >= pos.segment->data_size) + segment->seg_ops->sanitize_pos(&pos); + segment = pos.segment; + + to_copy = min(bv.bv_offset + bv.bv_len - page_off, + segment->data_size - pos.off); + if (to_copy > data_len) + to_copy = data_len; + flush_dcache_page(bv.bv_page); + memcpy(dst + page_off, segment->data + pos.off, to_copy); + + /* advance */ + pos.off += to_copy; + page_off += to_copy; + data_len -= to_copy; + if 
(!data_len) { + kunmap_local(dst); + return; + } + + /* more data in this bv page */ + if (page_off < bv.bv_offset + bv.bv_len) + goto again; + kunmap_local(dst); + } + + if (bio->bi_next) { + bio = bio->bi_next; + goto next; + } +} + +void cbds_copy_from_bio(struct cbd_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + struct bio_vec bv; + struct bvec_iter iter; + void *src; + u32 to_copy, page_off = 0; + struct cbd_seg_pos pos = { .segment = segment, + .off = data_off }; + +next: + bio_for_each_segment(bv, bio, iter) { + if (bio_off > bv.bv_len) { + bio_off -= bv.bv_len; + continue; + } + page_off = bv.bv_offset; + page_off += bio_off; + bio_off = 0; + + src = kmap_local_page(bv.bv_page); +again: + if (pos.off >= pos.segment->data_size) + segment->seg_ops->sanitize_pos(&pos); + segment = pos.segment; + + to_copy = min(bv.bv_offset + bv.bv_len - page_off, + segment->data_size - pos.off); + if (to_copy > data_len) + to_copy = data_len; + + memcpy(segment->data + pos.off, src + page_off, to_copy); + flush_dcache_page(bv.bv_page); + + /* advance */ + pos.off += to_copy; + page_off += to_copy; + data_len -= to_copy; + if (!data_len) { + kunmap_local(src); + return; + } + + /* more data in this bv page */ + if (page_off < bv.bv_offset + bv.bv_len) + goto again; + kunmap_local(src); + } + + if (bio->bi_next) { + bio = bio->bi_next; + goto next; + } +} + +u32 cbd_seg_crc(struct cbd_segment *segment, u32 data_off, u32 data_len) +{ + u32 crc = 0; + u32 crc_size; + struct cbd_seg_pos pos = { .segment = segment, + .off = data_off }; + + while (data_len) { + if (pos.off >= pos.segment->data_size) + segment->seg_ops->sanitize_pos(&pos); + segment = pos.segment; + + crc_size = min(segment->data_size - pos.off, data_len); + + crc = crc32(crc, segment->data + pos.off, crc_size); + + data_len -= crc_size; + pos.off += crc_size; + } + + return crc; +} + +int cbds_map_pages(struct cbd_segment *segment, struct cbd_backend_io *io) +{ + struct cbd_transport *cbdt = segment->cbdt; + struct cbd_se *se = io->se; + u32 off = se->data_off; + u32 size = se->data_len; + u32 done = 0; + struct page *page; + u32 page_off; + int ret = 0; + int id; + + id = dax_read_lock(); + while (size) { + unsigned int len = min_t(size_t, PAGE_SIZE, size); + struct cbd_seg_pos pos = { .segment = segment, + .off = off + done }; + + if (pos.off >= pos.segment->data_size) + segment->seg_ops->sanitize_pos(&pos); + segment = pos.segment; + + u64 transport_off = segment->data - + (void *)cbdt->transport_info + pos.off; + + page = cbdt_page(cbdt, transport_off, &page_off); + + ret = bio_add_page(io->bio, page, len, 0); + if (unlikely(ret != len)) { + cbdt_err(cbdt, "failed to add page"); + goto out; + } + + done += len; + size -= len; + } + + ret = 0; +out: + dax_read_unlock(id); + return ret; +} + +int cbds_pos_advance(struct cbd_seg_pos *seg_pos, u32 len) +{ + u32 to_advance; + + while (len) { + to_advance = len; + + if (seg_pos->off >= seg_pos->segment->data_size) + seg_pos->segment->seg_ops->sanitize_pos(seg_pos); + + if (to_advance > seg_pos->segment->data_size - seg_pos->off) + to_advance = seg_pos->segment->data_size - seg_pos->off; + + seg_pos->off += to_advance; + + len -= to_advance; + } + + return 0; +} From patchwork Wed Sep 18 10:18:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806777 Received: from out-189.mta1.migadu.com (out-189.mta1.migadu.com [95.215.58.189]) (using TLSv1.2 with 
cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6C9E517D8A9 for ; Wed, 18 Sep 2024 10:19:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654745; cv=none; b=DJwCBjtrXDs11VBYDzeSyzW1+k+sdwX68NTVn3MHkkaUcZE5yhGvyFNsMOyNjpmpf0vKW5Gh+R4gFsElt7aNHJJWoXzl7bXwmD7hp3ZTchKVF0HBJA9DGfY6FStVfogGNoOKa4jBxBf08lSAUOOCr46tS5WcqlhUlifX+nd41AQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654745; c=relaxed/simple; bh=x5+e2My5BaOPTgvmct8wXbL95aiKhPSe+IaPqvjgdVY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=j95N3Dbi/0Bb90+E/ISzSbQp8SD9VL8OtNqg9fpcGclhSGqBu/4jr9Fn4Wr0QGCAbjhW7L2soHHcBhP30ao5L94uG9rgvIa+jMyI5iGSYguafMjjigf64KsouPkDHhwc6iX7X8/AFSpuz+w+7fmrHqAtqCj+JVOkXp4CGHlSjp4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=hSi2M9QR; arc=none smtp.client-ip=95.215.58.189 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="hSi2M9QR" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1726654740; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fyUBo72TgIvtSI2EW8Tbo6F1BzZphxvwxeH+Ee3WF98=; b=hSi2M9QReCJagBh+7SIts1zSjZpPlbxhx9lBEeUYTtaQ+fPrsTEP7FfTIQOMTMBO/UnQbb 6lSEevSb7EC8w9Qkd/TyDYHC3ImN1H9rDZaN0dGauepTZgYDO01ZBpVbHtPVqLWovGvqfW h9eOIBrYJBFMqaYXA1cq25raQD2Oa48= From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 4/8] cbd: introduce cbd_channel Date: Wed, 18 Sep 2024 10:18:17 +0000 Message-Id: <20240918101821.681118-5-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT
The "cbd_channel" is the component responsible for the interaction between the blkdev and the backend. It mainly provides the functions "cbdc_copy_to_bio", "cbdc_copy_from_bio" and "cbd_channel_crc". If either the blkdev or the backend is alive, there is an active user for this channel, so the channel is considered alive.
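To make the data path concrete: the channel uses a single segment as a ring, so a position that runs past the end of the data area wraps back to the start, and a copy is split at the wrap point (this is what cbd_channel_seg_sanitize_pos() in the patch below does). The following is a minimal userspace sketch of that ring behaviour; the names (struct ring, ring_pos_sanitize, ring_copy_in) are illustrative only and not part of the kernel code.

/*
 * Minimal userspace model of a single-segment ring: a position that
 * exceeds the data area wraps back to the start, and copies are split
 * at the wrap point. Names here are illustrative only.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

struct ring {
	uint8_t *data;
	uint32_t data_size;
};

struct ring_pos {
	struct ring *ring;
	uint32_t off;
};

/* wrap off back into [0, data_size), like the channel's sanitize_pos */
static void ring_pos_sanitize(struct ring_pos *pos)
{
	while (pos->off >= pos->ring->data_size)
		pos->off -= pos->ring->data_size;
}

/* copy len bytes into the ring at *pos, splitting the copy at the wrap point */
static void ring_copy_in(struct ring_pos *pos, const uint8_t *src, uint32_t len)
{
	uint32_t copied = 0;

	while (copied < len) {
		uint32_t to_copy = len - copied;
		uint32_t remain;

		ring_pos_sanitize(pos);
		remain = pos->ring->data_size - pos->off;
		if (to_copy > remain)
			to_copy = remain;

		memcpy(pos->ring->data + pos->off, src + copied, to_copy);
		pos->off += to_copy;
		copied += to_copy;
	}
}

int main(void)
{
	uint8_t buf[8] = { 0 };
	struct ring ring = { .data = buf, .data_size = sizeof(buf) };
	struct ring_pos pos = { .ring = &ring, .off = 6 };

	/* 4 bytes starting at offset 6 wrap: they land at 6, 7, 0, 1 */
	ring_copy_in(&pos, (const uint8_t *)"ABCD", 4);
	printf("%c%c %c%c\n", buf[6], buf[7], buf[0], buf[1]); /* prints "AB CD" */
	return 0;
}

Here the copy starts two bytes before the end of an eight-byte ring, so the last two bytes land back at offsets 0 and 1; the copy helpers in this patch split their copies in the same way whenever the position reaches the end of the segment's data area.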
Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_channel.c | 96 +++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) create mode 100644 drivers/block/cbd/cbd_channel.c diff --git a/drivers/block/cbd/cbd_channel.c b/drivers/block/cbd/cbd_channel.c new file mode 100644 index 000000000000..6d3f77b9dc79 --- /dev/null +++ b/drivers/block/cbd/cbd_channel.c @@ -0,0 +1,96 @@ +#include "cbd_internal.h" + +static void channel_format(struct cbd_transport *cbdt, u32 id) +{ + struct cbd_channel_info *channel_info = cbdt_get_channel_info(cbdt, id); + + cbdt_zero_range(cbdt, channel_info, CBDC_META_SIZE); +} + +int cbd_get_empty_channel_id(struct cbd_transport *cbdt, u32 *id) +{ + int ret; + + ret = cbdt_get_empty_segment_id(cbdt, id); + if (ret) + return ret; + + channel_format(cbdt, *id); + + return 0; +} + +void cbdc_copy_to_bio(struct cbd_channel *channel, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + cbds_copy_to_bio(&channel->segment, data_off, data_len, bio, bio_off); +} + +void cbdc_copy_from_bio(struct cbd_channel *channel, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + cbds_copy_from_bio(&channel->segment, data_off, data_len, bio, bio_off); +} + +u32 cbd_channel_crc(struct cbd_channel *channel, u32 data_off, u32 data_len) +{ + return cbd_seg_crc(&channel->segment, data_off, data_len); +} + + +int cbdc_map_pages(struct cbd_channel *channel, struct cbd_backend_io *io) +{ + return cbds_map_pages(&channel->segment, io); +} + +ssize_t cbd_channel_seg_detail_show(struct cbd_channel_info *channel_info, char *buf) +{ + return sprintf(buf, "channel backend id: %u\n" + "channel blkdev id: %u\n", + channel_info->backend_id, + channel_info->blkdev_id); +} + +static void cbd_channel_seg_sanitize_pos(struct cbd_seg_pos *pos) +{ + struct cbd_segment *segment = pos->segment; + + /* channel only use one segment as a ring */ + while (pos->off >= segment->data_size) + pos->off -= segment->data_size; +} + +static struct cbd_seg_ops cbd_channel_seg_ops = { + .sanitize_pos = cbd_channel_seg_sanitize_pos +}; + +void cbd_channel_init(struct cbd_channel *channel, struct cbd_transport *cbdt, u32 seg_id) +{ + struct cbd_channel_info *channel_info = cbdt_get_channel_info(cbdt, seg_id); + struct cbd_segment *segment = &channel->segment; + struct cbds_init_options seg_options; + + seg_options.seg_id = seg_id; + seg_options.type = cbds_type_channel; + seg_options.data_off = CBDC_DATA_OFF; + seg_options.seg_ops = &cbd_channel_seg_ops; + + cbd_segment_init(cbdt, segment, &seg_options); + + channel->cbdt = cbdt; + channel->channel_info = channel_info; + channel->seg_id = seg_id; + channel->submr = (void *)channel_info + CBDC_SUBMR_OFF; + channel->compr = (void *)channel_info + CBDC_COMPR_OFF; + channel->submr_size = rounddown(CBDC_SUBMR_SIZE, sizeof(struct cbd_se)); + channel->compr_size = rounddown(CBDC_COMPR_SIZE, sizeof(struct cbd_ce)); + channel->data_size = CBDC_DATA_SIZE; + + spin_lock_init(&channel->submr_lock); + spin_lock_init(&channel->compr_lock); +} + +void cbd_channel_exit(struct cbd_channel *channel) +{ + cbd_segment_exit(&channel->segment); +} From patchwork Wed Sep 18 10:18:18 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806778 Received: from out-182.mta1.migadu.com (out-182.mta1.migadu.com [95.215.58.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) 
with ESMTPS id 682401898EF for ; Wed, 18 Sep 2024 10:19:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654752; cv=none; b=JnEH5Gc5oy+wlJDGJlL9GBsUEZnyAjPsLQJY5OVdGCMmvyJLW7kcGBbqu3atf+9T9HpTHe2EkPrKKXNup3399kIGM4SO8Sef1TN8cOpgyTCMhQeS1Cj9YwvKww8vmmR+m4LFZVrUbDudwltZON0aoaT5lotBcKgKREw3lPeKXbc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654752; c=relaxed/simple; bh=IrAGkSmHMVrJ582AiyQPE8dcbMa0pA8DXHVBDgvBJ1E=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=pdUcECn5j0uwDHMciDWHdLfmjk/rT/ZXy/TZc/7jxb2nAB24IHSy+XjHeXHXvSlng2h0qDHaGXgvGBFIUaicA7KEAmrPWdwQVEjNyq6LtDJIwXYihLaeY5+twuHt/rOoHI/ozgs803YnRcekG5VnSck5xnOHCgc71Cjx9cKXqVs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=pAVzJKyC; arc=none smtp.client-ip=95.215.58.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="pAVzJKyC" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1726654746; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=hVl8YEPhue4gkg8taRJYSJaQ8lDCQRI6aDhcH+Bd4mE=; b=pAVzJKyCfa+t4POHqhExzKLMHbqn650DP/XIMmVqits5hhk3LT2y03OsJ8tFXtKUKob5Oq G9lvFaQvySEpqAZFGRow1w31vZIOShZ3Qvr00jnAp1Mw/jIfr4iQOcroHDkEcyV3xBhSur DhIiKVpCsSk0EDgi+eA5WXrhg+tI4oA= From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 5/8] cbd: introduce cbd_cache Date: Wed, 18 Sep 2024 10:18:18 +0000 Message-Id: <20240918101821.681118-6-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT
cbd cache is a lightweight solution that uses persistent memory as a block device cache. It works similarly to bcache, but where bcache uses block devices as cache drives, cbd cache only supports persistent memory devices for caching. It is designed specifically for PMEM scenarios, with a simple design and implementation, aiming to provide a low-latency, high-concurrency, and performance-stable caching solution. Note: cbd cache is not intended to replace bcache. Instead, it offers an alternative suited to scenarios where you want to use persistent memory devices as a block device cache.
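For orientation, the cache introduced here splits the backend's logical address space into 4 MiB ranges (CBD_CACHE_TREE_SIZE), gives each range its own tree of cached extents, and picks the kset for a key from that range index, as get_cache_tree() and get_kset_id() in the patch below show. Here is a small standalone sketch of that mapping; the helper names and the n_ksets value are illustrative only and not part of the patch.

/*
 * Userspace sketch of how a logical offset is indexed: each 4 MiB range
 * of the device maps to one cache tree, and the kset is chosen from the
 * range index modulo the number of ksets. Names are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define CACHE_TREE_SIZE_SHIFT	22			/* 4 MiB per cache tree */
#define CACHE_TREE_SIZE		(1ULL << CACHE_TREE_SIZE_SHIFT)

/* index of the cache tree that serves this offset */
static uint32_t tree_index(uint64_t off)
{
	return (uint32_t)(off >> CACHE_TREE_SIZE_SHIFT);
}

/* kset chosen for a key at this offset */
static uint32_t kset_id(uint64_t off, uint32_t n_ksets)
{
	return tree_index(off) % n_ksets;
}

int main(void)
{
	uint32_t n_ksets = 8;		/* illustrative value */
	uint64_t offs[] = { 0, CACHE_TREE_SIZE - 1, CACHE_TREE_SIZE,
			    10ULL * CACHE_TREE_SIZE + 4096 };
	size_t i;

	for (i = 0; i < sizeof(offs) / sizeof(offs[0]); i++)
		printf("off=%llu -> tree=%u kset=%u\n",
		       (unsigned long long)offs[i],
		       (unsigned)tree_index(offs[i]),
		       (unsigned)kset_id(offs[i], n_ksets));
	return 0;
}

With this mapping, offsets inside the same 4 MiB range always land in the same tree and the same kset, while neighbouring ranges are spread across the available ksets.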
Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_cache.c | 2410 +++++++++++++++++++++++++++++++++ 1 file changed, 2410 insertions(+) create mode 100644 drivers/block/cbd/cbd_cache.c diff --git a/drivers/block/cbd/cbd_cache.c b/drivers/block/cbd/cbd_cache.c new file mode 100644 index 000000000000..17efb6d67429 --- /dev/null +++ b/drivers/block/cbd/cbd_cache.c @@ -0,0 +1,2410 @@ +#include "cbd_internal.h" + +#define CBD_CACHE_PARAL_MAX (128) + +#define CBD_CACHE_TREE_SIZE (4 * 1024 * 1024) +#define CBD_CACHE_TREE_SIZE_MASK 0x3FFFFF +#define CBD_CACHE_TREE_SIZE_SHIFT 22 + +#define CBD_KSET_KEYS_MAX 128 +#define CBD_KSET_ONMEDIA_SIZE_MAX struct_size_t(struct cbd_cache_kset_onmedia, data, CBD_KSET_KEYS_MAX) +#define CBD_KSET_SIZE (sizeof(struct cbd_cache_kset) + sizeof(struct cbd_cache_key_onmedia) * CBD_KSET_KEYS_MAX) + +#define CBD_CACHE_GC_PERCENT_MIN 0 +#define CBD_CACHE_GC_PERCENT_MAX 90 +#define CBD_CACHE_GC_PERCENT_DEFAULT 70 + +#define CBD_CACHE_SEGS_EACH_PARAL 4 + +#define CBD_CLEAN_KEYS_MAX 10 + +#define CACHE_KEY(node) (container_of(node, struct cbd_cache_key, rb_node)) + +static inline struct cbd_cache_tree *get_cache_tree(struct cbd_cache *cache, u64 off) +{ + return &cache->cache_trees[off >> CBD_CACHE_TREE_SIZE_SHIFT]; +} + +static inline void *cache_pos_addr(struct cbd_cache_pos *pos) +{ + return (pos->cache_seg->segment.data + pos->seg_off); +} +static inline struct cbd_cache_kset_onmedia *get_key_head_addr(struct cbd_cache *cache) +{ + return (struct cbd_cache_kset_onmedia *)cache_pos_addr(&cache->key_head); +} + +static inline u32 get_kset_id(struct cbd_cache *cache, u64 off) +{ + return (off >> CBD_CACHE_TREE_SIZE_SHIFT) % cache->n_ksets; +} + +static inline struct cbd_cache_kset *get_kset(struct cbd_cache *cache, u32 kset_id) +{ + return (void *)cache->ksets + CBD_KSET_SIZE * kset_id; +} + +static inline struct cbd_cache_data_head *get_data_head(struct cbd_cache *cache, u32 i) +{ + return &cache->data_heads[i % cache->n_heads]; +} + +static inline bool cache_key_empty(struct cbd_cache_key *key) +{ + return key->flags & CBD_CACHE_KEY_FLAGS_EMPTY; +} + +static inline bool cache_key_clean(struct cbd_cache_key *key) +{ + return key->flags & CBD_CACHE_KEY_FLAGS_CLEAN; +} + +static inline u32 get_backend_id(struct cbd_transport *cbdt, + struct cbd_backend_info *backend_info) +{ + u64 backend_off; + struct cbd_transport_info *transport_info; + + transport_info = cbdt->transport_info; + backend_off = (void *)backend_info - (void *)transport_info; + + return (backend_off - transport_info->backend_area_off) / transport_info->backend_info_size; +} + +static inline bool cache_seg_has_next(struct cbd_cache_segment *cache_seg) +{ + return (cache_seg->cache_seg_info->flags & CBD_CACHE_SEG_FLAGS_HAS_NEXT); +} + +static inline bool cache_seg_wb_done(struct cbd_cache_segment *cache_seg) +{ + return (cache_seg->cache_seg_info->flags & CBD_CACHE_SEG_FLAGS_WB_DONE); +} + +static inline bool cache_seg_gc_done(struct cbd_cache_segment *cache_seg) +{ + return (cache_seg->cache_seg_info->flags & CBD_CACHE_SEG_FLAGS_GC_DONE); +} + +static inline void cache_pos_copy(struct cbd_cache_pos *dst, struct cbd_cache_pos *src) +{ + memcpy(dst, src, sizeof(struct cbd_cache_pos)); +} + +/* cbd_cache_seg_ops */ +static struct cbd_cache_segment *cache_seg_get_next(struct cbd_cache_segment *cache_seg) +{ + struct cbd_cache *cache = cache_seg->cache; + + if (cache_seg->cache_seg_info->flags & CBD_CACHE_SEG_FLAGS_HAS_NEXT) + return &cache->segments[cache_seg->cache_seg_info->next_cache_seg_id]; + + return 
NULL; +} + +static void cbd_cache_seg_sanitize_pos(struct cbd_seg_pos *pos) +{ + struct cbd_segment *segment; + struct cbd_cache_segment *cache_seg; + +again: + segment = pos->segment; + cache_seg = container_of(segment, struct cbd_cache_segment, segment); + if (pos->off >= segment->data_size) { + pos->off -= segment->data_size; + cache_seg = cache_seg_get_next(cache_seg); + BUG_ON(!cache_seg); + pos->segment = &cache_seg->segment; + + goto again; + } +} + +static struct cbd_seg_ops cbd_cache_seg_ops = { + .sanitize_pos = cbd_cache_seg_sanitize_pos +}; + +/* sysfs for cache */ +static ssize_t cache_segs_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_backend *backend; + + backend = container_of(dev, struct cbd_backend, cache_dev); + + return sprintf(buf, "%u\n", backend->cbd_cache->cache_info->n_segs); +} + +static DEVICE_ATTR(cache_segs, 0400, cache_segs_show, NULL); + +static ssize_t used_segs_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_backend *backend; + struct cbd_cache *cache; + + backend = container_of(dev, struct cbd_backend, cache_dev); + cache = backend->cbd_cache; + + return sprintf(buf, "%u\n", cache->cache_info->used_segs); +} + +static DEVICE_ATTR(used_segs, 0400, used_segs_show, NULL); + +static ssize_t gc_percent_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_backend *backend; + + backend = container_of(dev, struct cbd_backend, cache_dev); + + return sprintf(buf, "%u\n", backend->cbd_cache->cache_info->gc_percent); +} + +static ssize_t gc_percent_store(struct device *dev, + struct device_attribute *attr, + const char *buf, + size_t size) +{ + struct cbd_backend *backend; + unsigned long val; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + backend = container_of(dev, struct cbd_backend, cache_dev); + ret = kstrtoul(buf, 10, &val); + if (ret) + return ret; + + if (val < CBD_CACHE_GC_PERCENT_MIN || + val > CBD_CACHE_GC_PERCENT_MAX) + return -EINVAL; + + backend->cbd_cache->cache_info->gc_percent = val; + + return size; +} + +static DEVICE_ATTR(gc_percent, 0600, gc_percent_show, gc_percent_store); + +static struct attribute *cbd_cache_attrs[] = { + &dev_attr_cache_segs.attr, + &dev_attr_used_segs.attr, + &dev_attr_gc_percent.attr, + NULL +}; + +static struct attribute_group cbd_cache_attr_group = { + .attrs = cbd_cache_attrs, +}; + +static const struct attribute_group *cbd_cache_attr_groups[] = { + &cbd_cache_attr_group, + NULL +}; + +static void cbd_cache_release(struct device *dev) +{ +} + +const struct device_type cbd_cache_type = { + .name = "cbd_cache", + .groups = cbd_cache_attr_groups, + .release = cbd_cache_release, +}; + +/* debug functions */ +#ifdef CONFIG_CBD_DEBUG +static void dump_seg_map(struct cbd_cache *cache) +{ + int i; + + cbd_cache_debug(cache, "start seg map dump"); + for (i = 0; i < cache->n_segs; i++) + cbd_cache_debug(cache, "seg: %u, %u", i, test_bit(i, cache->seg_map)); + cbd_cache_debug(cache, "end seg map dump"); +} + +static void dump_cache(struct cbd_cache *cache) +{ + struct cbd_cache_key *key; + struct rb_node *node; + int i; + + cbd_cache_debug(cache, "start cache tree dump"); + + for (i = 0; i < cache->n_trees; i++) { + struct cbd_cache_tree *cache_tree; + + cache_tree = &cache->cache_trees[i]; + node = rb_first(&cache_tree->root); + while (node) { + key = CACHE_KEY(node); + node = rb_next(node); + + if (cache_key_empty(key)) + continue; + + cbd_cache_debug(cache, "key: %p gen: %llu key->off: %llu, len: %u, 
cache: %p segid: %u, seg_off: %u\n", + key, key->seg_gen, key->off, key->len, cache_pos_addr(&key->cache_pos), + key->cache_pos.cache_seg->cache_seg_id, key->cache_pos.seg_off); + } + } + cbd_cache_debug(cache, "end cache tree dump"); +} + +#endif /* CONFIG_CBD_DEBUG */ + +/* cache_segment allocation and reclaim */ +static void cache_seg_init(struct cbd_cache *cache, + u32 seg_id, u32 cache_seg_id) +{ + struct cbd_transport *cbdt = cache->cbdt; + struct cbd_cache_segment *cache_seg = &cache->segments[cache_seg_id]; + struct cbds_init_options seg_options = { 0 }; + struct cbd_segment *segment = &cache_seg->segment; + + seg_options.type = cbds_type_cache; + seg_options.data_off = round_up(sizeof(struct cbd_cache_seg_info), PAGE_SIZE); + seg_options.seg_ops = &cbd_cache_seg_ops; + seg_options.seg_id = seg_id; + + cbd_segment_init(cbdt, segment, &seg_options); + + atomic_set(&cache_seg->refs, 0); + spin_lock_init(&cache_seg->gen_lock); + cache_seg->cache = cache; + cache_seg->cache_seg_id = cache_seg_id; + cache_seg->cache_seg_info = (struct cbd_cache_seg_info *)segment->segment_info; +} + +static void cache_seg_exit(struct cbd_cache_segment *cache_seg) +{ + cbd_segment_exit(&cache_seg->segment); +} + +#define CBD_WAIT_NEW_CACHE_INTERVAL 100 /* usecs */ +#define CBD_WAIT_NEW_CACHE_COUNT 100 + +static struct cbd_cache_segment *get_cache_segment(struct cbd_cache *cache) +{ + struct cbd_cache_segment *cache_seg; + u32 seg_id; + u32 wait_count = 0; + +again: + spin_lock(&cache->seg_map_lock); + seg_id = find_next_zero_bit(cache->seg_map, cache->n_segs, cache->last_cache_seg); + if (seg_id == cache->n_segs) { + spin_unlock(&cache->seg_map_lock); + if (cache->last_cache_seg) { + cache->last_cache_seg = 0; + goto again; + } + + if (++wait_count >= CBD_WAIT_NEW_CACHE_COUNT) + return NULL; + udelay(CBD_WAIT_NEW_CACHE_INTERVAL); + goto again; + } + + set_bit(seg_id, cache->seg_map); + cache->cache_info->used_segs++; + cache->last_cache_seg = seg_id; + spin_unlock(&cache->seg_map_lock); + + cache_seg = &cache->segments[seg_id]; + cache_seg->cache_seg_id = seg_id; + cache_seg->cache_seg_info->flags = 0; + + cbdt_zero_range(cache->cbdt, cache_seg->segment.data, cache_seg->segment.data_size); + + return cache_seg; +} + +static void cache_seg_get(struct cbd_cache_segment *cache_seg) +{ + atomic_inc(&cache_seg->refs); +} + +static void cache_seg_invalidate(struct cbd_cache_segment *cache_seg) +{ + struct cbd_cache *cache; + + cache = cache_seg->cache; + + spin_lock(&cache_seg->gen_lock); + cache_seg->cache_seg_info->gen++; + spin_unlock(&cache_seg->gen_lock); + + spin_lock(&cache->seg_map_lock); + clear_bit(cache_seg->cache_seg_id, cache->seg_map); + cache->cache_info->used_segs--; + spin_unlock(&cache->seg_map_lock); + + queue_work(cache->cache_wq, &cache->clean_work); + cbd_cache_debug(cache, "gc invalidat seg: %u\n", cache_seg->cache_seg_id); + +#ifdef CONFIG_CBD_DEBUG + dump_seg_map(cache); +#endif +} + +static void cache_seg_put(struct cbd_cache_segment *cache_seg) +{ + if (atomic_dec_and_test(&cache_seg->refs)) + cache_seg_invalidate(cache_seg); +} + +static void cache_key_gc(struct cbd_cache *cache, struct cbd_cache_key *key) +{ + struct cbd_cache_segment *cache_seg = key->cache_pos.cache_seg; + + cache_seg_put(cache_seg); +} + +static int cache_data_head_init(struct cbd_cache *cache, u32 i) +{ + struct cbd_cache_segment *next_seg; + struct cbd_cache_data_head *data_head; + + data_head = get_data_head(cache, i); + next_seg = get_cache_segment(cache); + if (!next_seg) + return -EBUSY; + + 
cache_seg_get(next_seg); + data_head->head_pos.cache_seg = next_seg; + data_head->head_pos.seg_off = 0; + + return 0; +} + +static void cache_pos_advance(struct cbd_cache_pos *pos, u32 len, bool set); +static int cache_data_alloc(struct cbd_cache *cache, struct cbd_cache_key *key, u32 head_index) +{ + struct cbd_cache_data_head *data_head; + struct cbd_cache_pos *head_pos; + struct cbd_cache_segment *cache_seg; + struct cbd_segment *segment; + u32 seg_remain; + u32 allocated = 0, to_alloc; + int ret = 0; + + data_head = get_data_head(cache, head_index); + + spin_lock(&data_head->data_head_lock); +again: + if (!data_head->head_pos.cache_seg) { + seg_remain = 0; + } else { + cache_pos_copy(&key->cache_pos, &data_head->head_pos); + key->seg_gen = key->cache_pos.cache_seg->cache_seg_info->gen; + + head_pos = &data_head->head_pos; + cache_seg = head_pos->cache_seg; + segment = &cache_seg->segment; + seg_remain = segment->data_size - head_pos->seg_off; + to_alloc = key->len - allocated; + } + + if (seg_remain > to_alloc) { + cache_pos_advance(head_pos, to_alloc, false); + allocated += to_alloc; + cache_seg_get(cache_seg); + } else if (seg_remain) { + cache_pos_advance(head_pos, seg_remain, false); + key->len = seg_remain; + cache_seg_get(cache_seg); /* get for key */ + + cache_seg_put(head_pos->cache_seg); /* put for head_pos->cache_seg */ + head_pos->cache_seg = NULL; + } else { + ret = cache_data_head_init(cache, head_index); + if (ret) + goto out; + + goto again; + } + +out: + spin_unlock(&data_head->data_head_lock); + + return ret; +} + +static void cache_copy_from_req_bio(struct cbd_cache *cache, struct cbd_cache_key *key, + struct cbd_request *cbd_req, u32 bio_off) +{ + struct cbd_cache_pos *pos = &key->cache_pos; + struct cbd_segment *segment; + + segment = &pos->cache_seg->segment; + + cbds_copy_from_bio(segment, pos->seg_off, key->len, cbd_req->bio, bio_off); +} + +static int cache_copy_to_req_bio(struct cbd_cache *cache, struct cbd_request *cbd_req, + u32 off, u32 len, struct cbd_cache_pos *pos, u64 key_gen) +{ + struct cbd_cache_segment *cache_seg = pos->cache_seg; + struct cbd_segment *segment = &cache_seg->segment; + + spin_lock(&cache_seg->gen_lock); + if (key_gen < cache_seg->cache_seg_info->gen) { + spin_unlock(&cache_seg->gen_lock); + return -EINVAL; + } + + spin_lock(&cbd_req->lock); + cbds_copy_to_bio(segment, pos->seg_off, len, cbd_req->bio, off); + spin_unlock(&cbd_req->lock); + spin_unlock(&cache_seg->gen_lock); + + return 0; +} + +static void cache_copy_from_req_channel(struct cbd_cache *cache, struct cbd_request *cbd_req, + struct cbd_cache_pos *pos, u32 off, u32 len) +{ + struct cbd_seg_pos dst_pos, src_pos; + + src_pos.segment = &cbd_req->cbdq->channel.segment; + src_pos.off = cbd_req->data_off; + + dst_pos.segment = &pos->cache_seg->segment; + dst_pos.off = pos->seg_off; + + if (off) { + cbds_pos_advance(&dst_pos, off); + cbds_pos_advance(&src_pos, off); + } + + cbds_copy_data(&dst_pos, &src_pos, len); +} + + +/* cache_key management, allocation, destroy, cutfront, cutback...*/ +static struct cbd_cache_key *cache_key_alloc(struct cbd_cache *cache) +{ + struct cbd_cache_key *key; + + key = kmem_cache_zalloc(cache->key_cache, GFP_NOWAIT); + if (!key) + return NULL; + + kref_init(&key->ref); + key->cache = cache; + INIT_LIST_HEAD(&key->list_node); + RB_CLEAR_NODE(&key->rb_node); + + return key; +} + +static void cache_key_get(struct cbd_cache_key *key) +{ + kref_get(&key->ref); +} + +static void cache_key_destroy(struct kref *ref) +{ + struct cbd_cache_key *key = 
container_of(ref, struct cbd_cache_key, ref); + struct cbd_cache *cache = key->cache; + + kmem_cache_free(cache->key_cache, key); +} + +static void cache_key_put(struct cbd_cache_key *key) +{ + kref_put(&key->ref, cache_key_destroy); +} + +static inline u64 cache_key_lstart(struct cbd_cache_key *key) +{ + return key->off; +} + +static inline u64 cache_key_lend(struct cbd_cache_key *key) +{ + return key->off + key->len; +} + +static inline void cache_key_copy(struct cbd_cache_key *key_dst, struct cbd_cache_key *key_src) +{ + key_dst->off = key_src->off; + key_dst->len = key_src->len; + key_dst->seg_gen = key_src->seg_gen; + key_dst->cache_tree = key_src->cache_tree; + key_dst->flags = key_src->flags; + + cache_pos_copy(&key_dst->cache_pos, &key_src->cache_pos); +} + +static void cache_pos_advance(struct cbd_cache_pos *pos, u32 len, bool set) +{ + struct cbd_cache_segment *cache_seg; + struct cbd_segment *segment; + u32 seg_remain, to_advance; + u32 advanced = 0; + +again: + cache_seg = pos->cache_seg; + BUG_ON(!cache_seg); + segment = &cache_seg->segment; + seg_remain = segment->data_size - pos->seg_off; + to_advance = len - advanced; + + if (seg_remain >= to_advance) { + pos->seg_off += to_advance; + advanced += to_advance; + } else if (seg_remain) { + pos->seg_off += seg_remain; + advanced += seg_remain; + } else { + pos->cache_seg = cache_seg_get_next(pos->cache_seg); + BUG_ON(!pos->cache_seg); + pos->seg_off = 0; + if (set) { + struct cbd_cache *cache = pos->cache_seg->cache; + + set_bit(pos->cache_seg->cache_seg_id, pos->cache_seg->cache->seg_map); + cbd_cache_debug(cache, "set seg in advance %u\n", pos->cache_seg->cache_seg_id); +#ifdef CONFIG_CBD_DEBUG + dump_seg_map(cache); +#endif + } + } + + if (advanced < len) + goto again; +} + +static inline void cache_key_cutfront(struct cbd_cache_key *key, u32 cut_len) +{ + if (key->cache_pos.cache_seg) + cache_pos_advance(&key->cache_pos, cut_len, false); + + key->off += cut_len; + key->len -= cut_len; +} + +static inline void cache_key_cutback(struct cbd_cache_key *key, u32 cut_len) +{ + key->len -= cut_len; +} + +static inline void cache_key_delete(struct cbd_cache_key *key) +{ + struct cbd_cache_tree *cache_tree; + + cache_tree = key->cache_tree; + if (!cache_tree) + return; + + rb_erase(&key->rb_node, &cache_tree->root); + key->flags = 0; + cache_key_put(key); +} + +static inline u32 cache_key_data_crc(struct cbd_cache_key *key) +{ + void *data; + + data = cache_pos_addr(&key->cache_pos); + + return crc32(0, data, key->len); +} + +static void cache_key_encode(struct cbd_cache_key_onmedia *key_onmedia, + struct cbd_cache_key *key) +{ + key_onmedia->off = key->off; + key_onmedia->len = key->len; + + key_onmedia->cache_seg_id = key->cache_pos.cache_seg->cache_seg_id; + key_onmedia->cache_seg_off = key->cache_pos.seg_off; + + key_onmedia->seg_gen = key->seg_gen; + key_onmedia->flags = key->flags; + +#ifdef CONFIG_CBD_CRC + key_onmedia->data_crc = key->data_crc; +#endif +} + +static void cache_key_decode(struct cbd_cache_key_onmedia *key_onmedia, struct cbd_cache_key *key) +{ + struct cbd_cache *cache = key->cache; + + key->off = key_onmedia->off; + key->len = key_onmedia->len; + + key->cache_pos.cache_seg = &cache->segments[key_onmedia->cache_seg_id]; + key->cache_pos.seg_off = key_onmedia->cache_seg_off; + + key->seg_gen = key_onmedia->seg_gen; + key->flags = key_onmedia->flags; + +#ifdef CONFIG_CBD_CRC + key->data_crc = key_onmedia->data_crc; +#endif +} + +/* cache_kset */ +static inline u32 cache_kset_crc(struct cbd_cache_kset_onmedia 
*kset_onmedia) +{ + return crc32(0, (void *)kset_onmedia + 4, struct_size(kset_onmedia, data, kset_onmedia->key_num) - 4); +} + +static inline u32 get_kset_onmedia_size(struct cbd_cache_kset_onmedia *kset_onmedia) +{ + return struct_size_t(struct cbd_cache_kset_onmedia, data, kset_onmedia->key_num); +} + +static inline u32 get_seg_remain(struct cbd_cache_pos *pos) +{ + struct cbd_cache_segment *cache_seg; + struct cbd_segment *segment; + u32 seg_remain; + + cache_seg = pos->cache_seg; + segment = &cache_seg->segment; + seg_remain = segment->data_size - pos->seg_off; + + return seg_remain; +} + +static int cache_kset_close(struct cbd_cache *cache, struct cbd_cache_kset *kset) +{ + struct cbd_cache_kset_onmedia *kset_onmedia; + u32 kset_onmedia_size; + int ret; + + kset_onmedia = &kset->kset_onmedia; + + if (!kset_onmedia->key_num) + return 0; + + kset_onmedia_size = struct_size(kset_onmedia, data, kset_onmedia->key_num); + + spin_lock(&cache->key_head_lock); +again: + if (get_seg_remain(&cache->key_head) < CBD_KSET_ONMEDIA_SIZE_MAX) { + struct cbd_cache_segment *cur_seg, *next_seg; + + next_seg = get_cache_segment(cache); + if (!next_seg) { + ret = -EBUSY; + goto out; + } + + cur_seg = cache->key_head.cache_seg; + + cur_seg->cache_seg_info->next_cache_seg_id = next_seg->cache_seg_id; + cur_seg->cache_seg_info->flags |= CBD_CACHE_SEG_FLAGS_HAS_NEXT; + + cache->key_head.cache_seg = next_seg; + cache->key_head.seg_off = 0; + goto again; + } + + if (get_seg_remain(&cache->key_head) - kset_onmedia_size < CBD_KSET_ONMEDIA_SIZE_MAX) + kset_onmedia->flags |= CBD_KSET_FLAGS_LAST; + + kset_onmedia->magic = CBD_KSET_MAGIC; + kset_onmedia->crc = cache_kset_crc(kset_onmedia); + + memcpy(get_key_head_addr(cache), kset_onmedia, kset_onmedia_size); + memset(kset_onmedia, 0, sizeof(struct cbd_cache_kset_onmedia)); + + cache_pos_advance(&cache->key_head, kset_onmedia_size, false); + + ret = 0; +out: + spin_unlock(&cache->key_head_lock); + + return ret; +} + +/* append a cache_key into related kset, if this kset full, close this kset, + * else queue a flush_work to do kset writting. 
+ */ +static int cache_key_append(struct cbd_cache *cache, struct cbd_cache_key *key) +{ + struct cbd_cache_kset *kset; + struct cbd_cache_kset_onmedia *kset_onmedia; + struct cbd_cache_key_onmedia *key_onmedia; + u32 kset_id = get_kset_id(cache, key->off); + int ret = 0; + + kset = get_kset(cache, kset_id); + kset_onmedia = &kset->kset_onmedia; + + spin_lock(&kset->kset_lock); + key_onmedia = &kset_onmedia->data[kset_onmedia->key_num]; +#ifdef CONFIG_CBD_CRC + key->data_crc = cache_key_data_crc(key); +#endif + cache_key_encode(key_onmedia, key); + if (++kset_onmedia->key_num == CBD_KSET_KEYS_MAX) { + ret = cache_kset_close(cache, kset); + if (ret) { + /* return ocuppied key back */ + kset_onmedia->key_num--; + goto out; + } + } else { + queue_delayed_work(cache->cache_wq, &kset->flush_work, 1 * HZ); + } +out: + spin_unlock(&kset->kset_lock); + + return ret; +} + +/* cache_tree walk */ +struct cbd_cache_tree_walk_ctx { + struct cbd_cache *cache; + struct rb_node *start_node; + struct cbd_request *cbd_req; + u32 req_done; + struct cbd_cache_key *key; + + struct list_head *delete_key_list; + struct list_head *submit_req_list; + + /* + * |--------| key_tmp + * |====| key + */ + int (*before)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + /* + * |----------| key_tmp + * |=====| key + */ + int (*after)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + /* + * |----------------| key_tmp + * |===========| key + */ + int (*overlap_tail)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + /* + * |--------| key_tmp + * |==========| key + */ + int (*overlap_head)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + /* + * |----| key_tmp + * |==========| key + */ + int (*overlap_contain)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + /* + * |-----------| key_tmp + * |====| key + */ + int (*overlap_contained)(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx); + + int (*walk_finally)(struct cbd_cache_tree_walk_ctx *ctx); + bool (*walk_done)(struct cbd_cache_tree_walk_ctx *ctx); +}; + +static int cache_tree_walk(struct cbd_cache *cache, struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_cache_key *key_tmp, *key; + struct rb_node *node_tmp; + int ret; + + key = ctx->key; + node_tmp = ctx->start_node; + + while (node_tmp) { + if (ctx->walk_done && ctx->walk_done(ctx)) + break; + + key_tmp = CACHE_KEY(node_tmp); + /* + * |----------| + * |=====| + */ + if (cache_key_lend(key_tmp) <= cache_key_lstart(key)) { + if (ctx->after) { + ret = ctx->after(key, key_tmp, ctx); + if (ret) + goto out; + } + goto next; + } + + /* + * |--------| + * |====| + */ + if (cache_key_lstart(key_tmp) >= cache_key_lend(key)) { + if (ctx->before) { + ret = ctx->before(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* overlap */ + if (cache_key_lstart(key_tmp) >= cache_key_lstart(key)) { + /* + * |----------------| key_tmp + * |===========| key + */ + if (cache_key_lend(key_tmp) >= cache_key_lend(key)) { + if (ctx->overlap_tail) { + ret = ctx->overlap_tail(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* + * |----| key_tmp + * |==========| key + */ + if (ctx->overlap_contain) { + ret = ctx->overlap_contain(key, key_tmp, ctx); + if (ret) + goto out; + } + + goto next; + } + + /* + * |-----------| 
key_tmp + * |====| key + */ + if (cache_key_lend(key_tmp) > cache_key_lend(key)) { + if (ctx->overlap_contained) { + ret = ctx->overlap_contained(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* + * |--------| key_tmp + * |==========| key + */ + if (ctx->overlap_head) { + ret = ctx->overlap_head(key, key_tmp, ctx); + if (ret) + goto out; + } +next: + node_tmp = rb_next(node_tmp); + } + + if (ctx->walk_finally) { + ret = ctx->walk_finally(ctx); + if (ret) + goto out; + } + + return 0; +out: + return ret; +} + +/* cache_tree_search, search in a cache_tree */ +static inline bool cache_key_invalid(struct cbd_cache_key *key) +{ + if (cache_key_empty(key)) + return false; + + return (key->seg_gen < key->cache_pos.cache_seg->cache_seg_info->gen); +} + +static struct rb_node *cache_tree_search(struct cbd_cache_tree *cache_tree, struct cbd_cache_key *key, + struct rb_node **parentp, struct rb_node ***newp, + struct list_head *delete_key_list) +{ + struct rb_node **new, *parent = NULL; + struct cbd_cache_key *key_tmp; + struct rb_node *prev_node = NULL; + + new = &(cache_tree->root.rb_node); + while (*new) { + key_tmp = container_of(*new, struct cbd_cache_key, rb_node); + if (cache_key_invalid(key_tmp)) + list_add(&key_tmp->list_node, delete_key_list); + + parent = *new; + if (key_tmp->off >= key->off) { + new = &((*new)->rb_left); + } else { + prev_node = *new; + new = &((*new)->rb_right); + } + } + + if (!prev_node) + prev_node = rb_first(&cache_tree->root); + + if (parentp) + *parentp = parent; + + if (newp) + *newp = new; + + return prev_node; +} + +/* cache insert fixup, which will walk the cache_tree and do some fixup for key insert + * if the new key has overlap with existing keys in cache_tree + */ +static int fixup_overlap_tail(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + int ret; + + /* + * |----------------| key_tmp + * |===========| key + */ + cache_key_cutfront(key_tmp, cache_key_lend(key) - cache_key_lstart(key_tmp)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + ret = -EAGAIN; + goto out; + } + + return 0; +out: + return ret; +} + +static int fixup_overlap_contain(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + /* + * |----| key_tmp + * |==========| key + */ + cache_key_delete(key_tmp); + + return -EAGAIN; +} + +static int cache_insert_key(struct cbd_cache *cache, struct cbd_cache_key *key, bool new); +static int fixup_overlap_contained(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_cache *cache = ctx->cache; + int ret; + + /* + * |-----------| key_tmp + * |====| key + */ + if (cache_key_empty(key_tmp)) { + /* if key_tmp is empty, dont split key_tmp */ + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + ret = -EAGAIN; + goto out; + } + } else { + struct cbd_cache_key *key_fixup; + bool need_research = false; + + key_fixup = cache_key_alloc(cache); + if (!key_fixup) { + ret = -ENOMEM; + goto out; + } + + cache_key_copy(key_fixup, key_tmp); + + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + need_research = true; + } + + cache_key_cutfront(key_fixup, cache_key_lend(key) - cache_key_lstart(key_tmp)); + if (key_fixup->len == 0) { + cache_key_put(key_fixup); + } else { + ret = cache_insert_key(cache, key_fixup, 
false); + if (ret) + goto out; + need_research = true; + } + + if (need_research) { + ret = -EAGAIN; + goto out; + } + } + + return 0; +out: + return ret; +} + +static int fixup_overlap_head(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + /* + * |--------| key_tmp + * |==========| key + */ + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + return -EAGAIN; + } + + return 0; +} + +static int cache_insert_fixup(struct cbd_cache *cache, struct cbd_cache_key *key, struct rb_node *prev_node) +{ + struct cbd_cache_tree_walk_ctx walk_ctx = { 0 }; + + walk_ctx.cache = cache; + walk_ctx.start_node = prev_node; + walk_ctx.key = key; + + walk_ctx.overlap_tail = fixup_overlap_tail; + walk_ctx.overlap_head = fixup_overlap_head; + walk_ctx.overlap_contain = fixup_overlap_contain; + walk_ctx.overlap_contained = fixup_overlap_contained; + + return cache_tree_walk(cache, &walk_ctx); +} +static int cache_insert_key(struct cbd_cache *cache, struct cbd_cache_key *key, bool new_key) +{ + struct rb_node **new, *parent = NULL; + struct cbd_cache_tree *cache_tree; + struct cbd_cache_key *key_tmp = NULL, *key_next; + struct rb_node *prev_node = NULL; + LIST_HEAD(delete_key_list); + int ret; + + cache_tree = get_cache_tree(cache, key->off); + + if (new_key) + key->cache_tree = cache_tree; + +search: + prev_node = cache_tree_search(cache_tree, key, &parent, &new, &delete_key_list); + + if (!list_empty(&delete_key_list)) { + list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) { + list_del_init(&key_tmp->list_node); + cache_key_delete(key_tmp); + } + goto search; + } + + if (new_key) { + ret = cache_insert_fixup(cache, key, prev_node); + if (ret == -EAGAIN) + goto search; + if (ret) + goto out; + } + + rb_link_node(&key->rb_node, parent, new); + rb_insert_color(&key->rb_node, &cache_tree->root); + + return 0; +out: + return ret; +} + +/* cache miss: when a read miss happen on cbd_cache, it will submit a backing request + * to read from backend, and when this backing request done, it will copy the data + * read from backend into cache, then next user can read the same data immediately from + * cache + */ +static void miss_read_end_req(struct cbd_cache *cache, struct cbd_request *cbd_req) +{ + void *priv_data = cbd_req->priv_data; + int ret; + + if (priv_data) { + struct cbd_cache_key *key; + struct cbd_cache_tree *cache_tree; + + key = (struct cbd_cache_key *)priv_data; + cache_tree = key->cache_tree; + + spin_lock(&cache_tree->tree_lock); + if (key->flags & CBD_CACHE_KEY_FLAGS_EMPTY) { + if (cbd_req->ret) { + cache_key_delete(key); + goto unlock; + } + + ret = cache_data_alloc(cache, key, cbd_req->cbdq->index); + if (ret) { + cache_key_delete(key); + goto unlock; + } + cache_copy_from_req_channel(cache, cbd_req, &key->cache_pos, + key->off - cbd_req->off, key->len); + key->flags &= ~CBD_CACHE_KEY_FLAGS_EMPTY; + key->flags |= CBD_CACHE_KEY_FLAGS_CLEAN; + + ret = cache_key_append(cache, key); + if (ret) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_delete(key); + goto unlock; + } + } +unlock: + spin_unlock(&cache_tree->tree_lock); + cache_key_put(key); + } + + cbd_queue_advance(cbd_req->cbdq, cbd_req); + kmem_cache_free(cache->req_cache, cbd_req); +} + +static void miss_read_end_work_fn(struct work_struct *work) +{ + struct cbd_cache *cache = container_of(work, struct cbd_cache, miss_read_end_work); + struct cbd_request *cbd_req; + LIST_HEAD(tmp_list); 
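+ /* Splice the pending miss-read completions onto a private list under the lock, then finish them below without holding miss_read_reqs_lock. */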
+ + spin_lock(&cache->miss_read_reqs_lock); + list_splice_init(&cache->miss_read_reqs, &tmp_list); + spin_unlock(&cache->miss_read_reqs_lock); + + while (!list_empty(&tmp_list)) { + cbd_req = list_first_entry(&tmp_list, + struct cbd_request, inflight_reqs_node); + list_del_init(&cbd_req->inflight_reqs_node); + miss_read_end_req(cache, cbd_req); + } +} + +static void cache_backing_req_end_req(struct cbd_request *cbd_req, void *priv_data) +{ + struct cbd_cache *cache = cbd_req->cbdq->cbd_blkdev->cbd_cache; + + spin_lock(&cache->miss_read_reqs_lock); + list_add_tail(&cbd_req->inflight_reqs_node, &cache->miss_read_reqs); + spin_unlock(&cache->miss_read_reqs_lock); + + queue_work(cache->cache_wq, &cache->miss_read_end_work); +} + +static void submit_backing_req(struct cbd_cache *cache, struct cbd_request *cbd_req) +{ + int ret; + + if (cbd_req->priv_data) { + struct cbd_cache_key *key; + + key = (struct cbd_cache_key *)cbd_req->priv_data; + ret = cache_insert_key(cache, key, true); + if (ret) { + cache_key_put(key); + cbd_req->priv_data = NULL; + goto out; + } + } + + ret = cbd_queue_req_to_backend(cbd_req); +out: + cbd_req_put(cbd_req, ret); +} + +static struct cbd_request *create_backing_req(struct cbd_cache *cache, struct cbd_request *parent, + u32 off, u32 len, bool insert_key) +{ + struct cbd_request *new_req; + struct cbd_cache_key *key = NULL; + int ret; + + if (insert_key) { + key = cache_key_alloc(cache); + if (!key) { + ret = -ENOMEM; + goto out; + } + + key->off = parent->off + off; + key->len = len; + key->flags |= CBD_CACHE_KEY_FLAGS_EMPTY; + } + + new_req = kmem_cache_zalloc(cache->req_cache, GFP_NOWAIT); + if (!new_req) { + ret = -ENOMEM; + goto delete_key; + } + + INIT_LIST_HEAD(&new_req->inflight_reqs_node); + kref_init(&new_req->ref); + spin_lock_init(&new_req->lock); + + new_req->cbdq = parent->cbdq; + new_req->bio = parent->bio; + new_req->off = parent->off + off; + new_req->op = parent->op; + new_req->bio_off = off; + new_req->data_len = len; + new_req->req = NULL; + + cbd_req_get(parent); + new_req->parent = parent; + + if (key) { + cache_key_get(key); + new_req->priv_data = key; + } + + new_req->end_req = cache_backing_req_end_req; + + return new_req; + +delete_key: + if (key) + cache_key_delete(key); +out: + return NULL; +} + +static int send_backing_req(struct cbd_cache *cache, struct cbd_request *cbd_req, + u32 off, u32 len, bool insert_key) +{ + struct cbd_request *new_req; + + new_req = create_backing_req(cache, cbd_req, off, len, insert_key); + if (!new_req) + return -ENOMEM; + + submit_backing_req(cache, new_req); + + return 0; +} + +/* cache tree walk for cache_read */ +static int read_before(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_request *backing_req; + int ret; + + backing_req = create_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, key->len, true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + list_add(&backing_req->inflight_reqs_node, ctx->submit_req_list); + + ctx->req_done += key->len; + cache_key_cutfront(key, key->len); + + return 0; +out: + return ret; +} + +static int read_overlap_tail(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_request *backing_req; + u32 io_len; + int ret; + + /* + * |----------------| key_tmp + * |===========| key + */ + io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key); + if (io_len) { + backing_req = create_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, io_len, 
true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + list_add(&backing_req->inflight_reqs_node, ctx->submit_req_list); + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + } + + io_len = cache_key_lend(key) - cache_key_lstart(key_tmp); + if (cache_key_empty(key_tmp)) { + ret = send_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + ret = cache_copy_to_req_bio(ctx->cache, ctx->cbd_req, ctx->req_done, + io_len, &key_tmp->cache_pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; + +out: + return ret; +} + +static int read_overlap_contain(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_request *backing_req; + u32 io_len; + int ret; + /* + * |----| key_tmp + * |==========| key + */ + io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key); + if (io_len) { + backing_req = create_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, io_len, true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + list_add(&backing_req->inflight_reqs_node, ctx->submit_req_list); + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + } + + io_len = key_tmp->len; + if (cache_key_empty(key_tmp)) { + ret = send_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + ret = cache_copy_to_req_bio(ctx->cache, ctx->cbd_req, ctx->req_done, + io_len, &key_tmp->cache_pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; +out: + return ret; +} + +static int read_overlap_contained(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_cache_pos pos; + int ret; + + /* + * |-----------| key_tmp + * |====| key + */ + if (cache_key_empty(key_tmp)) { + ret = send_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, key->len, false); + if (ret) + goto out; + } else { + cache_pos_copy(&pos, &key_tmp->cache_pos); + cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp), false); + + ret = cache_copy_to_req_bio(ctx->cache, ctx->cbd_req, ctx->req_done, + key->len, &pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + ctx->req_done += key->len; + cache_key_cutfront(key, key->len); + + return 0; +out: + return ret; +} + +static int read_overlap_head(struct cbd_cache_key *key, struct cbd_cache_key *key_tmp, + struct cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_cache_pos pos; + u32 io_len; + int ret; + /* + * |--------| key_tmp + * |==========| key + */ + io_len = cache_key_lend(key_tmp) - cache_key_lstart(key); + + if (cache_key_empty(key_tmp)) { + ret = send_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + cache_pos_copy(&pos, &key_tmp->cache_pos); + cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp), false); + + ret = cache_copy_to_req_bio(ctx->cache, ctx->cbd_req, ctx->req_done, + io_len, &pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; +out: + return ret; +} + +static int read_walk_finally(struct 
cbd_cache_tree_walk_ctx *ctx) +{ + struct cbd_request *backing_req, *next_req; + struct cbd_cache_key *key = ctx->key; + int ret; + + if (key->len) { + ret = send_backing_req(ctx->cache, ctx->cbd_req, ctx->req_done, key->len, true); + if (ret) + goto out; + ctx->req_done += key->len; + } + + list_for_each_entry_safe(backing_req, next_req, ctx->submit_req_list, inflight_reqs_node) { + list_del_init(&backing_req->inflight_reqs_node); + submit_backing_req(ctx->cache, backing_req); + } + + return 0; +out: + return ret; +} +static bool read_walk_done(struct cbd_cache_tree_walk_ctx *ctx) +{ + return (ctx->req_done >= ctx->cbd_req->data_len); +} + +static int cache_read(struct cbd_cache *cache, struct cbd_request *cbd_req) +{ + struct cbd_cache_key key_data = { .off = cbd_req->off, .len = cbd_req->data_len }; + struct cbd_cache_tree *cache_tree; + struct cbd_cache_key *key_tmp = NULL, *key_next; + struct rb_node *prev_node = NULL; + struct cbd_cache_key *key = &key_data; + struct cbd_cache_tree_walk_ctx walk_ctx = { 0 }; + LIST_HEAD(delete_key_list); + LIST_HEAD(submit_req_list); + int ret; + + walk_ctx.cache = cache; + walk_ctx.req_done = 0; + walk_ctx.cbd_req = cbd_req; + walk_ctx.before = read_before; + walk_ctx.overlap_tail = read_overlap_tail; + walk_ctx.overlap_head = read_overlap_head; + walk_ctx.overlap_contain = read_overlap_contain; + walk_ctx.overlap_contained = read_overlap_contained; + walk_ctx.walk_finally = read_walk_finally; + walk_ctx.walk_done = read_walk_done; + walk_ctx.delete_key_list = &delete_key_list; + walk_ctx.submit_req_list = &submit_req_list; + +next_tree: + key->off = cbd_req->off + walk_ctx.req_done; + key->len = cbd_req->data_len - walk_ctx.req_done; + if (key->len > CBD_CACHE_TREE_SIZE - (key->off & CBD_CACHE_TREE_SIZE_MASK)) + key->len = CBD_CACHE_TREE_SIZE - (key->off & CBD_CACHE_TREE_SIZE_MASK); + cache_tree = get_cache_tree(cache, key->off); + spin_lock(&cache_tree->tree_lock); +search: + prev_node = cache_tree_search(cache_tree, key, NULL, NULL, &delete_key_list); + +cleanup_tree: + if (!list_empty(&delete_key_list)) { + list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) { + list_del_init(&key_tmp->list_node); + cache_key_delete(key_tmp); + } + goto search; + } + + walk_ctx.start_node = prev_node; + walk_ctx.key = key; + + ret = cache_tree_walk(cache, &walk_ctx); + if (ret == -EINVAL) + goto cleanup_tree; + else if (ret) + goto out; + + spin_unlock(&cache_tree->tree_lock); + + if (walk_ctx.req_done < cbd_req->data_len) + goto next_tree; + + return 0; +out: + spin_unlock(&cache_tree->tree_lock); + + return ret; +} + +/* cache write */ +static int cache_write(struct cbd_cache *cache, struct cbd_request *cbd_req) +{ + struct cbd_cache_tree *cache_tree; + struct cbd_cache_key *key; + u64 offset = cbd_req->off; + u32 length = cbd_req->data_len; + u32 io_done = 0; + int ret; + + while (true) { + if (io_done >= length) + break; + + key = cache_key_alloc(cache); + if (!key) { + ret = -ENOMEM; + goto err; + } + + key->off = offset + io_done; + key->len = length - io_done; + if (key->len > CBD_CACHE_TREE_SIZE - (key->off & CBD_CACHE_TREE_SIZE_MASK)) + key->len = CBD_CACHE_TREE_SIZE - (key->off & CBD_CACHE_TREE_SIZE_MASK); + + ret = cache_data_alloc(cache, key, cbd_req->cbdq->index); + if (ret) { + cache_key_put(key); + goto err; + } + + if (!key->len) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_put(key); + continue; + } + + BUG_ON(!key->cache_pos.cache_seg); + cache_copy_from_req_bio(cache, key, cbd_req, io_done); + + cache_tree = 
get_cache_tree(cache, key->off); + spin_lock(&cache_tree->tree_lock); + ret = cache_insert_key(cache, key, true); + if (ret) { + cache_key_put(key); + goto put_seg; + } + + ret = cache_key_append(cache, key); + if (ret) { + cache_key_delete(key); + goto put_seg; + } + + io_done += key->len; + spin_unlock(&cache_tree->tree_lock); + } + + return 0; +put_seg: + cache_seg_put(key->cache_pos.cache_seg); + spin_unlock(&cache_tree->tree_lock); +err: + return ret; +} + +/* cache flush */ +static int cache_flush(struct cbd_cache *cache) +{ + struct cbd_cache_kset *kset; + u32 i, ret; + + for (i = 0; i < cache->n_ksets; i++) { + kset = get_kset(cache, i); + + spin_lock(&kset->kset_lock); + ret = cache_kset_close(cache, kset); + spin_unlock(&kset->kset_lock); + if (ret) + return ret; + } + + return 0; +} + +/* This function is the cache request entry */ +int cbd_cache_handle_req(struct cbd_cache *cache, struct cbd_request *cbd_req) +{ + switch (cbd_req->op) { + case CBD_OP_FLUSH: + return cache_flush(cache); + case CBD_OP_WRITE: + return cache_write(cache, cbd_req); + case CBD_OP_READ: + return cache_read(cache, cbd_req); + default: + return -EIO; + } + + return 0; +} + +/* cache replay */ +static void cache_pos_encode(struct cbd_cache *cache, + struct cbd_cache_pos_onmedia *pos_onmedia, + struct cbd_cache_pos *pos) +{ + pos_onmedia->cache_seg_id = pos->cache_seg->cache_seg_id; + pos_onmedia->seg_off = pos->seg_off; +} + +static void cache_pos_decode(struct cbd_cache *cache, + struct cbd_cache_pos_onmedia *pos_onmedia, + struct cbd_cache_pos *pos) +{ + pos->cache_seg = &cache->segments[pos_onmedia->cache_seg_id]; + pos->seg_off = pos_onmedia->seg_off; +} + +static int cache_replay(struct cbd_cache *cache) +{ + struct cbd_cache_pos pos_tail; + struct cbd_cache_pos *pos; + struct cbd_cache_kset_onmedia *kset_onmedia; + struct cbd_cache_key_onmedia *key_onmedia; + struct cbd_cache_key *key = NULL; + int ret = 0; + void *addr; + int i; + + cache_pos_copy(&pos_tail, &cache->key_tail); + pos = &pos_tail; + + set_bit(pos->cache_seg->cache_seg_id, cache->seg_map); + + while (true) { + addr = cache_pos_addr(pos); + + kset_onmedia = (struct cbd_cache_kset_onmedia *)addr; + if (kset_onmedia->magic != CBD_KSET_MAGIC || + kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + break; + } + + for (i = 0; i < kset_onmedia->key_num; i++) { + key_onmedia = &kset_onmedia->data[i]; + + key = cache_key_alloc(cache); + if (!key) { + ret = -ENOMEM; + goto out; + } + + cache_key_decode(key_onmedia, key); +#ifdef CONFIG_CBD_CRC + if (key->data_crc != cache_key_data_crc(key)) { + cbd_cache_debug(cache, "key: %llu:%u seg %u:%u data_crc error: %x, expected: %x\n", + key->off, key->len, key->cache_pos.cache_seg->cache_seg_id, + key->cache_pos.seg_off, cache_key_data_crc(key), key->data_crc); + ret = -EIO; + cache_key_put(key); + goto out; + } +#endif + set_bit(key->cache_pos.cache_seg->cache_seg_id, cache->seg_map); + + if (key->seg_gen < key->cache_pos.cache_seg->cache_seg_info->gen) { + cache_key_put(key); + } else { + ret = cache_insert_key(cache, key, true); + if (ret) { + cache_key_put(key); + goto out; + } + } + + cache_seg_get(key->cache_pos.cache_seg); + } + + cache_pos_advance(pos, get_kset_onmedia_size(kset_onmedia), false); + + if (kset_onmedia->flags & CBD_KSET_FLAGS_LAST) { + struct cbd_cache_segment *cur_seg, *next_seg; + + cur_seg = pos->cache_seg; + next_seg = cache_seg_get_next(cur_seg); + if (!next_seg) + break; + pos->cache_seg = next_seg; + pos->seg_off = 0; + set_bit(pos->cache_seg->cache_seg_id, 
cache->seg_map); + continue; + } + } + +#ifdef CONFIG_CBD_DEBUG + dump_cache(cache); +#endif + + spin_lock(&cache->key_head_lock); + cache_pos_copy(&cache->key_head, pos); + cache->cache_info->used_segs++; + spin_unlock(&cache->key_head_lock); + +out: + return ret; +} + +/* Writeback */ +static bool no_more_dirty(struct cbd_cache *cache) +{ + struct cbd_cache_kset_onmedia *kset_onmedia; + struct cbd_cache_pos *pos; + void *addr; + + pos = &cache->dirty_tail; + + if (cache_seg_wb_done(pos->cache_seg)) { + cbd_cache_debug(cache, "seg %u wb done\n", pos->cache_seg->cache_seg_id); + return !cache_seg_has_next(pos->cache_seg); + } + + addr = cache_pos_addr(pos); + kset_onmedia = (struct cbd_cache_kset_onmedia *)addr; + if (kset_onmedia->magic != CBD_KSET_MAGIC) { + cbd_cache_debug(cache, "dirty_tail: %u:%u magic: %llx, not expected: %llx\n", + pos->cache_seg->cache_seg_id, pos->seg_off, + kset_onmedia->magic, CBD_KSET_MAGIC); + return true; + } + + if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + cbd_cache_debug(cache, "dirty_tail: %u:%u crc: %x, not expected: %x\n", + pos->cache_seg->cache_seg_id, pos->seg_off, + cache_kset_crc(kset_onmedia), kset_onmedia->crc); + return true; + } + + return false; +} + +static void cache_writeback_exit(struct cbd_cache *cache) +{ + if (!cache->bioset) + return; + + cache_flush(cache); + + while (!no_more_dirty(cache)) + msleep(100); + + cancel_delayed_work_sync(&cache->writeback_work); + bioset_exit(cache->bioset); + kfree(cache->bioset); +} + +static int cache_writeback_init(struct cbd_cache *cache) +{ + int ret; + + cache->bioset = kzalloc(sizeof(*cache->bioset), GFP_KERNEL); + if (!cache->bioset) { + ret = -ENOMEM; + goto err; + } + + ret = bioset_init(cache->bioset, 256, 0, BIOSET_NEED_BVECS); + if (ret) { + kfree(cache->bioset); + cache->bioset = NULL; + goto err; + } + + queue_delayed_work(cache->cache_wq, &cache->writeback_work, 0); + + return 0; + +err: + return ret; +} + +static int cache_key_writeback(struct cbd_cache *cache, struct cbd_cache_key *key) +{ + struct cbd_cache_pos *pos; + void *addr; + ssize_t written; + struct cbd_cache_segment *cache_seg; + struct cbd_segment *segment; + u32 seg_remain; + u64 off; + + if (cache_key_clean(key)) + return 0; + + pos = &key->cache_pos; + + cache_seg = pos->cache_seg; + BUG_ON(!cache_seg); + + segment = &cache_seg->segment; + seg_remain = segment->data_size - pos->seg_off; + /* all data in one key should be int the same segment */ + BUG_ON(seg_remain < key->len); + + addr = cache_pos_addr(pos); + off = key->off; + + /* Here write is in sync way, because it should consider + * the sequence of overwrites. E.g, K1 writes A at 0-4K, + * K2 after K1 writes B to 0-4K, we have to ensure K1 + * to be written back before K2. 
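+ * Keys are written back one by one, in the order they were appended to the kset, so issuing each write synchronously preserves that ordering.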
+ */ + written = kernel_write(cache->bdev_file, addr, key->len, &off); + if (written != key->len) + return -EIO; + + return 0; +} + +static void writeback_fn(struct work_struct *work) +{ + struct cbd_cache *cache = container_of(work, struct cbd_cache, writeback_work.work); + struct cbd_cache_pos *pos; + struct cbd_cache_kset_onmedia *kset_onmedia; + struct cbd_cache_key_onmedia *key_onmedia; + struct cbd_cache_key *key = NULL; + int ret = 0; + void *addr; + int i; + + while (true) { + if (no_more_dirty(cache)) { + queue_delayed_work(cache->cache_wq, &cache->writeback_work, 1 * HZ); + return; + } + + pos = &cache->dirty_tail; + if (cache_seg_wb_done(pos->cache_seg)) + goto next_seg; + + addr = cache_pos_addr(pos); + kset_onmedia = (struct cbd_cache_kset_onmedia *)addr; +#ifdef CONFIG_CBD_CRC + /* check the data crc */ + for (i = 0; i < kset_onmedia->key_num; i++) { + struct cbd_cache_key key_tmp = { 0 }; + + key = &key_tmp; + + kref_init(&key->ref); + key->cache = cache; + INIT_LIST_HEAD(&key->list_node); + + key_onmedia = &kset_onmedia->data[i]; + + cache_key_decode(key_onmedia, key); + if (key->data_crc != cache_key_data_crc(key)) { + cbd_cache_debug(cache, "key: %llu:%u data crc(%x) is not expected(%x), wait for data ready.\n", + key->off, key->len, cache_key_data_crc(key), key->data_crc); + queue_delayed_work(cache->cache_wq, &cache->writeback_work, 1 * HZ); + return; + } + } +#endif + for (i = 0; i < kset_onmedia->key_num; i++) { + key_onmedia = &kset_onmedia->data[i]; + + key = cache_key_alloc(cache); + if (!key) { + cbd_cache_err(cache, "writeback error failed to alloc key\n"); + queue_delayed_work(cache->cache_wq, &cache->writeback_work, 1 * HZ); + return; + } + + cache_key_decode(key_onmedia, key); + ret = cache_key_writeback(cache, key); + cache_key_put(key); + + if (ret) { + cbd_cache_err(cache, "writeback error: %d\n", ret); + queue_delayed_work(cache->cache_wq, &cache->writeback_work, 1 * HZ); + return; + } + } + + vfs_fsync(cache->bdev_file, 1); + + cache_pos_advance(pos, get_kset_onmedia_size(kset_onmedia), false); + cache_pos_encode(cache, &cache->cache_info->dirty_tail_pos, &cache->dirty_tail); + + if (kset_onmedia->flags & CBD_KSET_FLAGS_LAST) { + struct cbd_cache_segment *cur_seg, *next_seg; + + pos->cache_seg->cache_seg_info->flags |= CBD_CACHE_SEG_FLAGS_WB_DONE; +next_seg: + cur_seg = pos->cache_seg; + next_seg = cache_seg_get_next(cur_seg); + if (!next_seg) + continue; + pos->cache_seg = next_seg; + pos->seg_off = 0; + cache_pos_encode(cache, &cache->cache_info->dirty_tail_pos, &cache->dirty_tail); + } + } +} + +/* gc */ +static bool need_gc(struct cbd_cache *cache) +{ + void *dirty_addr, *key_addr; + + cache_pos_decode(cache, &cache->cache_info->dirty_tail_pos, &cache->dirty_tail); + + dirty_addr = cache_pos_addr(&cache->dirty_tail); + key_addr = cache_pos_addr(&cache->key_tail); + + if (dirty_addr == key_addr) + return false; + + if (bitmap_weight(cache->seg_map, cache->n_segs) < (cache->n_segs / 100 * cache->cache_info->gc_percent)) + return false; + + return true; +} + +static void gc_fn(struct work_struct *work) +{ + struct cbd_cache *cache = container_of(work, struct cbd_cache, gc_work.work); + struct cbd_cache_pos *pos; + struct cbd_cache_kset_onmedia *kset_onmedia; + struct cbd_cache_key_onmedia *key_onmedia; + struct cbd_cache_key *key = NULL; + void *addr; + int i; + + while (true) { + if (cache->state == cbd_cache_state_stopping) + return; + + if (!need_gc(cache)) { + queue_delayed_work(cache->cache_wq, &cache->gc_work, 1 * HZ); + return; + } + + pos = 
&cache->key_tail; + if (cache_seg_gc_done(pos->cache_seg)) + goto next_seg; + + addr = cache_pos_addr(pos); + kset_onmedia = (struct cbd_cache_kset_onmedia *)addr; + if (kset_onmedia->magic != CBD_KSET_MAGIC) { + cbd_cache_err(cache, "gc error magic is not expected. magic: %llx, expected: %llx\n", + kset_onmedia->magic, CBD_KSET_MAGIC); + queue_delayed_work(cache->cache_wq, &cache->gc_work, 1 * HZ); + return; + } + + if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + cbd_cache_err(cache, "gc error crc is not expected. crc: %x, expected: %x\n", + cache_kset_crc(kset_onmedia), kset_onmedia->crc); + queue_delayed_work(cache->cache_wq, &cache->gc_work, 1 * HZ); + return; + } + + for (i = 0; i < kset_onmedia->key_num; i++) { + key_onmedia = &kset_onmedia->data[i]; + + key = cache_key_alloc(cache); + if (!key) { + cbd_cache_err(cache, "gc error failed to alloc key\n"); + queue_delayed_work(cache->cache_wq, &cache->gc_work, 1 * HZ); + return; + } + + cache_key_decode(key_onmedia, key); + cache_key_gc(cache, key); + cache_key_put(key); + } + + cache_pos_advance(pos, get_kset_onmedia_size(kset_onmedia), false); + cache_pos_encode(cache, &cache->cache_info->key_tail_pos, &cache->key_tail); + + if (kset_onmedia->flags & CBD_KSET_FLAGS_LAST) { + struct cbd_cache_segment *cur_seg, *next_seg; + + pos->cache_seg->cache_seg_info->flags |= CBD_CACHE_SEG_FLAGS_GC_DONE; +next_seg: + cache_pos_decode(cache, &cache->cache_info->dirty_tail_pos, &cache->dirty_tail); + /* dont move next segment if dirty_tail has not move */ + if (cache->dirty_tail.cache_seg == pos->cache_seg) + continue; + cur_seg = pos->cache_seg; + next_seg = cache_seg_get_next(cur_seg); + if (!next_seg) + continue; + pos->cache_seg = next_seg; + pos->seg_off = 0; + cache_pos_encode(cache, &cache->cache_info->key_tail_pos, &cache->key_tail); + cbd_cache_debug(cache, "gc kset seg: %u\n", cur_seg->cache_seg_id); + + spin_lock(&cache->seg_map_lock); + clear_bit(cur_seg->cache_seg_id, cache->seg_map); + cache->cache_info->used_segs--; + spin_unlock(&cache->seg_map_lock); + } + } +} + +/* + * function for flush_work, flush_work is queued in cache_key_append(). When key append + * to kset, if this kset is full, then the kset will be closed immediately, if this kset + * is not full, cache_key_append() will queue a kset->flush_work to close this kset later. + */ +static void kset_flush_fn(struct work_struct *work) +{ + struct cbd_cache_kset *kset = container_of(work, struct cbd_cache_kset, flush_work.work); + struct cbd_cache *cache = kset->cache; + int ret; + + spin_lock(&kset->kset_lock); + ret = cache_kset_close(cache, kset); + spin_unlock(&kset->kset_lock); + + if (ret) { + /* Failed to flush kset, retry it. */ + queue_delayed_work(cache->cache_wq, &kset->flush_work, 0); + } +} + +/* function to clean_work, clean work would be queued after a cache_segment to be invalidated + * in cache gc, then it will clean up the invalid keys from cache_tree in backgroud. + * + * As this clean need to spin_lock(&cache_tree->tree_lock), we unlock after + * CBD_CLEAN_KEYS_MAX keys deleted and start another round for clean. 
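+ * After dropping the lock, the scan restarts from rb_first(), so no stale rb_node is dereferenced across the unlock.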
+ */ +static void clean_fn(struct work_struct *work) +{ + struct cbd_cache *cache = container_of(work, struct cbd_cache, clean_work); + struct cbd_cache_tree *cache_tree; + struct rb_node *node; + struct cbd_cache_key *key; + int i, count; + + for (i = 0; i < cache->n_trees; i++) { + cache_tree = &cache->cache_trees[i]; + +again: + if (cache->state == cbd_cache_state_stopping) + return; + + /* delete at most CBD_CLEAN_KEYS_MAX a round */ + count = 0; + spin_lock(&cache_tree->tree_lock); + node = rb_first(&cache_tree->root); + while (node) { + key = CACHE_KEY(node); + node = rb_next(node); + if (cache_key_invalid(key)) { + count++; + cache_key_delete(key); + } + + if (count >= CBD_CLEAN_KEYS_MAX) { + spin_unlock(&cache_tree->tree_lock); + usleep_range(1000, 2000); + goto again; + } + } + spin_unlock(&cache_tree->tree_lock); + + } +} + +struct cbd_cache *cbd_cache_alloc(struct cbd_transport *cbdt, + struct cbd_cache_opts *opts) +{ + struct cbd_cache_info *cache_info; + struct cbd_backend_info *backend_info; + struct cbd_segment_info *prev_seg_info = NULL; + struct cbd_cache *cache; + u32 seg_id; + u32 backend_id; + int ret; + int i; + + /* options sanitize */ + if (opts->n_paral > CBD_CACHE_PARAL_MAX) { + cbdt_err(cbdt, "n_paral too large (max %u).\n", + CBD_CACHE_PARAL_MAX); + return NULL; + } + + cache_info = opts->cache_info; + backend_info = container_of(cache_info, struct cbd_backend_info, cache_info); + backend_id = get_backend_id(cbdt, backend_info); + + if (opts->n_paral * CBD_CACHE_SEGS_EACH_PARAL > cache_info->n_segs) { + cbdt_err(cbdt, "n_paral %u requires cache size (%llu), more than current (%llu).", + opts->n_paral, opts->n_paral * CBD_CACHE_SEGS_EACH_PARAL * (u64)CBDT_SEG_SIZE, + cache_info->n_segs * (u64)CBDT_SEG_SIZE); + return NULL; + } + + cache = kzalloc(struct_size(cache, segments, cache_info->n_segs), GFP_KERNEL); + if (!cache) + return NULL; + + cache->cache_id = backend_id; + + cache->seg_map = bitmap_zalloc(cache_info->n_segs, GFP_KERNEL); + if (!cache->seg_map) { + ret = -ENOMEM; + goto destroy_cache; + } + + cache->key_cache = KMEM_CACHE(cbd_cache_key, 0); + if (!cache->key_cache) { + ret = -ENOMEM; + goto destroy_cache; + } + + cache->req_cache = KMEM_CACHE(cbd_request, 0); + if (!cache->req_cache) { + ret = -ENOMEM; + goto destroy_cache; + } + + cache->cache_wq = alloc_workqueue("cbdt%d-c%u", WQ_UNBOUND | WQ_MEM_RECLAIM, + 0, cbdt->id, cache->cache_id); + if (!cache->cache_wq) { + ret = -ENOMEM; + goto destroy_cache; + } + + cache->cbdt = cbdt; + cache->cache_info = cache_info; + cache->n_segs = cache_info->n_segs; + spin_lock_init(&cache->seg_map_lock); + cache->bdev_file = opts->bdev_file; + cache->cache_info->gc_percent = CBD_CACHE_GC_PERCENT_DEFAULT; + + spin_lock_init(&cache->key_head_lock); + spin_lock_init(&cache->miss_read_reqs_lock); + INIT_LIST_HEAD(&cache->miss_read_reqs); + + INIT_DELAYED_WORK(&cache->writeback_work, writeback_fn); + INIT_DELAYED_WORK(&cache->gc_work, gc_fn); + INIT_WORK(&cache->clean_work, clean_fn); + INIT_WORK(&cache->miss_read_end_work, miss_read_end_work_fn); + + cache->dev_size = opts->dev_size; + + for (i = 0; i < cache_info->n_segs; i++) { + if (opts->alloc_segs) { + ret = cbdt_get_empty_segment_id(cbdt, &seg_id); + if (ret) + goto destroy_cache; + + if (prev_seg_info) + prev_seg_info->next_seg = seg_id; + else + cache_info->seg_id = seg_id; + + } else { + if (prev_seg_info) + seg_id = prev_seg_info->next_seg; + else + seg_id = cache_info->seg_id; + } + + cache_seg_init(cache, seg_id, i); + prev_seg_info = 
cbdt_get_segment_info(cbdt, seg_id); + } + + cache_pos_decode(cache, &cache_info->key_tail_pos, &cache->key_tail); + cache_pos_decode(cache, &cache_info->dirty_tail_pos, &cache->dirty_tail); + + cache->state = cbd_cache_state_running; + + if (opts->init_keys) { + cache->init_keys = 1; + + cache->n_trees = DIV_ROUND_UP(cache->dev_size << SECTOR_SHIFT, CBD_CACHE_TREE_SIZE); + cache->cache_trees = kvcalloc(cache->n_trees, sizeof(struct cbd_cache_tree), GFP_KERNEL); + if (!cache->cache_trees) { + ret = -ENOMEM; + goto destroy_cache; + } + + for (i = 0; i < cache->n_trees; i++) { + struct cbd_cache_tree *cache_tree; + + cache_tree = &cache->cache_trees[i]; + cache_tree->root = RB_ROOT; + spin_lock_init(&cache_tree->tree_lock); + } + + ret = cache_replay(cache); + if (ret) { + cbd_cache_err(cache, "failed to replay keys\n"); + goto destroy_cache; + } + + cache->n_ksets = opts->n_paral; + cache->ksets = kcalloc(cache->n_ksets, CBD_KSET_SIZE, GFP_KERNEL); + if (!cache->ksets) { + ret = -ENOMEM; + goto destroy_cache; + } + + for (i = 0; i < cache->n_ksets; i++) { + struct cbd_cache_kset *kset; + + kset = get_kset(cache, i); + + kset->cache = cache; + spin_lock_init(&kset->kset_lock); + INIT_DELAYED_WORK(&kset->flush_work, kset_flush_fn); + } + + /* Init caceh->data_heads */ + cache->n_heads = opts->n_paral; + cache->data_heads = kcalloc(cache->n_heads, sizeof(struct cbd_cache_data_head), GFP_KERNEL); + if (!cache->data_heads) { + ret = -ENOMEM; + goto destroy_cache; + } + + for (i = 0; i < cache->n_heads; i++) { + struct cbd_cache_data_head *data_head; + + data_head = &cache->data_heads[i]; + spin_lock_init(&data_head->data_head_lock); + } + } + + /* start writeback */ + if (opts->start_writeback) { + cache->start_writeback = 1; + ret = cache_writeback_init(cache); + if (ret) + goto destroy_cache; + } + + /* start gc */ + if (opts->start_gc) { + cache->start_gc = 1; + queue_delayed_work(cache->cache_wq, &cache->gc_work, 0); + } + + return cache; + +destroy_cache: + cbd_cache_destroy(cache); + + return NULL; +} + +void cbd_cache_destroy(struct cbd_cache *cache) +{ + int i; + + cache->state = cbd_cache_state_stopping; + + if (cache->start_gc) { + cancel_delayed_work_sync(&cache->gc_work); + flush_work(&cache->clean_work); + } + + if (cache->start_writeback) + cache_writeback_exit(cache); + + if (cache->init_keys) { +#ifdef CONFIG_CBD_DEBUG + dump_cache(cache); +#endif + for (i = 0; i < cache->n_trees; i++) { + struct cbd_cache_tree *cache_tree; + struct rb_node *node; + struct cbd_cache_key *key; + + cache_tree = &cache->cache_trees[i]; + + spin_lock(&cache_tree->tree_lock); + node = rb_first(&cache_tree->root); + while (node) { + key = CACHE_KEY(node); + node = rb_next(node); + + cache_key_delete(key); + } + spin_unlock(&cache_tree->tree_lock); + } + + for (i = 0; i < cache->n_ksets; i++) { + struct cbd_cache_kset *kset; + + kset = get_kset(cache, i); + cancel_delayed_work_sync(&kset->flush_work); + } + + cache_flush(cache); + } + + if (cache->cache_wq) { + drain_workqueue(cache->cache_wq); + destroy_workqueue(cache->cache_wq); + } + + kmem_cache_destroy(cache->req_cache); + kmem_cache_destroy(cache->key_cache); + + if (cache->seg_map) + bitmap_free(cache->seg_map); + + for (i = 0; i < cache->n_segs; i++) + cache_seg_exit(&cache->segments[i]); + + kfree(cache->data_heads); + kfree(cache->ksets); + + if (cache->cache_trees) + kvfree(cache->cache_trees); + + kfree(cache); +} From patchwork Wed Sep 18 10:18:19 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 
7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806779 Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 16AB0189901 for ; Wed, 18 Sep 2024 10:19:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654757; cv=none; b=Bod+akkYxpF7Rq2kwtGZABh+HTyavQ4YtxlyeX9Dk/4bfI/W+sqxVJd41tAby8xiTajYYuNjUN16xGt8GfQfJnfh2QnoQy+5sf7Z9Nc5tYuUWhcUmtHlAK0tv5i3M+K79y5tg0xUdrKFlBGUDiHeYRDL801HpNXHhPCmK6/Vdhk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726654757; c=relaxed/simple; bh=BaUGCto9vs8oFFeaBFeumiYxEaC+cI0sNS72mCFbL4c=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=YEtHCItnscFlTR6WW925h26kKy8jSL4nnWLHU5112FfOudVGOVdUCFJANXlHFnzS3vSqqylCCdLUjQNBPm4eNsGCx7ddosZyQjc20tNsFS8Sin4jvhk3OX3PpxO3TcZcjgzzISHLlF3LPr5AqBOH7dMgnPr6PXnsAcPbml9yE2U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=eJNEcFQL; arc=none smtp.client-ip=95.215.58.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="eJNEcFQL" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1726654751; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=W1Gtd6ScgTrJJmOdkTMe2oKtbid4lctcDodxd4qiA3I=; b=eJNEcFQLFvjjY7KxtUWbts+lI1tq4qCiD/LxDrmdnbBlSvQRi8WKAgSXEyLFAT1mP+eoIE rKYYuWenAMd5dIURS8RuoGmaCN3z7DRU3Q4PVz/aDm2KMSN2MoJaGhU4dT5VJEmg2rBeHo WOUuHkHTTebUi5k614JXjMl7PqtKvYA= From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 6/8] cbd: introduce cbd_blkdev Date: Wed, 18 Sep 2024 10:18:19 +0000 Message-Id: <20240918101821.681118-7-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT The "cbd_blkdev" represents a virtual block device named "/dev/cbdX". It corresponds to a backend. The "blkdev" interacts with upper-layer users and accepts IO requests from them. A "blkdev" includes multiple "cbd_queues", each of which requires a "cbd_channel" to interact with the backend's handler. The "cbd_queue" forwards IO requests from the upper layer to the backend's handler through the channel. 
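As a rough sketch of that one-channel-per-queue mapping (illustrative stand-in types and a stub sketch_get_empty_channel_id() only, not the driver's real definitions; the actual allocation happens per queue in cbd_queue_start() via cbd_get_empty_channel_id() in the patch below):

#include <linux/types.h>

/* Illustrative stand-ins; the real structures live in cbd_internal.h. */
struct sketch_queue {
	u32 channel_id;		/* channel segment used to reach the backend handler */
};

struct sketch_blkdev {
	u32 num_queues;		/* one cbd_queue per blk-mq hardware queue */
	struct sketch_queue *queues;
};

/* Stub standing in for cbd_get_empty_channel_id(); hands out ids sequentially. */
static int sketch_get_empty_channel_id(u32 *channel_id)
{
	static u32 next_id;

	*channel_id = next_id++;
	return 0;
}

/* Each queue claims its own channel, mirroring cbd_blkdev_create_queues()
 * calling cbd_queue_start() for every queue.
 */
static int sketch_map_queues(struct sketch_blkdev *bdev)
{
	u32 i;
	int ret;

	for (i = 0; i < bdev->num_queues; i++) {
		ret = sketch_get_empty_channel_id(&bdev->queues[i].channel_id);
		if (ret)
			return ret;
	}

	return 0;
}

With one channel per queue, each queue owns its own submission and completion rings, so ring head/tail updates only need the per-queue channel locks used in cbd_queue.c.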
cbd_blkdev allow user to force stop a /dev/cbdX, even if backend is not responsible. Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_blkdev.c | 433 +++++++++++++++++++++++++ drivers/block/cbd/cbd_queue.c | 574 +++++++++++++++++++++++++++++++++ 2 files changed, 1007 insertions(+) create mode 100644 drivers/block/cbd/cbd_blkdev.c create mode 100644 drivers/block/cbd/cbd_queue.c diff --git a/drivers/block/cbd/cbd_blkdev.c b/drivers/block/cbd/cbd_blkdev.c new file mode 100644 index 000000000000..f16bc429704b --- /dev/null +++ b/drivers/block/cbd/cbd_blkdev.c @@ -0,0 +1,433 @@ +#include "cbd_internal.h" + +static ssize_t blkdev_backend_id_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_blkdev_device *blkdev; + struct cbd_blkdev_info *blkdev_info; + + blkdev = container_of(dev, struct cbd_blkdev_device, dev); + blkdev_info = blkdev->blkdev_info; + + if (blkdev_info->state == cbd_blkdev_state_none) + return 0; + + return sprintf(buf, "%u\n", blkdev_info->backend_id); +} + +static DEVICE_ATTR(backend_id, 0400, blkdev_backend_id_show, NULL); + +static ssize_t blkdev_host_id_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_blkdev_device *blkdev; + struct cbd_blkdev_info *blkdev_info; + + blkdev = container_of(dev, struct cbd_blkdev_device, dev); + blkdev_info = blkdev->blkdev_info; + + if (blkdev_info->state == cbd_blkdev_state_none) + return 0; + + return sprintf(buf, "%u\n", blkdev_info->host_id); +} + +static DEVICE_ATTR(host_id, 0400, blkdev_host_id_show, NULL); + +static ssize_t blkdev_mapped_id_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_blkdev_device *blkdev; + struct cbd_blkdev_info *blkdev_info; + + blkdev = container_of(dev, struct cbd_blkdev_device, dev); + blkdev_info = blkdev->blkdev_info; + + if (blkdev_info->state == cbd_blkdev_state_none) + return 0; + + return sprintf(buf, "%u\n", blkdev_info->mapped_id); +} + +static DEVICE_ATTR(mapped_id, 0400, blkdev_mapped_id_show, NULL); + +CBD_OBJ_HEARTBEAT(blkdev); + +static struct attribute *cbd_blkdev_attrs[] = { + &dev_attr_mapped_id.attr, + &dev_attr_host_id.attr, + &dev_attr_backend_id.attr, + &dev_attr_alive.attr, + NULL +}; + +static struct attribute_group cbd_blkdev_attr_group = { + .attrs = cbd_blkdev_attrs, +}; + +static const struct attribute_group *cbd_blkdev_attr_groups[] = { + &cbd_blkdev_attr_group, + NULL +}; + +static void cbd_blkdev_release(struct device *dev) +{ +} + +const struct device_type cbd_blkdev_type = { + .name = "cbd_blkdev", + .groups = cbd_blkdev_attr_groups, + .release = cbd_blkdev_release, +}; + +const struct device_type cbd_blkdevs_type = { + .name = "cbd_blkdevs", + .release = cbd_blkdev_release, +}; + + +static int cbd_major; +static DEFINE_IDA(cbd_mapped_id_ida); + +static int minor_to_cbd_mapped_id(int minor) +{ + return minor >> CBD_PART_SHIFT; +} + + +static int cbd_open(struct gendisk *disk, blk_mode_t mode) +{ + struct cbd_blkdev *cbd_blkdev = disk->private_data; + + mutex_lock(&cbd_blkdev->lock); + cbd_blkdev->open_count++; + mutex_unlock(&cbd_blkdev->lock); + + return 0; +} + +static void cbd_release(struct gendisk *disk) +{ + struct cbd_blkdev *cbd_blkdev = disk->private_data; + + mutex_lock(&cbd_blkdev->lock); + cbd_blkdev->open_count--; + mutex_unlock(&cbd_blkdev->lock); +} + +static const struct block_device_operations cbd_bd_ops = { + .owner = THIS_MODULE, + .open = cbd_open, + .release = cbd_release, +}; + +static void cbd_blkdev_stop_queues(struct cbd_blkdev 
*cbd_blkdev) +{ + int i; + + for (i = 0; i < cbd_blkdev->num_queues; i++) + cbd_queue_stop(&cbd_blkdev->queues[i]); +} + +static void cbd_blkdev_destroy_queues(struct cbd_blkdev *cbd_blkdev) +{ + cbd_blkdev_stop_queues(cbd_blkdev); + kfree(cbd_blkdev->queues); +} + +static int cbd_blkdev_create_queues(struct cbd_blkdev *cbd_blkdev) +{ + int i; + int ret; + struct cbd_queue *cbdq; + + cbd_blkdev->queues = kcalloc(cbd_blkdev->num_queues, sizeof(struct cbd_queue), GFP_KERNEL); + if (!cbd_blkdev->queues) + return -ENOMEM; + + for (i = 0; i < cbd_blkdev->num_queues; i++) { + cbdq = &cbd_blkdev->queues[i]; + cbdq->cbd_blkdev = cbd_blkdev; + cbdq->index = i; + ret = cbd_queue_start(cbdq); + if (ret) + goto err; + } + + return 0; +err: + cbd_blkdev_destroy_queues(cbd_blkdev); + return ret; +} + +static int disk_start(struct cbd_blkdev *cbd_blkdev) +{ + struct gendisk *disk; + struct queue_limits lim = { + .max_hw_sectors = BIO_MAX_VECS * PAGE_SECTORS, + .io_min = 4096, + .io_opt = 4096, + .max_segments = USHRT_MAX, + .max_segment_size = UINT_MAX, + .discard_granularity = 0, + .max_hw_discard_sectors = 0, + .max_write_zeroes_sectors = 0 + }; + int ret; + + memset(&cbd_blkdev->tag_set, 0, sizeof(cbd_blkdev->tag_set)); + cbd_blkdev->tag_set.ops = &cbd_mq_ops; + cbd_blkdev->tag_set.queue_depth = 128; + cbd_blkdev->tag_set.numa_node = NUMA_NO_NODE; + cbd_blkdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED; + cbd_blkdev->tag_set.nr_hw_queues = cbd_blkdev->num_queues; + cbd_blkdev->tag_set.cmd_size = sizeof(struct cbd_request); + cbd_blkdev->tag_set.timeout = 0; + cbd_blkdev->tag_set.driver_data = cbd_blkdev; + + ret = blk_mq_alloc_tag_set(&cbd_blkdev->tag_set); + if (ret) { + cbd_blk_err(cbd_blkdev, "failed to alloc tag set %d", ret); + goto err; + } + + disk = blk_mq_alloc_disk(&cbd_blkdev->tag_set, &lim, cbd_blkdev); + if (IS_ERR(disk)) { + ret = PTR_ERR(disk); + cbd_blk_err(cbd_blkdev, "failed to alloc disk"); + goto out_tag_set; + } + + snprintf(disk->disk_name, sizeof(disk->disk_name), "cbd%d", + cbd_blkdev->mapped_id); + + disk->major = cbd_major; + disk->first_minor = cbd_blkdev->mapped_id << CBD_PART_SHIFT; + disk->minors = (1 << CBD_PART_SHIFT); + + disk->fops = &cbd_bd_ops; + disk->private_data = cbd_blkdev; + + /* Tell the block layer that this is not a rotational device */ + blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue); + + cbd_blkdev->disk = disk; + + cbdt_add_blkdev(cbd_blkdev->cbdt, cbd_blkdev); + cbd_blkdev->blkdev_info->mapped_id = cbd_blkdev->blkdev_id; + cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running; + + set_capacity(cbd_blkdev->disk, cbd_blkdev->dev_size); + + set_disk_ro(cbd_blkdev->disk, false); + blk_queue_write_cache(cbd_blkdev->disk->queue, false, false); + + ret = add_disk(cbd_blkdev->disk); + if (ret) + goto put_disk; + + ret = sysfs_create_link(&disk_to_dev(cbd_blkdev->disk)->kobj, + &cbd_blkdev->blkdev_dev->dev.kobj, "cbd_blkdev"); + if (ret) + goto del_disk; + + return 0; + +del_disk: + del_gendisk(cbd_blkdev->disk); +put_disk: + put_disk(cbd_blkdev->disk); +out_tag_set: + blk_mq_free_tag_set(&cbd_blkdev->tag_set); +err: + return ret; +} + +int cbd_blkdev_start(struct cbd_transport *cbdt, u32 backend_id, u32 queues) +{ + struct cbd_blkdev *cbd_blkdev; + struct cbd_backend_info *backend_info; + u64 dev_size; + int ret; + + backend_info = cbdt_get_backend_info(cbdt, backend_id); + if (backend_info->blkdev_count == CBDB_BLKDEV_COUNT_MAX) + return -EBUSY; + + if (!cbd_backend_info_is_alive(backend_info)) { + cbdt_err(cbdt, "backend %u is not 
alive\n", backend_id); + return -EINVAL; + } + + dev_size = backend_info->dev_size; + + cbd_blkdev = kzalloc(sizeof(struct cbd_blkdev), GFP_KERNEL); + if (!cbd_blkdev) + return -ENOMEM; + + mutex_init(&cbd_blkdev->lock); + + if (backend_info->host_id == cbdt->host->host_id) + cbd_blkdev->backend = cbdt_get_backend(cbdt, backend_id); + + ret = cbdt_get_empty_blkdev_id(cbdt, &cbd_blkdev->blkdev_id); + if (ret < 0) + goto blkdev_free; + + cbd_blkdev->mapped_id = ida_simple_get(&cbd_mapped_id_ida, 0, + minor_to_cbd_mapped_id(1 << MINORBITS), + GFP_KERNEL); + if (cbd_blkdev->mapped_id < 0) { + ret = -ENOENT; + goto blkdev_free; + } + + cbd_blkdev->task_wq = alloc_workqueue("cbdt%d-d%u", WQ_UNBOUND | WQ_MEM_RECLAIM, + 0, cbdt->id, cbd_blkdev->mapped_id); + if (!cbd_blkdev->task_wq) { + ret = -ENOMEM; + goto ida_remove; + } + + INIT_LIST_HEAD(&cbd_blkdev->node); + cbd_blkdev->cbdt = cbdt; + cbd_blkdev->backend_id = backend_id; + cbd_blkdev->num_queues = queues; + cbd_blkdev->dev_size = dev_size; + cbd_blkdev->blkdev_info = cbdt_get_blkdev_info(cbdt, cbd_blkdev->blkdev_id); + cbd_blkdev->blkdev_dev = &cbdt->cbd_blkdevs_dev->blkdev_devs[cbd_blkdev->blkdev_id]; + + cbd_blkdev->blkdev_info->backend_id = backend_id; + cbd_blkdev->blkdev_info->host_id = cbdt->host->host_id; + cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running; + + + if (cbd_backend_cache_on(backend_info)) { + struct cbd_cache_opts cache_opts = { 0 }; + + cache_opts.cache_info = &backend_info->cache_info; + cache_opts.alloc_segs = false; + cache_opts.start_writeback = false; + cache_opts.start_gc = true; + cache_opts.init_keys = true; + cache_opts.dev_size = dev_size; + cache_opts.n_paral = cbd_blkdev->num_queues; + cbd_blkdev->cbd_cache = cbd_cache_alloc(cbdt, &cache_opts); + if (!cbd_blkdev->cbd_cache) { + ret = -ENOMEM; + goto destroy_wq; + } + } + + ret = cbd_blkdev_create_queues(cbd_blkdev); + if (ret < 0) + goto destroy_cache; + + backend_info->blkdev_count++; + + INIT_DELAYED_WORK(&cbd_blkdev->hb_work, blkdev_hb_workfn); + queue_delayed_work(cbd_wq, &cbd_blkdev->hb_work, 0); + + ret = disk_start(cbd_blkdev); + if (ret < 0) + goto destroy_queues; + return 0; + +destroy_queues: + cancel_delayed_work_sync(&cbd_blkdev->hb_work); + cbd_blkdev_destroy_queues(cbd_blkdev); +destroy_cache: + if (cbd_blkdev->cbd_cache) + cbd_cache_destroy(cbd_blkdev->cbd_cache); +destroy_wq: + cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none; + destroy_workqueue(cbd_blkdev->task_wq); +ida_remove: + ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id); +blkdev_free: + kfree(cbd_blkdev); + return ret; +} + +static void disk_stop(struct cbd_blkdev *cbd_blkdev) +{ + sysfs_remove_link(&disk_to_dev(cbd_blkdev->disk)->kobj, "cbd_blkdev"); + del_gendisk(cbd_blkdev->disk); + put_disk(cbd_blkdev->disk); + blk_mq_free_tag_set(&cbd_blkdev->tag_set); +} + +int cbd_blkdev_stop(struct cbd_transport *cbdt, u32 devid, bool force) +{ + struct cbd_blkdev *cbd_blkdev; + struct cbd_backend_info *backend_info; + + cbd_blkdev = cbdt_get_blkdev(cbdt, devid); + if (!cbd_blkdev) + return -EINVAL; + + mutex_lock(&cbd_blkdev->lock); + if (cbd_blkdev->open_count > 0 && !force) { + mutex_unlock(&cbd_blkdev->lock); + return -EBUSY; + } + + cbdt_del_blkdev(cbdt, cbd_blkdev); + mutex_unlock(&cbd_blkdev->lock); + + cbd_blkdev_stop_queues(cbd_blkdev); + disk_stop(cbd_blkdev); + kfree(cbd_blkdev->queues); + + cancel_delayed_work_sync(&cbd_blkdev->hb_work); + cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none; + + drain_workqueue(cbd_blkdev->task_wq); + 
destroy_workqueue(cbd_blkdev->task_wq); + ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id); + backend_info = cbdt_get_backend_info(cbdt, cbd_blkdev->backend_id); + + if (cbd_blkdev->cbd_cache) + cbd_cache_destroy(cbd_blkdev->cbd_cache); + + kfree(cbd_blkdev); + + backend_info->blkdev_count--; + + return 0; +} + +int cbd_blkdev_clear(struct cbd_transport *cbdt, u32 devid) +{ + struct cbd_blkdev_info *blkdev_info; + + blkdev_info = cbdt_get_blkdev_info(cbdt, devid); + if (cbd_blkdev_info_is_alive(blkdev_info)) { + cbdt_err(cbdt, "blkdev %u is still alive\n", devid); + return -EBUSY; + } + + if (blkdev_info->state == cbd_blkdev_state_none) + return 0; + + blkdev_info->state = cbd_blkdev_state_none; + + return 0; +} + +int cbd_blkdev_init(void) +{ + cbd_major = register_blkdev(0, "cbd"); + if (cbd_major < 0) + return cbd_major; + + return 0; +} + +void cbd_blkdev_exit(void) +{ + unregister_blkdev(cbd_major, "cbd"); +} diff --git a/drivers/block/cbd/cbd_queue.c b/drivers/block/cbd/cbd_queue.c new file mode 100644 index 000000000000..cd40574d4a48 --- /dev/null +++ b/drivers/block/cbd/cbd_queue.c @@ -0,0 +1,574 @@ +#include "cbd_internal.h" + +static inline struct cbd_se *get_submit_entry(struct cbd_queue *cbdq) +{ + return (struct cbd_se *)(cbdq->channel.submr + cbdq->channel_info->submr_head); +} + +static inline struct cbd_se *get_oldest_se(struct cbd_queue *cbdq) +{ + if (cbdq->channel_info->submr_tail == cbdq->channel_info->submr_head) + return NULL; + + return (struct cbd_se *)(cbdq->channel.submr + cbdq->channel_info->submr_tail); +} + +static inline struct cbd_ce *get_complete_entry(struct cbd_queue *cbdq) +{ + if (cbdq->channel_info->compr_tail == cbdq->channel_info->compr_head) + return NULL; + + return (struct cbd_ce *)(cbdq->channel.compr + cbdq->channel_info->compr_tail); +} + +static void cbd_req_init(struct cbd_queue *cbdq, enum cbd_op op, struct request *rq) +{ + struct cbd_request *cbd_req = blk_mq_rq_to_pdu(rq); + + cbd_req->req = rq; + cbd_req->cbdq = cbdq; + cbd_req->op = op; + + if (req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE) + cbd_req->data_len = blk_rq_bytes(rq); + else + cbd_req->data_len = 0; + + cbd_req->bio = rq->bio; + cbd_req->off = (u64)blk_rq_pos(rq) << SECTOR_SHIFT; +} + +static bool cbd_req_nodata(struct cbd_request *cbd_req) +{ + switch (cbd_req->op) { + case CBD_OP_WRITE: + case CBD_OP_READ: + return false; + case CBD_OP_FLUSH: + return true; + default: + BUG(); + } +} + +static void queue_req_se_init(struct cbd_request *cbd_req) +{ + struct cbd_se *se; + u64 offset = cbd_req->off; + u32 length = cbd_req->data_len; + + se = get_submit_entry(cbd_req->cbdq); + memset(se, 0, sizeof(struct cbd_se)); + se->op = cbd_req->op; + + se->req_tid = cbd_req->req_tid; + se->offset = offset; + se->len = length; + + if (cbd_req->op == CBD_OP_READ || cbd_req->op == CBD_OP_WRITE) { + se->data_off = cbd_req->cbdq->channel.data_head; + se->data_len = length; + } + + cbd_req->se = se; +} + +static bool data_space_enough(struct cbd_queue *cbdq, struct cbd_request *cbd_req) +{ + struct cbd_channel *channel = &cbdq->channel; + u32 space_available = channel->data_size; + u32 space_needed; + + if (channel->data_head > channel->data_tail) { + space_available = channel->data_size - channel->data_head; + space_available += channel->data_tail; + } else if (channel->data_head < channel->data_tail) { + space_available = channel->data_tail - channel->data_head; + } + + space_needed = round_up(cbd_req->data_len, CBDC_DATA_ALIGH); + + if (space_available - 
CBDC_DATA_RESERVED < space_needed) + return false; + + return true; +} + +static bool submit_ring_full(struct cbd_queue *cbdq) +{ + u32 space_available = cbdq->channel.submr_size; + struct cbd_channel_info *info = cbdq->channel_info; + + if (info->submr_head > info->submr_tail) { + space_available = cbdq->channel.submr_size - info->submr_head; + space_available += info->submr_tail; + } else if (info->submr_head < info->submr_tail) { + space_available = info->submr_tail - info->submr_head; + } + + /* There is a SUBMR_RESERVED we dont use to prevent the ring to be used up */ + if (space_available - CBDC_SUBMR_RESERVED < sizeof(struct cbd_se)) + return true; + + return false; +} + +static void queue_req_data_init(struct cbd_request *cbd_req) +{ + struct cbd_queue *cbdq = cbd_req->cbdq; + struct bio *bio = cbd_req->bio; + + if (cbd_req->op == CBD_OP_READ) + goto advance_data_head; + + cbdc_copy_from_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio, cbd_req->bio_off); + +advance_data_head: + cbdq->channel.data_head = round_up(cbdq->channel.data_head + cbd_req->data_len, PAGE_SIZE); + cbdq->channel.data_head %= cbdq->channel.data_size; +} + +#ifdef CONFIG_CBD_CRC +static void cbd_req_crc_init(struct cbd_request *cbd_req) +{ + struct cbd_queue *cbdq = cbd_req->cbdq; + struct cbd_se *se = cbd_req->se; + + if (cbd_req->op == CBD_OP_WRITE) + se->data_crc = cbd_channel_crc(&cbdq->channel, + cbd_req->data_off, + cbd_req->data_len); + + se->se_crc = cbd_se_crc(se); +} +#endif + +static void end_req(struct kref *ref) +{ + struct cbd_request *cbd_req = container_of(ref, struct cbd_request, ref); + struct request *req = cbd_req->req; + int ret = cbd_req->ret; + + if (cbd_req->end_req) + cbd_req->end_req(cbd_req, cbd_req->priv_data); + + if (req) { + if (ret == -ENOMEM || ret == -EBUSY) + blk_mq_requeue_request(req, true); + else + blk_mq_end_request(req, errno_to_blk_status(ret)); + } + +} + +void cbd_req_get(struct cbd_request *cbd_req) +{ + kref_get(&cbd_req->ref); +} + +void cbd_req_put(struct cbd_request *cbd_req, int ret) +{ + struct cbd_request *parent = cbd_req->parent; + + if (ret && !cbd_req->ret) + cbd_req->ret = ret; + + if (kref_put(&cbd_req->ref, end_req) && parent) + cbd_req_put(parent, ret); +} + +static void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret); +int cbd_queue_req_to_backend(struct cbd_request *cbd_req) +{ + struct cbd_queue *cbdq = cbd_req->cbdq; + size_t command_size; + int ret; + + spin_lock(&cbdq->inflight_reqs_lock); + if (atomic_read(&cbdq->state) == cbd_queue_state_removing) { + spin_unlock(&cbdq->inflight_reqs_lock); + ret = -EIO; + goto err; + } + list_add_tail(&cbd_req->inflight_reqs_node, &cbdq->inflight_reqs); + spin_unlock(&cbdq->inflight_reqs_lock); + + command_size = sizeof(struct cbd_se); + + spin_lock(&cbdq->channel.submr_lock); + if (cbd_req->op == CBD_OP_WRITE || cbd_req->op == CBD_OP_READ) + cbd_req->data_off = cbdq->channel.data_head; + else + cbd_req->data_off = -1; + + if (submit_ring_full(cbdq) || + !data_space_enough(cbdq, cbd_req)) { + spin_unlock(&cbdq->channel.submr_lock); + + /* remove request from inflight_reqs */ + spin_lock(&cbdq->inflight_reqs_lock); + list_del_init(&cbd_req->inflight_reqs_node); + spin_unlock(&cbdq->inflight_reqs_lock); + + /* return ocuppied space */ + cbd_req->data_len = 0; + + ret = -ENOMEM; + goto err; + } + + cbd_req->req_tid = ++cbdq->req_tid; + queue_req_se_init(cbd_req); + + if (!cbd_req_nodata(cbd_req)) + queue_req_data_init(cbd_req); + + cbd_req_get(cbd_req); +#ifdef 
CONFIG_CBD_CRC + cbd_req_crc_init(cbd_req); +#endif + CBDC_UPDATE_SUBMR_HEAD(cbdq->channel_info->submr_head, + sizeof(struct cbd_se), + cbdq->channel.submr_size); + spin_unlock(&cbdq->channel.submr_lock); + + if (cbdq->cbd_blkdev->backend) + cbd_backend_notify(cbdq->cbd_blkdev->backend, cbdq->channel.seg_id); + queue_delayed_work(cbdq->cbd_blkdev->task_wq, &cbdq->complete_work, 0); + + return 0; + +err: + return ret; +} + +static void queue_req_end_req(struct cbd_request *cbd_req, void *priv_data); +static void cbd_queue_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req) +{ + int ret; + + if (cbdq->cbd_blkdev->cbd_cache) { + ret = cbd_cache_handle_req(cbdq->cbd_blkdev->cbd_cache, cbd_req); + goto end; + } + + cbd_req->end_req = queue_req_end_req; + + ret = cbd_queue_req_to_backend(cbd_req); +end: + cbd_req_put(cbd_req, ret); +} + +static void advance_subm_ring(struct cbd_queue *cbdq) +{ + struct cbd_se *se; +again: + se = get_oldest_se(cbdq); + if (!se) + goto out; + + if (cbd_se_flags_test(se, CBD_SE_FLAGS_DONE)) { + CBDC_UPDATE_SUBMR_TAIL(cbdq->channel_info->submr_tail, + sizeof(struct cbd_se), + cbdq->channel.submr_size); + goto again; + } +out: + return; +} + +static bool __advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len) +{ + if (data_off == cbdq->channel.data_tail) { + cbdq->released_extents[data_off / PAGE_SIZE] = 0; + cbdq->channel.data_tail += data_len; + cbdq->channel.data_tail %= cbdq->channel.data_size; + return true; + } + + return false; +} + +static void advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len) +{ + data_off %= cbdq->channel.data_size; + cbdq->released_extents[data_off / PAGE_SIZE] = data_len; + + while (__advance_data_tail(cbdq, data_off, data_len)) { + data_off += data_len; + data_off %= cbdq->channel.data_size; + data_len = cbdq->released_extents[data_off / PAGE_SIZE]; + if (!data_len) + break; + } +} + +void cbd_queue_advance(struct cbd_queue *cbdq, struct cbd_request *cbd_req) +{ + spin_lock(&cbdq->channel.submr_lock); + advance_subm_ring(cbdq); + if (!cbd_req_nodata(cbd_req) && cbd_req->data_len) + advance_data_tail(cbdq, cbd_req->data_off, round_up(cbd_req->data_len, PAGE_SIZE)); + spin_unlock(&cbdq->channel.submr_lock); +} + +static void queue_req_end_req(struct cbd_request *cbd_req, void *priv_data) +{ + cbd_queue_advance(cbd_req->cbdq, cbd_req); +} + +static inline void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret) +{ + spin_lock(&cbdq->inflight_reqs_lock); + list_del_init(&cbd_req->inflight_reqs_node); + spin_unlock(&cbdq->inflight_reqs_lock); + + cbd_se_flags_set(cbd_req->se, CBD_SE_FLAGS_DONE); + + cbd_req_put(cbd_req, ret); +} + +static struct cbd_request *find_inflight_req(struct cbd_queue *cbdq, u64 req_tid) +{ + struct cbd_request *req; + bool found = false; + + list_for_each_entry(req, &cbdq->inflight_reqs, inflight_reqs_node) { + if (req->req_tid == req_tid) { + found = true; + break; + } + } + + if (found) + return req; + + return NULL; +} + +static void copy_data_from_cbdreq(struct cbd_request *cbd_req) +{ + struct bio *bio = cbd_req->bio; + struct cbd_queue *cbdq = cbd_req->cbdq; + + spin_lock(&cbd_req->lock); + cbdc_copy_to_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio, cbd_req->bio_off); + spin_unlock(&cbd_req->lock); +} + +static void complete_work_fn(struct work_struct *work) +{ + struct cbd_queue *cbdq = container_of(work, struct cbd_queue, complete_work.work); + struct cbd_ce *ce; + struct cbd_request *cbd_req; + +again: + /* compr_head 
would be updated by backend handler */ + spin_lock(&cbdq->channel.compr_lock); + ce = get_complete_entry(cbdq); + spin_unlock(&cbdq->channel.compr_lock); + if (!ce) + goto miss; + + spin_lock(&cbdq->inflight_reqs_lock); + cbd_req = find_inflight_req(cbdq, ce->req_tid); + spin_unlock(&cbdq->inflight_reqs_lock); + if (!cbd_req) { + cbd_queue_err(cbdq, "inflight request not found: %llu.", ce->req_tid); + goto miss; + } + +#ifdef CONFIG_CBD_CRC + if (ce->ce_crc != cbd_ce_crc(ce)) { + cbd_queue_err(cbdq, "ce crc bad 0x%x != 0x%x(expected)", + cbd_ce_crc(ce), ce->ce_crc); + goto miss; + } + + if (cbd_req->op == CBD_OP_READ && + ce->data_crc != cbd_channel_crc(&cbdq->channel, + cbd_req->data_off, + cbd_req->data_len)) { + cbd_queue_err(cbdq, "ce data_crc bad 0x%x != 0x%x(expected)", + cbd_channel_crc(&cbdq->channel, + cbd_req->data_off, + cbd_req->data_len), + ce->data_crc); + goto miss; + } +#endif + + cbdwc_hit(&cbdq->complete_worker_cfg); + CBDC_UPDATE_COMPR_TAIL(cbdq->channel_info->compr_tail, + sizeof(struct cbd_ce), + cbdq->channel.compr_size); + + if (cbd_req->op == CBD_OP_READ) { + spin_lock(&cbdq->channel.submr_lock); + copy_data_from_cbdreq(cbd_req); + spin_unlock(&cbdq->channel.submr_lock); + } + + complete_inflight_req(cbdq, cbd_req, ce->result); + + goto again; + +miss: + if (cbdwc_need_retry(&cbdq->complete_worker_cfg)) + goto again; + + spin_lock(&cbdq->inflight_reqs_lock); + if (list_empty(&cbdq->inflight_reqs)) { + spin_unlock(&cbdq->inflight_reqs_lock); + cbdwc_init(&cbdq->complete_worker_cfg); + return; + } + spin_unlock(&cbdq->inflight_reqs_lock); + cbdwc_miss(&cbdq->complete_worker_cfg); + + cpu_relax(); + queue_delayed_work(cbdq->cbd_blkdev->task_wq, &cbdq->complete_work, 0); +} + +static blk_status_t cbd_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) +{ + struct request *req = bd->rq; + struct cbd_queue *cbdq = hctx->driver_data; + struct cbd_request *cbd_req = blk_mq_rq_to_pdu(bd->rq); + + memset(cbd_req, 0, sizeof(struct cbd_request)); + INIT_LIST_HEAD(&cbd_req->inflight_reqs_node); + kref_init(&cbd_req->ref); + spin_lock_init(&cbd_req->lock); + + blk_mq_start_request(bd->rq); + + switch (req_op(bd->rq)) { + case REQ_OP_FLUSH: + cbd_req_init(cbdq, CBD_OP_FLUSH, req); + break; + case REQ_OP_WRITE: + cbd_req_init(cbdq, CBD_OP_WRITE, req); + break; + case REQ_OP_READ: + cbd_req_init(cbdq, CBD_OP_READ, req); + break; + default: + return BLK_STS_IOERR; + } + + cbd_queue_req(cbdq, cbd_req); + + return BLK_STS_OK; +} + +static int cbd_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, + unsigned int hctx_idx) +{ + struct cbd_blkdev *cbd_blkdev = driver_data; + struct cbd_queue *cbdq; + + cbdq = &cbd_blkdev->queues[hctx_idx]; + hctx->driver_data = cbdq; + + return 0; +} + +const struct blk_mq_ops cbd_mq_ops = { + .queue_rq = cbd_queue_rq, + .init_hctx = cbd_init_hctx, +}; + +static int cbd_queue_channel_init(struct cbd_queue *cbdq, u32 channel_id) +{ + struct cbd_blkdev *cbd_blkdev = cbdq->cbd_blkdev; + struct cbd_transport *cbdt = cbd_blkdev->cbdt; + + cbd_channel_init(&cbdq->channel, cbdt, channel_id); + cbdq->channel_info = cbdq->channel.channel_info; + + if (!cbd_blkdev->backend) + cbdq->channel_info->polling = true; + + cbdq->channel.data_head = cbdq->channel.data_tail = 0; + cbdq->channel_info->submr_tail = cbdq->channel_info->submr_head = 0; + cbdq->channel_info->compr_tail = cbdq->channel_info->compr_head = 0; + + cbdq->channel_info->backend_id = cbd_blkdev->backend_id; + cbdq->channel_info->blkdev_id = cbd_blkdev->blkdev_id; + 
cbdq->channel_info->blkdev_state = cbdc_blkdev_state_running; + + return 0; +} +
+int cbd_queue_start(struct cbd_queue *cbdq) +{ + struct cbd_transport *cbdt = cbdq->cbd_blkdev->cbdt; + u32 channel_id; + int ret; + + ret = cbd_get_empty_channel_id(cbdt, &channel_id); + if (ret < 0) { + cbdt_err(cbdt, "failed to find an available channel_id.\n"); + goto err; + } + + ret = cbd_queue_channel_init(cbdq, channel_id); + if (ret) { + cbdt_err(cbdt, "failed to init dev channel_info: %d.", ret); + goto err; + } + + INIT_LIST_HEAD(&cbdq->inflight_reqs); + spin_lock_init(&cbdq->inflight_reqs_lock); + cbdq->req_tid = 0; + INIT_DELAYED_WORK(&cbdq->complete_work, complete_work_fn); + cbdwc_init(&cbdq->complete_worker_cfg); + + cbdq->released_extents = kzalloc(sizeof(u64) * (CBDC_DATA_SIZE >> PAGE_SHIFT), GFP_KERNEL); + if (!cbdq->released_extents) { + ret = -ENOMEM; + goto channel_exit; + } + + queue_delayed_work(cbdq->cbd_blkdev->task_wq, &cbdq->complete_work, 0); + + atomic_set(&cbdq->state, cbd_queue_state_running); + + return 0; + +channel_exit: + cbdq->channel_info->blkdev_state = cbdc_blkdev_state_none; + cbd_channel_exit(&cbdq->channel); +err: + return ret; +} +
+void cbd_queue_stop(struct cbd_queue *cbdq) +{ + LIST_HEAD(tmp_list); + struct cbd_request *cbd_req; + + if (atomic_read(&cbdq->state) != cbd_queue_state_running) + return; + + atomic_set(&cbdq->state, cbd_queue_state_removing); + cancel_delayed_work_sync(&cbdq->complete_work); + + spin_lock(&cbdq->inflight_reqs_lock); + list_splice_init(&cbdq->inflight_reqs, &tmp_list); + spin_unlock(&cbdq->inflight_reqs_lock); + + while (!list_empty(&tmp_list)) { + cbd_req = list_first_entry(&tmp_list, + struct cbd_request, inflight_reqs_node); + list_del_init(&cbd_req->inflight_reqs_node); + cancel_work_sync(&cbd_req->work); + cbd_req_put(cbd_req, -EIO); + } + + kfree(cbdq->released_extents); + cbdq->channel_info->blkdev_state = cbdc_blkdev_state_none; + cbd_channel_exit(&cbdq->channel); +}
From patchwork Wed Sep 18 10:18:20 2024 X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806780
From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 7/8] cbd: introduce cbd_backend Date: Wed, 18 Sep 2024 10:18:20 +0000 Message-Id: <20240918101821.681118-8-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev>
The "cbd_backend" is responsible for exposing a local block device (such as "/dev/sda") through the "cbd_transport" to other hosts. Any host that registers this transport can map this backend to a local "cbd device"(such as "/dev/cbd0"). All reads and writes to "cbd0" are transmitted through the channel inside the transport to the backend. The handler inside the backend is responsible for processing these read and write requests, converting them into read and write requests corresponding to "sda".
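For reference: the queue and handler code advance the shared submission/completion rings with the CBDC_UPDATE_SUBMR_HEAD/TAIL and CBDC_UPDATE_COMPR_HEAD/TAIL helpers. Their real definitions live in cbd_internal.h and are not part of this patch; judging only from their call sites, they act as simple modular increments, roughly as sketched below (an inferred illustration, not the in-tree definitions):

/*
 * Inferred sketch of the ring-advance helpers: a ring position moves
 * forward by one entry size and wraps at the ring size. The actual
 * macros in cbd_internal.h may differ in detail.
 */
#define CBDC_UPDATE_SUBMR_TAIL(tail, entry_sz, ring_sz) \
	((tail) = ((tail) + (entry_sz)) % (ring_sz))
#define CBDC_UPDATE_COMPR_HEAD(head, entry_sz, ring_sz) \
	((head) = ((head) + (entry_sz)) % (ring_sz))

With that picture, the handler below walks the submission ring from submr_tail, turns each cbd_se into a bio against the backing block device, and on bio completion appends a cbd_ce at compr_head for the blkdev side to consume.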
Signed-off-by: Dongsheng Yang --- drivers/block/cbd/cbd_backend.c | 395 ++++++++++++++++++++++++++++++++ drivers/block/cbd/cbd_handler.c | 242 +++++++++++++++++++ 2 files changed, 637 insertions(+) create mode 100644 drivers/block/cbd/cbd_backend.c create mode 100644 drivers/block/cbd/cbd_handler.c diff --git a/drivers/block/cbd/cbd_backend.c b/drivers/block/cbd/cbd_backend.c new file mode 100644 index 000000000000..2f7e61b09779 --- /dev/null +++ b/drivers/block/cbd/cbd_backend.c @@ -0,0 +1,395 @@ +#include "cbd_internal.h" + +static ssize_t backend_host_id_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_backend_device *backend; + struct cbd_backend_info *backend_info; + + backend = container_of(dev, struct cbd_backend_device, dev); + backend_info = backend->backend_info; + + if (backend_info->state == cbd_backend_state_none) + return 0; + + return sprintf(buf, "%u\n", backend_info->host_id); +} + +static DEVICE_ATTR(host_id, 0400, backend_host_id_show, NULL); + +static ssize_t backend_path_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct cbd_backend_device *backend; + struct cbd_backend_info *backend_info; + + backend = container_of(dev, struct cbd_backend_device, dev); + backend_info = backend->backend_info; + + if (backend_info->state == cbd_backend_state_none) + return 0; + + return sprintf(buf, "%s\n", backend_info->path); +} + +static DEVICE_ATTR(path, 0400, backend_path_show, NULL); + +CBD_OBJ_HEARTBEAT(backend); + +static struct attribute *cbd_backend_attrs[] = { + &dev_attr_path.attr, + &dev_attr_host_id.attr, + &dev_attr_alive.attr, + NULL +}; + +static struct attribute_group cbd_backend_attr_group = { + .attrs = cbd_backend_attrs, +}; + +static const struct attribute_group *cbd_backend_attr_groups[] = { + &cbd_backend_attr_group, + NULL +}; + +static void cbd_backend_release(struct device *dev) +{ +} + +const struct device_type cbd_backend_type = { + .name = "cbd_backend", + .groups = cbd_backend_attr_groups, + .release = cbd_backend_release, +}; + +const struct device_type cbd_backends_type = { + .name = "cbd_backends", + .release = cbd_backend_release, +}; + +int cbdb_add_handler(struct cbd_backend *cbdb, struct cbd_handler *handler) +{ + int ret = 0; + + spin_lock(&cbdb->lock); + if (cbdb->backend_info->state == cbd_backend_state_removing) { + ret = -EFAULT; + goto out; + } + hash_add(cbdb->handlers_hash, &handler->hash_node, handler->channel.seg_id); +out: + spin_unlock(&cbdb->lock); + return ret; +} + +void cbdb_del_handler(struct cbd_backend *cbdb, struct cbd_handler *handler) +{ + spin_lock(&cbdb->lock); + hash_del(&handler->hash_node); + spin_unlock(&cbdb->lock); +} + +static struct cbd_handler *cbdb_get_handler(struct cbd_backend *cbdb, u32 seg_id) +{ + struct cbd_handler *handler; + bool found = false; + + spin_lock(&cbdb->lock); + hash_for_each_possible(cbdb->handlers_hash, handler, + hash_node, seg_id) { + if (handler->channel.seg_id == seg_id) { + found = true; + break; + } + } + spin_unlock(&cbdb->lock); + + if (found) + return handler; + + return NULL; +} + +static void state_work_fn(struct work_struct *work) +{ + struct cbd_backend *cbdb = container_of(work, struct cbd_backend, state_work.work); + struct cbd_transport *cbdt = cbdb->cbdt; + struct cbd_segment_info *segment_info; + struct cbd_channel_info *channel_info; + u32 blkdev_state, backend_state, backend_id; + int ret; + int i; + + for (i = 0; i < cbdt->transport_info->segment_num; i++) { + segment_info = cbdt_get_segment_info(cbdt, 
i); + if (segment_info->type != cbds_type_channel) + continue; + + channel_info = (struct cbd_channel_info *)segment_info; + + blkdev_state = channel_info->blkdev_state; + backend_state = channel_info->backend_state; + backend_id = channel_info->backend_id; + + if (blkdev_state == cbdc_blkdev_state_running && + backend_state == cbdc_backend_state_none && + backend_id == cbdb->backend_id) { + + ret = cbd_handler_create(cbdb, i); + if (ret) { + cbdb_err(cbdb, "create handler for %u error", i); + continue; + } + } + + if (blkdev_state == cbdc_blkdev_state_none && + backend_state == cbdc_backend_state_running && + backend_id == cbdb->backend_id) { + struct cbd_handler *handler; + + handler = cbdb_get_handler(cbdb, i); + if (!handler) + continue; + cbd_handler_destroy(handler); + } + } + + queue_delayed_work(cbd_wq, &cbdb->state_work, 1 * HZ); +} + +static int cbd_backend_init(struct cbd_backend *cbdb, bool new_backend) +{ + struct cbd_backend_info *b_info; + struct cbd_transport *cbdt = cbdb->cbdt; + int ret; + + b_info = cbdt_get_backend_info(cbdt, cbdb->backend_id); + cbdb->backend_info = b_info; + + b_info->host_id = cbdb->cbdt->host->host_id; + + cbdb->backend_io_cache = KMEM_CACHE(cbd_backend_io, 0); + if (!cbdb->backend_io_cache) { + ret = -ENOMEM; + goto err; + } + + cbdb->task_wq = alloc_workqueue("cbdt%d-b%u", WQ_UNBOUND | WQ_MEM_RECLAIM, + 0, cbdt->id, cbdb->backend_id); + if (!cbdb->task_wq) { + ret = -ENOMEM; + goto destroy_io_cache; + } + + cbdb->bdev_file = bdev_file_open_by_path(cbdb->path, + BLK_OPEN_READ | BLK_OPEN_WRITE, cbdb, NULL); + if (IS_ERR(cbdb->bdev_file)) { + cbdt_err(cbdt, "failed to open bdev: %d", (int)PTR_ERR(cbdb->bdev_file)); + ret = PTR_ERR(cbdb->bdev_file); + goto destroy_wq; + } + + cbdb->bdev = file_bdev(cbdb->bdev_file); + if (new_backend) { + b_info->dev_size = bdev_nr_sectors(cbdb->bdev); + } else { + if (b_info->dev_size != bdev_nr_sectors(cbdb->bdev)) { + cbdt_err(cbdt, "Unexpected backend size: %llu, expected: %llu\n", + bdev_nr_sectors(cbdb->bdev), b_info->dev_size); + ret = -EINVAL; + goto close_file; + } + } + + INIT_DELAYED_WORK(&cbdb->state_work, state_work_fn); + INIT_DELAYED_WORK(&cbdb->hb_work, backend_hb_workfn); + hash_init(cbdb->handlers_hash); + cbdb->backend_device = &cbdt->cbd_backends_dev->backend_devs[cbdb->backend_id]; + + spin_lock_init(&cbdb->lock); + + b_info->state = cbd_backend_state_running; + + queue_delayed_work(cbd_wq, &cbdb->state_work, 0); + queue_delayed_work(cbd_wq, &cbdb->hb_work, 0); + + return 0; + +close_file: + fput(cbdb->bdev_file); +destroy_wq: + destroy_workqueue(cbdb->task_wq); +destroy_io_cache: + kmem_cache_destroy(cbdb->backend_io_cache); +err: + return ret; +} + +extern struct device_type cbd_cache_type; + +int cbd_backend_start(struct cbd_transport *cbdt, char *path, u32 backend_id, u32 cache_segs) +{ + struct cbd_backend *backend; + struct cbd_backend_info *backend_info; + struct cbd_cache_info *cache_info; + bool new_backend = false; + int ret; + + if (backend_id == U32_MAX) + new_backend = true; + + if (new_backend) { + ret = cbdt_get_empty_backend_id(cbdt, &backend_id); + if (ret) + return ret; + + backend_info = cbdt_get_backend_info(cbdt, backend_id); + cache_info = &backend_info->cache_info; + cache_info->n_segs = cache_segs; + } else { + backend_info = cbdt_get_backend_info(cbdt, backend_id); + if (cbd_backend_info_is_alive(backend_info)) + return -EBUSY; + cache_info = &backend_info->cache_info; + } + + backend = kzalloc(sizeof(*backend), GFP_KERNEL); + if (!backend) + return -ENOMEM; + + 
strscpy(backend->path, path, CBD_PATH_LEN); + memcpy(backend_info->path, backend->path, CBD_PATH_LEN); + INIT_LIST_HEAD(&backend->node); + backend->backend_id = backend_id; + backend->cbdt = cbdt; + + ret = cbd_backend_init(backend, new_backend); + if (ret) { + kfree(backend); + return ret; + } + + cbdt_add_backend(cbdt, backend); + + if (cache_info->n_segs) { + struct cbd_cache_opts cache_opts = { 0 }; + + cache_opts.cache_info = cache_info; + cache_opts.alloc_segs = new_backend; + cache_opts.start_writeback = true; + cache_opts.start_gc = false; + cache_opts.init_keys = false; + cache_opts.bdev_file = backend->bdev_file; + cache_opts.dev_size = backend_info->dev_size; + backend->cbd_cache = cbd_cache_alloc(cbdt, &cache_opts); + if (!backend->cbd_cache) { + ret = -ENOMEM; + goto backend_stop; + } + + device_initialize(&backend->cache_dev); + device_set_pm_not_required(&backend->cache_dev); + dev_set_name(&backend->cache_dev, "cache"); + backend->cache_dev.parent = &backend->backend_device->dev; + backend->cache_dev.type = &cbd_cache_type; + ret = device_add(&backend->cache_dev); + if (ret) + goto backend_stop; + backend->cache_dev_registered = true; + } + + return 0; + +backend_stop: + cbd_backend_stop(cbdt, backend_id); + + return ret; +} +
+int cbd_backend_stop(struct cbd_transport *cbdt, u32 backend_id) +{ + struct cbd_backend *cbdb; + struct cbd_backend_info *backend_info; + + cbdb = cbdt_get_backend(cbdt, backend_id); + if (!cbdb) + return -ENOENT; + + spin_lock(&cbdb->lock); + if (!hash_empty(cbdb->handlers_hash)) { + spin_unlock(&cbdb->lock); + return -EBUSY; + } + + if (cbdb->backend_info->state == cbd_backend_state_removing) { + spin_unlock(&cbdb->lock); + return -EBUSY; + } + + cbdb->backend_info->state = cbd_backend_state_removing; + spin_unlock(&cbdb->lock); + + cbdt_del_backend(cbdt, cbdb); + + if (cbdb->cbd_cache) { + if (cbdb->cache_dev_registered) + device_unregister(&cbdb->cache_dev); + cbd_cache_destroy(cbdb->cbd_cache); + } + + cancel_delayed_work_sync(&cbdb->hb_work); + cancel_delayed_work_sync(&cbdb->state_work); + + backend_info = cbdt_get_backend_info(cbdt, cbdb->backend_id); + backend_info->state = cbd_backend_state_none; + backend_info->alive_ts = 0; + + drain_workqueue(cbdb->task_wq); + destroy_workqueue(cbdb->task_wq); + + kmem_cache_destroy(cbdb->backend_io_cache); + + fput(cbdb->bdev_file); + kfree(cbdb); + + return 0; +} +
+int cbd_backend_clear(struct cbd_transport *cbdt, u32 backend_id) +{ + struct cbd_backend_info *backend_info; + + backend_info = cbdt_get_backend_info(cbdt, backend_id); + if (cbd_backend_info_is_alive(backend_info)) { + cbdt_err(cbdt, "backend %u is still alive\n", backend_id); + return -EBUSY; + } + + if (backend_info->state == cbd_backend_state_none) + return 0; + + backend_info->state = cbd_backend_state_none; + + return 0; +} +
+bool cbd_backend_cache_on(struct cbd_backend_info *backend_info) +{ + return (backend_info->cache_info.n_segs != 0); +} +
+void cbd_backend_notify(struct cbd_backend *cbdb, u32 seg_id) +{ + struct cbd_handler *handler; + + handler = cbdb_get_handler(cbdb, seg_id); + /* + * If the handler is not ready yet, return directly; the handler + * will queue its own handle_work once it has been created. + */ + if (!handler) + return; + cbd_handler_notify(handler); +}
diff --git a/drivers/block/cbd/cbd_handler.c b/drivers/block/cbd/cbd_handler.c new file mode 100644 index 000000000000..6589dbe9acc7 --- /dev/null +++ b/drivers/block/cbd/cbd_handler.c @@ -0,0 +1,242 @@ +#include "cbd_internal.h" + +static inline struct cbd_se
*get_se_head(struct cbd_handler *handler) +{ + return (struct cbd_se *)(handler->channel.submr + handler->channel_info->submr_head); +} + +static inline struct cbd_se *get_se_to_handle(struct cbd_handler *handler) +{ + return (struct cbd_se *)(handler->channel.submr + handler->se_to_handle); +} + +static inline struct cbd_ce *get_compr_head(struct cbd_handler *handler) +{ + return (struct cbd_ce *)(handler->channel.compr + handler->channel_info->compr_head); +} + +static inline void complete_cmd(struct cbd_handler *handler, struct cbd_se *se, int ret) +{ + struct cbd_ce *ce; + unsigned long flags; + + spin_lock_irqsave(&handler->compr_lock, flags); + ce = get_compr_head(handler); + + memset(ce, 0, sizeof(*ce)); + ce->req_tid = se->req_tid; + ce->result = ret; +#ifdef CONFIG_CBD_CRC + if (se->op == CBD_OP_READ) + ce->data_crc = cbd_channel_crc(&handler->channel, se->data_off, se->data_len); + ce->ce_crc = cbd_ce_crc(ce); +#endif + CBDC_UPDATE_COMPR_HEAD(handler->channel_info->compr_head, + sizeof(struct cbd_ce), + handler->channel.compr_size); + spin_unlock_irqrestore(&handler->compr_lock, flags); +} + +static void backend_bio_end(struct bio *bio) +{ + struct cbd_backend_io *backend_io = bio->bi_private; + struct cbd_se *se = backend_io->se; + struct cbd_handler *handler = backend_io->handler; + struct cbd_backend *cbdb = handler->cbdb; + + complete_cmd(handler, se, bio->bi_status); + + bio_put(bio); + kmem_cache_free(cbdb->backend_io_cache, backend_io); +} + +static struct cbd_backend_io *backend_prepare_io(struct cbd_handler *handler, + struct cbd_se *se, blk_opf_t opf) +{ + struct cbd_backend_io *backend_io; + struct cbd_backend *cbdb = handler->cbdb; + + backend_io = kmem_cache_zalloc(cbdb->backend_io_cache, GFP_KERNEL); + if (!backend_io) + return NULL; + backend_io->se = se; + + backend_io->handler = handler; + backend_io->bio = bio_alloc_bioset(cbdb->bdev, + DIV_ROUND_UP(se->len, PAGE_SIZE), + opf, GFP_KERNEL, &handler->bioset); + + if (!backend_io->bio) { + kmem_cache_free(cbdb->backend_io_cache, backend_io); + return NULL; + } + + backend_io->bio->bi_iter.bi_sector = se->offset >> SECTOR_SHIFT; + backend_io->bio->bi_iter.bi_size = 0; + backend_io->bio->bi_private = backend_io; + backend_io->bio->bi_end_io = backend_bio_end; + + return backend_io; +} + +static int handle_backend_cmd(struct cbd_handler *handler, struct cbd_se *se) +{ + struct cbd_backend *cbdb = handler->cbdb; + struct cbd_backend_io *backend_io = NULL; + int ret; + + if (cbd_se_flags_test(se, CBD_SE_FLAGS_DONE)) + return 0; + + switch (se->op) { + case CBD_OP_READ: + backend_io = backend_prepare_io(handler, se, REQ_OP_READ); + break; + case CBD_OP_WRITE: + backend_io = backend_prepare_io(handler, se, REQ_OP_WRITE); + break; + case CBD_OP_FLUSH: + ret = blkdev_issue_flush(cbdb->bdev); + goto complete_cmd; + default: + cbd_handler_err(handler, "unrecognized op: 0x%x", se->op); + ret = -EIO; + goto complete_cmd; + } + + if (!backend_io) + return -ENOMEM; + + ret = cbdc_map_pages(&handler->channel, backend_io); + if (ret) { + kmem_cache_free(cbdb->backend_io_cache, backend_io); + return ret; + } + + submit_bio(backend_io->bio); + + return 0; + +complete_cmd: + complete_cmd(handler, se, ret); + return 0; +} + +void cbd_handler_notify(struct cbd_handler *handler) +{ + queue_delayed_work(handler->cbdb->task_wq, &handler->handle_work, 0); +} + +static void handle_work_fn(struct work_struct *work) +{ + struct cbd_handler *handler = container_of(work, struct cbd_handler, + handle_work.work); + struct cbd_se *se; + u64 
req_tid; + int ret; + +again: + /* channel ctrl would be updated by blkdev queue */ + se = get_se_to_handle(handler); + if (se == get_se_head(handler)) + goto miss; + + req_tid = se->req_tid; + if (handler->req_tid_expected != U64_MAX && + req_tid != handler->req_tid_expected) { + cbd_handler_err(handler, "req_tid (%llu) is not expected (%llu)", + req_tid, handler->req_tid_expected); + goto miss; + } + +#ifdef CONFIG_CBD_CRC + if (se->se_crc != cbd_se_crc(se)) { + cbd_handler_err(handler, "se crc(0x%x) is not expected(0x%x)", + cbd_se_crc(se), se->se_crc); + goto miss; + } + + if (se->op == CBD_OP_WRITE && + se->data_crc != cbd_channel_crc(&handler->channel, + se->data_off, + se->data_len)) { + cbd_handler_err(handler, "data crc(0x%x) is not expected(0x%x)", + cbd_channel_crc(&handler->channel, se->data_off, se->data_len), + se->data_crc); + goto miss; + } +#endif + + cbdwc_hit(&handler->handle_worker_cfg); + ret = handle_backend_cmd(handler, se); + if (!ret) { + /* this se is handled */ + handler->req_tid_expected = req_tid + 1; + handler->se_to_handle = (handler->se_to_handle + sizeof(struct cbd_se)) % + handler->channel.submr_size; + } + + goto again; + +miss: + if (cbdwc_need_retry(&handler->handle_worker_cfg)) + goto again; + + cbdwc_miss(&handler->handle_worker_cfg); + + if (handler->polling) + queue_delayed_work(handler->cbdb->task_wq, &handler->handle_work, 0); +} +
+int cbd_handler_create(struct cbd_backend *cbdb, u32 channel_id) +{ + struct cbd_transport *cbdt = cbdb->cbdt; + struct cbd_handler *handler; + int ret; + + handler = kzalloc(sizeof(struct cbd_handler), GFP_KERNEL); + if (!handler) + return -ENOMEM; + + ret = bioset_init(&handler->bioset, 256, 0, BIOSET_NEED_BVECS); + if (ret) + goto err; + + handler->cbdb = cbdb; + cbd_channel_init(&handler->channel, cbdt, channel_id); + handler->channel_info = handler->channel.channel_info; + + if (handler->channel_info->polling) + handler->polling = true; + + handler->se_to_handle = handler->channel_info->submr_tail; + handler->req_tid_expected = U64_MAX; + + spin_lock_init(&handler->compr_lock); + INIT_DELAYED_WORK(&handler->handle_work, handle_work_fn); + + cbdwc_init(&handler->handle_worker_cfg); + + cbdb_add_handler(cbdb, handler); + handler->channel_info->backend_state = cbdc_backend_state_running; + + queue_delayed_work(cbdb->task_wq, &handler->handle_work, 0); + + return 0; +err: + kfree(handler); + return ret; +} +
+void cbd_handler_destroy(struct cbd_handler *handler) +{ + cbdb_del_handler(handler->cbdb, handler); + + cancel_delayed_work_sync(&handler->handle_work); + + handler->channel_info->backend_state = cbdc_backend_state_none; + cbd_channel_exit(&handler->channel); + + bioset_exit(&handler->bioset); + kfree(handler); +}
From patchwork Wed Sep 18 10:18:21 2024 X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 13806781
From: Dongsheng Yang To: axboe@kernel.dk, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, Dongsheng Yang Subject: [PATCH v2 8/8] block: Init for CBD(CXL Block Device) module Date: Wed, 18 Sep 2024 10:18:21 +0000 Message-Id: <20240918101821.681118-9-dongsheng.yang@linux.dev> In-Reply-To: <20240918101821.681118-1-dongsheng.yang@linux.dev> References: <20240918101821.681118-1-dongsheng.yang@linux.dev>
CBD (CXL Block Device) provides two usage scenarios: single-host and multi-host. (1) Single-host scenario: CBD can use a pmem device as a cache for block devices, providing a caching mechanism specifically designed for persistent memory.
+-----------------------------------------------------------------+ | single-host | +-----------------------------------------------------------------+ | | | | | | | | | | | +-----------+ +------------+ | | | /dev/cbd0 | | /dev/cbd1 | | | | | | | | | +---------------------|-----------|-----|------------|-------+ | | | | | | | | | | | /dev/pmem0 | cbd0 cache| | cbd1 cache | | | | | | | | | | | | +---------------------|-----------|-----|------------|-------+ | | |+---------+| |+----------+| | | ||/dev/sda || || /dev/sdb || | | |+---------+| |+----------+| | | +-----------+ +------------+ | +-----------------------------------------------------------------+
(2) Multi-host scenario: CBD also provides a cache while taking advantage of shared memory features, allowing users to access block devices on other nodes across different hosts. As shared memory is supported in the CXL 3.0 spec, we can transfer data via CXL shared memory. CBD uses CXL shared memory to transfer data between node-1 and node-2. This scenario requires your shared memory device to support hardware consistency as described in CXL 3.0, and CONFIG_CBD_MULTIHOST to be enabled.
Signed-off-by: Dongsheng Yang --- drivers/block/Kconfig | 2 + drivers/block/Makefile | 2 + drivers/block/cbd/Kconfig | 45 +++++++ drivers/block/cbd/Makefile | 3 + drivers/block/cbd/cbd_main.c | 224 +++++++++++++++++++++++++++++++++++ 5 files changed, 276 insertions(+) create mode 100644 drivers/block/cbd/Kconfig create mode 100644 drivers/block/cbd/Makefile create mode 100644 drivers/block/cbd/cbd_main.c
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 5b9d4aaebb81..1f6376828af9 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -219,6 +219,8 @@ config BLK_DEV_NBD If unsure, say N. +source "drivers/block/cbd/Kconfig" + config BLK_DEV_RAM tristate "RAM block device support" help
diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 101612cba303..8be2a39f5a7c 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -39,4 +39,6 @@ obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/ obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o +obj-$(CONFIG_BLK_DEV_CBD) += cbd/ + swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/cbd/Kconfig b/drivers/block/cbd/Kconfig new file mode 100644 index 000000000000..16ffcca058c5 --- /dev/null +++ b/drivers/block/cbd/Kconfig @@ -0,0 +1,45 @@ +config BLK_DEV_CBD + tristate "CXL Block Device (Experimental)" + depends on DEV_DAX && FS_DAX + help + CBD allows you to register a persistent memory device as a CBD transport. + You can use this persistent memory as a data cache to improve your block + device performance. Additionally, if you enable CBD_MULTIHOST, cbd allows + you to access block devices on a remote host as if they were local disks. + + Select 'y' to build this module directly into the kernel. + Select 'm' to build this module as a loadable kernel module. + + If unsure, say 'N'. +
+config CBD_CRC + bool "Enable CBD checksum" + default n + depends on BLK_DEV_CBD + help + When CBD_CRC is enabled, all data sent by CBD will include + a checksum. This includes a data checksum, a submit entry checksum, + and a completion entry checksum. This ensures the integrity of the + data transmitted through the CXL memory device. +
+config CBD_DEBUG + bool "Enable CBD debug" + default n + depends on BLK_DEV_CBD + help + When CBD_DEBUG is enabled, the cbd module will print more messages + for debugging. But that will affect performance, so do not use it + in production.
+ +config CBD_MULTIHOST + bool "multi-host CXL Block Device" + default n + depends on BLK_DEV_CBD + help + When CBD_MULTIHOST is enabled, cbd allows the use of a shared memory device + as a cbd transport. In this mode, the blkdev and backends on different + hosts can be connected through the shared memory device, enabling cross-node + disk access. + + IMPORTANT: This requires your shared memory device to support hardware + consistency as described in the CXL 3.0 spec.
diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile new file mode 100644 index 000000000000..ee61f7e2b978 --- /dev/null +++ b/drivers/block/cbd/Makefile @@ -0,0 +1,3 @@ +cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o cbd_blkdev.o cbd_queue.o cbd_segment.o cbd_cache.o + +obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_main.c b/drivers/block/cbd/cbd_main.c new file mode 100644 index 000000000000..066596ca9b82 --- /dev/null +++ b/drivers/block/cbd/cbd_main.c @@ -0,0 +1,224 @@ +/* + * Copyright(C) 2024, Dongsheng Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "cbd_internal.h" +
+struct workqueue_struct *cbd_wq; +
+enum { + CBDT_REG_OPT_ERR = 0, + CBDT_REG_OPT_FORCE, + CBDT_REG_OPT_FORMAT, + CBDT_REG_OPT_PATH, + CBDT_REG_OPT_HOSTNAME, +}; +
+static const match_table_t register_opt_tokens = { + { CBDT_REG_OPT_FORCE, "force=%u" }, + { CBDT_REG_OPT_FORMAT, "format=%u" }, + { CBDT_REG_OPT_PATH, "path=%s" }, + { CBDT_REG_OPT_HOSTNAME, "hostname=%s" }, + { CBDT_REG_OPT_ERR, NULL } }; +
+static int parse_register_options( + char *buf, + struct cbdt_register_options *opts) +{ + substring_t args[MAX_OPT_ARGS]; + char *o, *p; + int token, ret = 0; + + o = buf; + + while ((p = strsep(&o, ",\n")) != NULL) { + if (!*p) + continue; + + token = match_token(p, register_opt_tokens, args); + switch (token) { + case CBDT_REG_OPT_PATH: + if (match_strlcpy(opts->path, &args[0], + CBD_PATH_LEN) == 0) { + ret = -EINVAL; + break; + } + break; + case CBDT_REG_OPT_FORCE: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->force = (token != 0); + break; + case CBDT_REG_OPT_FORMAT: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->format = (token != 0); + break; + case CBDT_REG_OPT_HOSTNAME: + if (match_strlcpy(opts->hostname, &args[0], + CBD_NAME_LEN) == 0) { + ret = -EINVAL; + break; + } + break; + default: + pr_err("unknown parameter or missing value '%s'\n", p); + ret = -EINVAL; + goto out; + } + } + +out: + return ret; +} +
+static ssize_t transport_unregister_store(const struct bus_type *bus, const char *ubuf, + size_t size) +{ + u32 transport_id; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (sscanf(ubuf, "transport_id=%u", &transport_id) != 1) + return -EINVAL; + + ret = cbdt_unregister(transport_id); + if (ret < 0) + return ret; + + return size; } +
+static ssize_t transport_register_store(const struct bus_type *bus, const char *ubuf, + size_t size) +{ + struct cbdt_register_options opts = { 0 }; + char *buf; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* duplicate and NUL-terminate the option string before parsing */ + buf = kmemdup_nul(ubuf, size, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + ret = parse_register_options(buf, &opts); + if (ret < 0) { + kfree(buf); + return ret; + } + kfree(buf); + + ret =
cbdt_register(&opts); + if (ret < 0) + return ret; + + return size; +} + +static BUS_ATTR_WO(transport_unregister); +static BUS_ATTR_WO(transport_register); + +static struct attribute *cbd_bus_attrs[] = { + &bus_attr_transport_unregister.attr, + &bus_attr_transport_register.attr, + NULL, +}; + +static const struct attribute_group cbd_bus_group = { + .attrs = cbd_bus_attrs, +}; +__ATTRIBUTE_GROUPS(cbd_bus); + +const struct bus_type cbd_bus_type = { + .name = "cbd", + .bus_groups = cbd_bus_groups, +}; + +static void cbd_root_dev_release(struct device *dev) +{ +} + +struct device cbd_root_dev = { + .init_name = "cbd", + .release = cbd_root_dev_release, +}; + +static int __init cbd_init(void) +{ + int ret; + + cbd_wq = alloc_workqueue(CBD_DRV_NAME, WQ_UNBOUND | WQ_MEM_RECLAIM, 0); + if (!cbd_wq) + return -ENOMEM; + + ret = device_register(&cbd_root_dev); + if (ret < 0) { + put_device(&cbd_root_dev); + goto destroy_wq; + } + + ret = bus_register(&cbd_bus_type); + if (ret < 0) + goto device_unregister; + + ret = cbd_blkdev_init(); + if (ret < 0) + goto bus_unregister; + + return 0; + +bus_unregister: + bus_unregister(&cbd_bus_type); +device_unregister: + device_unregister(&cbd_root_dev); +destroy_wq: + destroy_workqueue(cbd_wq); + + return ret; +} + +static void cbd_exit(void) +{ + cbd_blkdev_exit(); + bus_unregister(&cbd_bus_type); + device_unregister(&cbd_root_dev); + + destroy_workqueue(cbd_wq); +} + +MODULE_AUTHOR("Dongsheng Yang "); +MODULE_DESCRIPTION("CXL(Compute Express Link) Block Device"); +MODULE_LICENSE("GPL v2"); +module_init(cbd_init); +module_exit(cbd_exit);
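As a usage illustration of the sysfs interface registered above (the "cbd" bus with its transport_register and transport_unregister attributes), the sketch below drives it from user space. It is only a sketch: the pmem path, hostname, and transport_id values are placeholders, and the writes require CAP_SYS_ADMIN.

/*
 * Illustrative user-space sketch for exercising the cbd bus attributes
 * registered in cbd_main.c. The device path, hostname and transport_id
 * below are example values, not part of the patch set.
 */
#include <stdio.h>

int main(void)
{
	FILE *f;

	/* register a transport; options follow register_opt_tokens */
	f = fopen("/sys/bus/cbd/transport_register", "w");
	if (!f)
		return 1;
	fprintf(f, "path=/dev/pmem0,hostname=node-1,format=1,force=0\n");
	if (fclose(f))
		return 1;

	/* tear the transport down again by id */
	f = fopen("/sys/bus/cbd/transport_unregister", "w");
	if (!f)
		return 1;
	fprintf(f, "transport_id=0\n");
	if (fclose(f))
		return 1;

	return 0;
}

Each write is parsed by parse_register_options() or transport_unregister_store() shown above; if the store callback fails, the write fails and fclose() reports the error here.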