From patchwork Wed Feb 10 05:07:23 2021
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12079859
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, dongdong tao, Coly Li
Subject: [PATCH 01/20] bcache: consider the fragmentation when update the writeback rate
Date: Wed, 10 Feb 2021 13:07:23 +0800
Message-Id: <20210210050742.31237-2-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>
X-Mailing-List: linux-block@vger.kernel.org

From: dongdong tao

The current way of calculating the writeback rate
only considers the dirty sectors. This usually works fine when fragmentation is not high, but it yields an unreasonably small rate when very few dirty sectors consume a lot of dirty buckets. In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even reached writeback_percent; the writeback rate then stays at the minimum value (4k), so all writes get stuck in a non-writeback mode because of the slow writeback.

We accelerate the rate in 3 stages with different aggressiveness: the first stage starts when the dirty bucket percentage goes above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default, the first stage tries to write back the amount of dirty data in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds, the second stage in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third stage in (1 / (dirty_buckets_percent - 64)) milliseconds.

The initial rate at each stage can be controlled by 3 configurable parameters, writeback_rate_fp_term_{low|mid|high}, which default to 1, 10 and 1000 respectively. The IO throughput these values try to achieve is described by the paragraph above; the reason I chose these defaults is based on testing and on production data. Some details:

A. When it comes to the low stage, we are still a fair way from the 70 threshold, so we only want to give it a little push by setting the term to 1. That means the initial rate will be 170 if the fragment is 6 (it is calculated as bucket_size/fragment). This rate is very small, but still much more reasonable than the minimum of 8.
For a production bcache with a light workload, if the cache device is bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is very likely that enough dirty buckets can be reclaimed during this stage, avoiding the next stage entirely.

B. If the dirty bucket ratio does not turn around during the first stage, we enter the mid stage, which needs to be more aggressive than the low stage, so I chose an initial rate 10 times higher: 1700 if the fragment is 6. This is a typical rate we usually see for a normal workload when writeback happens because of writeback_percent.

C. If the dirty bucket ratio does not turn around during the low and mid stages, we enter the third stage, the last chance to turn around and avoid the horrible cutoff writeback sync issue. Here we are 100 times more aggressive than the mid stage: 170000 as the initial rate if the fragment is 6. This is also inferred from production: I collected one week of writeback rate data from a production bcache with quite heavy workloads (again, writeback triggered by writeback_percent), and the highest rate area was around 100000 to 240000, so I believe this level of aggressiveness is reasonable for production. It should also be more than enough, because the hint tries to reclaim 1000 buckets per second, while that heavy production environment consumed 50 buckets per second on average over the week.

The writeback_consider_fragment option controls whether this feature is on or off; it is on by default.
Lastly, the link below contains the performance data for all the testing results, including the data from the production environment:

https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing

Signed-off-by: dongdong tao
Signed-off-by: Coly Li
---
 drivers/md/bcache/bcache.h    |  4 ++++
 drivers/md/bcache/sysfs.c     | 23 +++++++++++++++++++
 drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/bcache/writeback.h |  4 ++++
 4 files changed, 73 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..d7a84327b7f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -373,6 +373,7 @@ struct cached_dev {
 	unsigned int		partial_stripes_expensive:1;
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
+	unsigned int		writeback_consider_fragment:1;
 	unsigned char		writeback_percent;
 	unsigned int		writeback_delay;
@@ -385,6 +386,9 @@ struct cached_dev {
 	unsigned int		writeback_rate_update_seconds;
 	unsigned int		writeback_rate_i_term_inverse;
 	unsigned int		writeback_rate_p_term_inverse;
+	unsigned int		writeback_rate_fp_term_low;
+	unsigned int		writeback_rate_fp_term_mid;
+	unsigned int		writeback_rate_fp_term_high;
 	unsigned int		writeback_rate_minimum;
 	enum stop_on_failure	stop_when_cache_set_failed;

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 00a520c03f41..eef15f8022ba 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -117,10 +117,14 @@ rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
+rw_attribute(writeback_consider_fragment);
 rw_attribute(writeback_rate_update_seconds);
 rw_attribute(writeback_rate_i_term_inverse);
 rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
 rw_attribute(writeback_rate_minimum);
 read_attribute(writeback_rate_debug);
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
 	var_printf(bypass_torture_test,	"%i");
 	var_printf(writeback_metadata,	"%i");
 	var_printf(writeback_running,	"%i");
+	var_printf(writeback_consider_fragment,	"%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
 	sysfs_hprint(writeback_rate,
@@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
 	var_print(writeback_rate_update_seconds);
 	var_print(writeback_rate_i_term_inverse);
 	var_print(writeback_rate_p_term_inverse);
+	var_print(writeback_rate_fp_term_low);
+	var_print(writeback_rate_fp_term_mid);
+	var_print(writeback_rate_fp_term_high);
 	var_print(writeback_rate_minimum);

 	if (attr == &sysfs_writeback_rate_debug) {
@@ -303,6 +311,7 @@ STORE(__cached_dev)
 	sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
 	sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
 	sysfs_strtoul_bool(writeback_running, dc->writeback_running);
+	sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
 	sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);

 	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
@@ -331,6 +340,16 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
			    dc->writeback_rate_p_term_inverse,
			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+			    dc->writeback_rate_fp_term_low,
+			    1, dc->writeback_rate_fp_term_mid - 1);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+			    dc->writeback_rate_fp_term_mid,
+			    dc->writeback_rate_fp_term_low + 1,
+			    dc->writeback_rate_fp_term_high - 1);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+			    dc->writeback_rate_fp_term_high,
+			    dc->writeback_rate_fp_term_mid + 1, UINT_MAX);
 	sysfs_strtoul_clamp(writeback_rate_minimum,
			    dc->writeback_rate_minimum,
			    1, UINT_MAX);
@@ -499,9 +518,13 @@ static struct attribute *bch_cached_dev_files[] = {
	&sysfs_writeback_delay,
	&sysfs_writeback_percent,
	&sysfs_writeback_rate,
+	&sysfs_writeback_consider_fragment,
	&sysfs_writeback_rate_update_seconds,
	&sysfs_writeback_rate_i_term_inverse,
	&sysfs_writeback_rate_p_term_inverse,
+	&sysfs_writeback_rate_fp_term_low,
+	&sysfs_writeback_rate_fp_term_mid,
+	&sysfs_writeback_rate_fp_term_high,
	&sysfs_writeback_rate_minimum,
	&sysfs_writeback_rate_debug,
	&sysfs_io_errors,

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..82d4e0880a99 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t integral_scaled;
 	uint32_t new_rate;

+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating proportional_scaled; otherwise we might
+	 * have an unreasonably small writeback rate in a highly
+	 * fragmented situation where very few dirty sectors consume
+	 * a lot of dirty buckets. The worst case is when the dirty
+	 * buckets reach cutoff_writeback_sync while the dirty data
+	 * has not even reached the writeback percent; the rate will
+	 * still be at the minimum value, which makes writes stall
+	 * in a non-writeback mode.
+	 */
+	struct cache_set *c = dc->disk.c;
+
+	int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+	if (dc->writeback_consider_fragment &&
+	    c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+		int64_t fragment =
+			div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty);
+		int64_t fp_term;
+		int64_t fps;
+
+		if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+			fp_term = dc->writeback_rate_fp_term_low *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+		} else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+			fp_term = dc->writeback_rate_fp_term_mid *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+		} else {
+			fp_term = dc->writeback_rate_fp_term_high *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+		}
+		fps = div_s64(dirty, dirty_buckets) * fp_term;
+		if (fragment > 3 && fps > proportional_scaled) {
+			/* Only override the p term when fragment > 3 */
+			proportional_scaled = fps;
+		}
+	}
+
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    (error > 0 && time_before64(local_clock(),
 			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)

 	dc->writeback_metadata		= true;
 	dc->writeback_running		= false;
+	dc->writeback_consider_fragment = true;
 	dc->writeback_percent		= 10;
 	dc->writeback_delay		= 30;
 	atomic_long_set(&dc->writeback_rate.rate, 1024);
@@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
 	dc->writeback_rate_p_term_inverse = 40;
+	dc->writeback_rate_fp_term_low = 1;
+	dc->writeback_rate_fp_term_mid = 10;
+	dc->writeback_rate_fp_term_high = 1000;
 	dc->writeback_rate_i_term_inverse = 10000;

 	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
---
a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -16,6 +16,10 @@ #define BCH_AUTO_GC_DIRTY_THRESHOLD 50 +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50 +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57 +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64 + #define BCH_DIRTY_INIT_THRD_MAX 64 /* * 14 (16384ths) is chosen here as something that each backing device From patchwork Wed Feb 10 05:07:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079857 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 684D3C433E9 for ; Wed, 10 Feb 2021 05:08:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3EC3364E4B for ; Wed, 10 Feb 2021 05:08:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230442AbhBJFIr (ORCPT ); Wed, 10 Feb 2021 00:08:47 -0500 Received: from mx2.suse.de ([195.135.220.15]:40128 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229805AbhBJFIq (ORCPT ); Wed, 10 Feb 2021 00:08:46 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 188BEAFFB; Wed, 10 Feb 2021 05:08:04 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Kai Krakow , Coly Li Subject: [PATCH 02/20] bcache: Fix register_device_aync typo Date: Wed, 10 Feb 2021 
13:07:24 +0800 Message-Id: <20210210050742.31237-3-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Kai Krakow Should be `register_device_async`. Cc: Coly Li Signed-off-by: Kai Krakow Signed-off-by: Coly Li --- drivers/md/bcache/super.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 2047a9cccdb5..e7d1b52c5cc8 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2517,7 +2517,7 @@ static void register_cache_worker(struct work_struct *work) module_put(THIS_MODULE); } -static void register_device_aync(struct async_reg_args *args) +static void register_device_async(struct async_reg_args *args) { if (SB_IS_BDEV(args->sb)) INIT_DELAYED_WORK(&args->reg_work, register_bdev_worker); @@ -2611,7 +2611,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, args->sb = sb; args->sb_disk = sb_disk; args->bdev = bdev; - register_device_aync(args); + register_device_async(args); /* No wait and returns to user space */ goto async_done; } From patchwork Wed Feb 10 05:07:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079861 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 318C5C433E6 for ; Wed, 10 Feb 2021 05:08:50 +0000 
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Kai Krakow, Coly Li, stable@vger.kernel.org
Subject: [PATCH 03/20] Revert "bcache: Kill btree_io_wq"
Date: Wed, 10 Feb 2021 13:07:25 +0800
Message-Id: <20210210050742.31237-4-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>

From: Kai Krakow

This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee.

With the btree using the `system_wq`, I seem to see a lot more desktop latency than I should. After some more investigation, it looks like the original assumption of 56b3077 is no longer true, and bcache has a very high potential of congesting the `system_wq`. In turn, this introduces laggy desktop performance, IO stalls (at least with btrfs), and input events may be delayed.

So let's revert this. It's important to note that the semantics of using `system_wq` previously mean that `btree_io_wq` should be created before and destroyed after the other bcache wqs to keep the same assumptions.
Cc: Coly Li Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow Signed-off-by: Coly Li --- drivers/md/bcache/bcache.h | 2 ++ drivers/md/bcache/btree.c | 21 +++++++++++++++++++-- drivers/md/bcache/super.c | 4 ++++ 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index d7a84327b7f1..2b8c7dd2cfae 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -1046,5 +1046,7 @@ void bch_debug_exit(void); void bch_debug_init(void); void bch_request_exit(void); int bch_request_init(void); +void bch_btree_exit(void); +int bch_btree_init(void); #endif /* _BCACHE_H */ diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 910df242c83d..952f022db5a5 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -99,6 +99,8 @@ #define PTR_HASH(c, k) \ (((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0)) +static struct workqueue_struct *btree_io_wq; + #define insert_lock(s, b) ((b)->level <= (s)->lock) @@ -308,7 +310,7 @@ static void __btree_node_write_done(struct closure *cl) btree_complete_write(b, w); if (btree_node_dirty(b)) - schedule_delayed_work(&b->work, 30 * HZ); + queue_delayed_work(btree_io_wq, &b->work, 30 * HZ); closure_return_with_destructor(cl, btree_node_write_unlock); } @@ -481,7 +483,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref) BUG_ON(!i->keys); if (!btree_node_dirty(b)) - schedule_delayed_work(&b->work, 30 * HZ); + queue_delayed_work(btree_io_wq, &b->work, 30 * HZ); set_btree_node_dirty(b); @@ -2764,3 +2766,18 @@ void bch_keybuf_init(struct keybuf *buf) spin_lock_init(&buf->lock); array_allocator_init(&buf->freelist); } + +void bch_btree_exit(void) +{ + if (btree_io_wq) + destroy_workqueue(btree_io_wq); +} + +int __init bch_btree_init(void) +{ + btree_io_wq = create_singlethread_workqueue("bch_btree_io"); + if (!btree_io_wq) + return -ENOMEM; + + return 0; +} diff --git a/drivers/md/bcache/super.c 
b/drivers/md/bcache/super.c index e7d1b52c5cc8..85a44a0cffe0 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2821,6 +2821,7 @@ static void bcache_exit(void) destroy_workqueue(bcache_wq); if (bch_journal_wq) destroy_workqueue(bch_journal_wq); + bch_btree_exit(); if (bcache_major) unregister_blkdev(bcache_major, "bcache"); @@ -2876,6 +2877,9 @@ static int __init bcache_init(void) return bcache_major; } + if (bch_btree_init()) + goto err; + bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0); if (!bcache_wq) goto err; From patchwork Wed Feb 10 05:07:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079863 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A1D96C433E6 for ; Wed, 10 Feb 2021 05:08:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7D6AE64E42 for ; Wed, 10 Feb 2021 05:08:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230466AbhBJFIv (ORCPT ); Wed, 10 Feb 2021 00:08:51 -0500 Received: from mx2.suse.de ([195.135.220.15]:40156 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230429AbhBJFIu (ORCPT ); Wed, 10 Feb 2021 00:08:50 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 2BAE3B023; Wed, 10 Feb 2021 05:08:08 +0000 (UTC) From: Coly Li To: 
axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Kai Krakow , Coly Li , stable@vger.kernel.org Subject: [PATCH 04/20] bcache: Give btree_io_wq correct semantics again Date: Wed, 10 Feb 2021 13:07:26 +0800 Message-Id: <20210210050742.31237-5-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Kai Krakow Before killing `btree_io_wq`, the queue was allocated using `create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After killing it, it no longer had this property but `system_wq` is not single threaded. Let's combine both worlds and make it multi threaded but able to reclaim memory. Cc: Coly Li Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow Signed-off-by: Coly Li --- drivers/md/bcache/btree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 952f022db5a5..fe6dce125aba 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2775,7 +2775,7 @@ void bch_btree_exit(void) int __init bch_btree_init(void) { - btree_io_wq = create_singlethread_workqueue("bch_btree_io"); + btree_io_wq = alloc_workqueue("bch_btree_io", WQ_MEM_RECLAIM, 0); if (!btree_io_wq) return -ENOMEM; From patchwork Wed Feb 10 05:07:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079865 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: 
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Kai Krakow, Coly Li, stable@vger.kernel.org
Subject: [PATCH 05/20] bcache: Move journal work to new flush wq
Date: Wed, 10 Feb 2021 13:07:27 +0800
Message-Id: <20210210050742.31237-6-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>

From: Kai Krakow

This is potentially long running and not latency sensitive; let's get it out of the way of other latency sensitive events. As observed in the previous commit, the `system_wq` easily becomes congested by bcache, and this fixes a few more stalls I was observing every once in a while.

Let's not make this `WQ_MEM_RECLAIM`, as it was shown to reduce the performance of boot and file system operations in my tests. Also, without `WQ_MEM_RECLAIM`, I no longer see desktop stalls.
This matches the previous behavior as `system_wq` also does no memory reclaim: > // workqueue.c: > system_wq = alloc_workqueue("events", 0, 0); Cc: Coly Li Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow Signed-off-by: Coly Li --- drivers/md/bcache/bcache.h | 1 + drivers/md/bcache/journal.c | 4 ++-- drivers/md/bcache/super.c | 16 ++++++++++++++++ 3 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 2b8c7dd2cfae..848dd4db1659 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -1005,6 +1005,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent); extern struct workqueue_struct *bcache_wq; extern struct workqueue_struct *bch_journal_wq; +extern struct workqueue_struct *bch_flush_wq; extern struct mutex bch_register_lock; extern struct list_head bch_cache_sets; diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index aefbdb7e003b..c6613e817333 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -932,8 +932,8 @@ atomic_t *bch_journal(struct cache_set *c, journal_try_write(c); } else if (!w->dirty) { w->dirty = true; - schedule_delayed_work(&c->journal.work, - msecs_to_jiffies(c->journal_delay_ms)); + queue_delayed_work(bch_flush_wq, &c->journal.work, + msecs_to_jiffies(c->journal_delay_ms)); spin_unlock(&c->journal.lock); } else { spin_unlock(&c->journal.lock); diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 85a44a0cffe0..0228ccb293fc 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -49,6 +49,7 @@ static int bcache_major; static DEFINE_IDA(bcache_device_idx); static wait_queue_head_t unregister_wait; struct workqueue_struct *bcache_wq; +struct workqueue_struct *bch_flush_wq; struct workqueue_struct *bch_journal_wq; @@ -2821,6 +2822,8 @@ static void bcache_exit(void) destroy_workqueue(bcache_wq); if (bch_journal_wq) 
 		destroy_workqueue(bch_journal_wq);
+	if (bch_flush_wq)
+		destroy_workqueue(bch_flush_wq);
 	bch_btree_exit();

 	if (bcache_major)
@@ -2884,6 +2887,19 @@ static int __init bcache_init(void)
 	if (!bcache_wq)
 		goto err;

+	/*
+	 * Let's not make this `WQ_MEM_RECLAIM` for the following reasons:
+	 *
+	 * 1. It used `system_wq` before which also does no memory reclaim.
+	 * 2. With `WQ_MEM_RECLAIM` desktop stalls, increased boot times, and
+	 *    reduced throughput can be observed.
+	 *
+	 * We still want to use our own queue so as not to congest the
+	 * `system_wq`.
+	 */
+	bch_flush_wq = alloc_workqueue("bch_flush", 0, 0);
+	if (!bch_flush_wq)
+		goto err;
+
 	bch_journal_wq = alloc_workqueue("bch_journal", WQ_MEM_RECLAIM, 0);
 	if (!bch_journal_wq)
 		goto err;

From patchwork Wed Feb 10 05:07:28 2021
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 12079867
X-Virus-Scanned: by amavisd-new at
test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 22E37B062; Wed, 10 Feb 2021 05:08:12 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Joe Perches , Coly Li Subject: [PATCH 06/20] bcache: Avoid comma separated statements Date: Wed, 10 Feb 2021 13:07:28 +0800 Message-Id: <20210210050742.31237-7-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Joe Perches Use semicolons and braces. Signed-off-by: Joe Perches Signed-off-by: Coly Li --- drivers/md/bcache/bset.c | 12 ++++++++---- drivers/md/bcache/sysfs.c | 6 ++++-- 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index 67a2c47f4201..94d38e8a59b3 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -712,8 +712,10 @@ void bch_bset_build_written_tree(struct btree_keys *b) for (j = inorder_next(0, t->size); j; j = inorder_next(j, t->size)) { - while (bkey_to_cacheline(t, k) < cacheline) - prev = k, k = bkey_next(k); + while (bkey_to_cacheline(t, k) < cacheline) { + prev = k; + k = bkey_next(k); + } t->prev[j] = bkey_u64s(prev); t->tree[j].m = bkey_to_cacheline_offset(t, cacheline++, k); @@ -901,8 +903,10 @@ unsigned int bch_btree_insert_key(struct btree_keys *b, struct bkey *k, status = BTREE_INSERT_STATUS_INSERT; while (m != bset_bkey_last(i) && - bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0) - prev = m, m = bkey_next(m); + bkey_cmp(k, b->ops->is_extents ? 
&START_KEY(m) : m) > 0) { + prev = m; + m = bkey_next(m); + } /* prev is in the tree, if we merge we're done */ status = BTREE_INSERT_STATUS_BACK_MERGE; diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index eef15f8022ba..cc89f3156d1a 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -1094,8 +1094,10 @@ SHOW(__bch_cache) --n; while (cached < p + n && - *cached == BTREE_PRIO) - cached++, n--; + *cached == BTREE_PRIO) { + cached++; + n--; + } for (i = 0; i < n; i++) sum += INITIAL_PRIO - cached[i]; From patchwork Wed Feb 10 05:07:29 2021 X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079873 From: Coly Li To: axboe@kernel.dk Cc:
linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li , Jianpeng Ma , Qiaowei Ren Subject: [PATCH 07/20] bcache: add initial data structures for nvm pages Date: Wed, 10 Feb 2021 13:07:29 +0800 Message-Id: <20210210050742.31237-8-colyli@suse.de> In-Reply-To: <20210210050742.31237-1-colyli@suse.de> This patch initializes the prototype data structures for the nvm pages allocator, - struct bch_nvm_pages_sb This is the super block allocated on each nvdimm namespace. An nvdimm set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used to mark which nvdimm set this namespace belongs to. Normally the bcache cache set UUID is used to initialize this uuid, to connect this nvdimm set to a specified bcache cache set. - struct bch_owner_list_head This is a table for the heads of all owner lists. An owner list records which page(s) are allocated to which owner. After a reboot from power failure, the owner may find all its requested and allocated pages from the owner list via a handle derived from its UUID. - struct bch_nvm_pages_owner_head This is the head of an owner list. Each owner has only one owner list, and an nvm page belongs to exactly one specific owner. uuid[] is set to the owner's uuid; for bcache it is the cache set uuid. label is not mandatory; it is a human-readable string for debugging purposes. The pointer *recs references a separate nvm page which holds the table of struct bch_nvm_pgalloc_rec. - struct bch_nvm_pgalloc_recs This struct occupies a whole page; owner_uuid should match the uuid in struct bch_nvm_pages_owner_head. recs[] is the real table that contains all allocated records. - struct bch_nvm_pgalloc_rec Each structure records a range of allocated nvm pages. pgoff is the offset, in units of page size, of this allocated nvm page range.
The adjacent page ranges of the same owner can be merged into a larger one, therefore pages_nr is NOT always a power of 2. Signed-off-by: Coly Li Cc: Jianpeng Ma Cc: Qiaowei Ren --- include/uapi/linux/bcache-nvm.h | 195 ++++++++++++++++++++++++++++++++ 1 file changed, 195 insertions(+) create mode 100644 include/uapi/linux/bcache-nvm.h diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h new file mode 100644 index 000000000000..61108bf2a63e --- /dev/null +++ b/include/uapi/linux/bcache-nvm.h @@ -0,0 +1,195 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ + +#ifndef _UAPI_BCACHE_NVM_H +#define _UAPI_BCACHE_NVM_H + +/* + * Bcache on NVDIMM data structures + */ + +/* + * - struct bch_nvm_pages_sb + * This is the super block allocated on each nvdimm namespace. An nvdimm + * set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used to mark + * which nvdimm set this namespace belongs to. Normally the bcache cache + * set UUID is used to initialize this uuid, to connect this nvdimm + * set to a specified bcache cache set. + * + * - struct bch_owner_list_head + * This is a table for the heads of all owner lists. An owner list records + * which page(s) are allocated to which owner. After a reboot from power + * failure, the owner may find all its requested and allocated pages from + * the owner list via a handle derived from its UUID. + * + * - struct bch_nvm_pages_owner_head + * This is the head of an owner list. Each owner has only one owner list, + * and an nvm page belongs to exactly one specific owner. uuid[] is set to + * the owner's uuid; for bcache it is the cache set uuid. label is not + * mandatory; it is a human-readable string for debugging purposes. The pointer + * recs references a separate nvm page which holds the table of struct + * bch_pgalloc_rec. + * + * - struct bch_nvm_pgalloc_recs + * This structure occupies a whole page; owner_uuid should match the uuid + * in struct bch_nvm_pages_owner_head.
recs[] is the real table contains all + * allocated records. + * + * - struct bch_pgalloc_rec + * Each structure records a range of allocated nvm pages. pgoff is offset + * in unit of page size of this allocated nvm page range. The adjoint page + * ranges of same owner can be merged into a larger one, therefore pages_nr + * is NOT always power of 2. + * + * + * Memory layout on nvdimm namespace 0 + * + * 0 +---------------------------------+ + * | | + * 4KB +---------------------------------+ + * | bch_nvm_pages_sb | + * 8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head + * | bch_owner_list_head | + * | | + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0] + * | bch_nvm_pgalloc_recs | + * | (nvm pages internal usage) | + * 24KB +---------------------------------+ + * | | + * | | + * 16MB +---------------------------------+ + * | allocable nvm pages | + * | for buddy allocator | + * end +---------------------------------+ + * + * + * + * Memory layout on nvdimm namespace N + * (doesn't have owner list) + * + * 0 +---------------------------------+ + * | | + * 4KB +---------------------------------+ + * | bch_nvm_pages_sb | + * 8KB +---------------------------------+ + * | | + * | | + * | | + * | | + * | | + * | | + * 16MB +---------------------------------+ + * | allocable nvm pages | + * | for buddy allocator | + * end +---------------------------------+ + * + */ + +#include + +/* In sectors */ +#define BCH_NVM_PAGES_SB_OFFSET 4096 +#define BCH_NVM_PAGES_OFFSET (16 << 20) + +#define BCH_NVM_PAGES_LABEL_SIZE 32 +#define BCH_NVM_PAGES_NAMESPACES_MAX 8 + +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET (8<<10) +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET (16<<10) + +#define BCH_NVM_PAGES_SB_VERSION 0 +#define BCH_NVM_PAGES_SB_VERSION_MAX 0 + +static const char bch_nvm_pages_magic[] = { + 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83, + 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 }; +static const char 
bch_nvm_pages_pgalloc_magic[] = { + 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9, + 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae }; + +struct bch_pgalloc_rec { + __u32 pgoff; + __u32 nr; +}; + +struct bch_nvm_pgalloc_recs { +union { + struct { + struct bch_nvm_pages_owner_head *owner; + struct bch_nvm_pgalloc_recs *next; + __u8 magic[16]; + __u8 owner_uuid[16]; + __u32 size; + __u32 used; + __u64 _pad[4]; + struct bch_pgalloc_rec recs[]; + }; + __u8 pad[8192]; +}; +}; +#define BCH_MAX_RECS \ + ((sizeof(struct bch_nvm_pgalloc_recs) - \ + offsetof(struct bch_nvm_pgalloc_recs, recs)) / \ + sizeof(struct bch_pgalloc_rec)) + +struct bch_nvm_pages_owner_head { + __u8 uuid[16]; + char label[BCH_NVM_PAGES_LABEL_SIZE]; + /* Per-namespace own lists */ + struct bch_nvm_pgalloc_recs *recs[BCH_NVM_PAGES_NAMESPACES_MAX]; +}; + +/* heads[0] is always for nvm_pages internal usage */ +struct bch_owner_list_head { +union { + struct { + __u32 size; + __u32 used; + __u64 _pad[4]; + struct bch_nvm_pages_owner_head heads[]; + }; + __u8 pad[8192]; +}; +}; +#define BCH_MAX_OWNER_LIST \ + ((sizeof(struct bch_owner_list_head) - \ + offsetof(struct bch_owner_list_head, heads)) / \ + sizeof(struct bch_nvm_pages_owner_head)) + +/* The on-media bit order is local CPU order */ +struct bch_nvm_pages_sb { + __u64 csum; + __u64 ns_start; + __u64 sb_offset; + __u64 version; + __u8 magic[16]; + __u8 uuid[16]; + __u32 page_size; + __u32 total_namespaces_nr; + __u32 this_namespace_nr; + union { + __u8 set_uuid[16]; + __u64 set_magic; + }; + + __u64 flags; + __u64 seq; + + __u64 feature_compat; + __u64 feature_incompat; + __u64 feature_ro_compat; + + /* For allocable nvm pages from buddy systems */ + __u64 pages_offset; + __u64 pages_total; + + __u64 pad[8]; + + /* Only on the first name space */ + struct bch_owner_list_head *owner_list_head; + + /* Just for csum_set() */ + __u32 keys; + __u64 d[0]; +}; + +#endif /* _UAPI_BCACHE_NVM_H */ From patchwork Wed Feb 10 05:07:30 2021 Content-Type: 
text/plain; charset="utf-8" X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079881 From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 08/20] bcache: initialize the nvm pages allocator Date: Wed, 10 Feb 2021 13:07:30 +0800 Message-Id: <20210210050742.31237-9-colyli@suse.de> In-Reply-To: <20210210050742.31237-1-colyli@suse.de> From: Jianpeng Ma This patch defines the prototype data structures in memory and initializes the nvm pages allocator.
The nvm address space which is managed by this allocator can consist of many nvm namespaces, and some namespaces can be composed into one nvm set, like a cache set. For this initial implementation, only one set can be supported. The users of this nvm pages allocator need to call bch_register_namespace() to register the nvdimm device (like /dev/pmemX) into this allocator as an instance of struct bch_nvm_namespace. Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/Kconfig | 6 + drivers/md/bcache/Makefile | 2 +- drivers/md/bcache/nvm-pages.c | 404 ++++++++++++++++++++++++++++++++ drivers/md/bcache/nvm-pages.h | 92 ++++++++ drivers/md/bcache/super.c | 3 + include/uapi/linux/bcache-nvm.h | 7 - 6 files changed, 506 insertions(+), 8 deletions(-) create mode 100644 drivers/md/bcache/nvm-pages.c create mode 100644 drivers/md/bcache/nvm-pages.h diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig index d1ca4d059c20..fdec9905ef40 100644 --- a/drivers/md/bcache/Kconfig +++ b/drivers/md/bcache/Kconfig @@ -35,3 +35,9 @@ config BCACHE_ASYNC_REGISTRATION device path into this file will returns immediately and the real registration work is handled in kernel work queue in asynchronous way. + +config BCACHE_NVM_PAGES + bool "NVDIMM support for bcache (EXPERIMENTAL)" + depends on BCACHE + help + nvm pages allocator for bcache.
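The allocator's address arithmetic is simple: a namespace maps the whole device via DAX to a base address (`ns->kaddr`), and a page offset recorded on media is turned back into a virtual address by a shift and an add, as `nvm_pgoff_to_vaddr()` does later in this patch. A minimal user-space sketch of that translation (not part of the patch; `pgoff_to_vaddr`/`vaddr_to_pgoff` are hypothetical helper names, and a 4 KiB `PAGE_SHIFT` is assumed):

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages, as on x86-64 */

/* Mirrors nvm_pgoff_to_vaddr(): base address plus (pgoff << PAGE_SHIFT). */
void *pgoff_to_vaddr(void *kaddr, uint64_t pgoff)
{
	return (char *)kaddr + (pgoff << PAGE_SHIFT);
}

/* Inverse mapping: recover the on-media page offset from a mapped address. */
uint64_t vaddr_to_pgoff(void *kaddr, void *vaddr)
{
	return (uint64_t)((char *)vaddr - (char *)kaddr) >> PAGE_SHIFT;
}
```

Because the superblock stores only page offsets (`pgoff`, `pages_offset`), the layout stays valid even though the DAX mapping lands at a different virtual address on every boot.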
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile index 5b87e59676b8..948e5ed2ca66 100644 --- a/drivers/md/bcache/Makefile +++ b/drivers/md/bcache/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_BCACHE) += bcache.o bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\ io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\ - util.o writeback.o features.o + util.o writeback.o features.o nvm-pages.o diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c new file mode 100644 index 000000000000..4fa8e2764773 --- /dev/null +++ b/drivers/md/bcache/nvm-pages.c @@ -0,0 +1,404 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Nvdimm page-buddy allocator + * + * Copyright (c) 2021, Intel Corporation. + * Copyright (c) 2021, Qiaowei Ren . + * Copyright (c) 2021, Jianpeng Ma . + */ + +#include "bcache.h" +#include "nvm-pages.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_BCACHE_NVM_PAGES + +static const char bch_nvm_pages_magic[] = { + 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83, + 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 }; +static const char bch_nvm_pages_pgalloc_magic[] = { + 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9, + 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae }; + +struct bch_nvm_set *only_set; + +static struct bch_owner_list *alloc_owner_list(const char *owner_uuid, + const char *label, int total_namespaces) +{ + struct bch_owner_list *owner_list; + + owner_list = kzalloc(sizeof(*owner_list), GFP_KERNEL); + if (!owner_list) + return NULL; + + owner_list->alloced_recs = kcalloc(total_namespaces, + sizeof(struct bch_nvm_alloced_recs *), GFP_KERNEL); + if (!owner_list->alloced_recs) { + kfree(owner_list); + return NULL; + } + + if (owner_uuid) + memcpy(owner_list->owner_uuid, owner_uuid, 16); + if (label) + memcpy(owner_list->label, label, BCH_NVM_PAGES_LABEL_SIZE); + + return owner_list; +} + +static void 
release_extents(struct bch_nvm_alloced_recs *extents) +{ + struct list_head *list = extents->extent_head.next; + struct bch_extent *extent; + + while (list != &extents->extent_head) { + extent = container_of(list, struct bch_extent, list); + list_del(list); + kfree(extent); + list = extents->extent_head.next; + } + kfree(extents); +} + +static void release_owner_info(struct bch_nvm_set *nvm_set) +{ + struct bch_owner_list *owner_list; + int i, j; + + for (i = 0; i < nvm_set->owner_list_used; i++) { + owner_list = nvm_set->owner_lists[i]; + for (j = 0; j < nvm_set->total_namespaces_nr; j++) { + if (owner_list->alloced_recs[j]) + release_extents(owner_list->alloced_recs[j]); + } + kfree(owner_list->alloced_recs); + kfree(owner_list); + } + kfree(nvm_set->owner_lists); +} + +static void release_nvm_namespaces(struct bch_nvm_set *nvm_set) +{ + int i; + + for (i = 0; i < nvm_set->total_namespaces_nr; i++) { + blkdev_put(nvm_set->nss[i]->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC); + kfree(nvm_set->nss[i]); + } + + kfree(nvm_set->nss); +} + +static void release_nvm_set(struct bch_nvm_set *nvm_set) +{ + release_nvm_namespaces(nvm_set); + release_owner_info(nvm_set); + kfree(nvm_set); +} + +static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff) +{ + return ns->kaddr + (pgoff << PAGE_SHIFT); +} + +static int init_owner_info(struct bch_nvm_namespace *ns) +{ + struct bch_owner_list_head *owner_list_head; + struct bch_nvm_pages_owner_head *owner_head; + struct bch_nvm_pgalloc_recs *nvm_pgalloc_recs; + struct bch_owner_list *owner_list; + struct bch_nvm_alloced_recs *extents; + struct bch_extent *extent; + u32 i, j, k; + + owner_list_head = (struct bch_owner_list_head *) + (ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET); + + mutex_lock(&only_set->lock); + only_set->owner_list_size = owner_list_head->size; + only_set->owner_list_used = owner_list_head->used; + + for (i = 0; i < owner_list_head->used; i++) { + owner_head = &owner_list_head->heads[i]; + 
owner_list = alloc_owner_list(owner_head->uuid, owner_head->label, + only_set->total_namespaces_nr); + if (!owner_list) { + mutex_unlock(&only_set->lock); + return -ENOMEM; + } + + for (j = 0; j < only_set->total_namespaces_nr; j++) { + if (!only_set->nss[j] || !owner_head->recs[j]) + continue; + + nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *) + ((long)owner_head->recs[j] + ns->kaddr); + if (memcmp(nvm_pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) { + pr_info("invalid bch_nvm_pages_pgalloc_magic\n"); + mutex_unlock(&only_set->lock); + return -EINVAL; + } + + extents = kzalloc(sizeof(*extents), GFP_KERNEL); + if (!extents) { + mutex_unlock(&only_set->lock); + return -ENOMEM; + } + + extents->ns = only_set->nss[j]; + INIT_LIST_HEAD(&extents->extent_head); + owner_list->alloced_recs[j] = extents; + + do { + struct bch_pgalloc_rec *rec; + + for (k = 0; k < nvm_pgalloc_recs->used; k++) { + rec = &nvm_pgalloc_recs->recs[k]; + extent = kzalloc(sizeof(*extent), GFP_KERNEL); + if (!extent) { + mutex_unlock(&only_set->lock); + return -ENOMEM; + } + extent->kaddr = nvm_pgoff_to_vaddr(extents->ns, rec->pgoff); + extent->nr = rec->nr; + list_add_tail(&extent->list, &extents->extent_head); + } + extents->nr += nvm_pgalloc_recs->used; + + if (nvm_pgalloc_recs->next) { + nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *) + ((long)nvm_pgalloc_recs->next + ns->kaddr); + if (memcmp(nvm_pgalloc_recs->magic, + bch_nvm_pages_pgalloc_magic, 16)) { + pr_info("invalid bch_nvm_pages_pgalloc_magic\n"); + mutex_unlock(&only_set->lock); + return -EINVAL; + } + } else + nvm_pgalloc_recs = NULL; + } while (nvm_pgalloc_recs); + } + only_set->owner_lists[i] = owner_list; + owner_list->nvm_set = only_set; + } + mutex_unlock(&only_set->lock); + + return 0; +} + +static bool attach_nvm_set(struct bch_nvm_namespace *ns) +{ + bool rc = true; + + mutex_lock(&only_set->lock); + if (only_set->nss) { + if (memcmp(ns->sb.set_uuid, only_set->set_uuid, 16)) { + pr_info("namespace id doesn't match
nvm set\n"); + rc = false; + goto unlock; + } + + if (only_set->nss[ns->sb.this_namespace_nr]) { + pr_info("already has the same position(%d) nvm\n", + ns->sb.this_namespace_nr); + rc = false; + goto unlock; + } + } else { + memcpy(only_set->set_uuid, ns->sb.set_uuid, 16); + only_set->total_namespaces_nr = ns->sb.total_namespaces_nr; + only_set->nss = kcalloc(only_set->total_namespaces_nr, + sizeof(struct bch_nvm_namespace *), GFP_KERNEL); + only_set->owner_lists = kcalloc(BCH_MAX_OWNER_LIST, + sizeof(struct nvm_pages_owner_head *), GFP_KERNEL); + if (!only_set->nss || !only_set->owner_lists) { + pr_info("can't alloc nss or owner_list\n"); + kfree(only_set->nss); + kfree(only_set->owner_lists); + rc = false; + goto unlock; + } + } + + only_set->nss[ns->sb.this_namespace_nr] = ns; + +unlock: + mutex_unlock(&only_set->lock); + return rc; +} + +static int read_nvdimm_meta_super(struct block_device *bdev, + struct bch_nvm_namespace *ns) +{ + struct page *page; + struct bch_nvm_pages_sb *sb; + + page = read_cache_page_gfp(bdev->bd_inode->i_mapping, + BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL); + + if (IS_ERR(page)) + return -EIO; + + sb = page_address(page) + offset_in_page(BCH_NVM_PAGES_SB_OFFSET); + memcpy(&ns->sb, sb, sizeof(struct bch_nvm_pages_sb)); + + put_page(page); + + return 0; +} + +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) +{ + struct bch_nvm_namespace *ns; + int err; + pgoff_t pgoff; + char buf[BDEVNAME_SIZE]; + struct block_device *bdev; + uint64_t expected_csum; + int id; + char *path = NULL; + + path = kstrndup(dev_path, 512, GFP_KERNEL); + if (!path) { + pr_err("kstrndup failed\n"); + return ERR_PTR(-ENOMEM); + } + + bdev = blkdev_get_by_path(strim(path), + FMODE_READ|FMODE_WRITE|FMODE_EXEC, + only_set); + if (IS_ERR(bdev)) { + pr_info("get %s error\n", dev_path); + kfree(path); + return ERR_PTR(PTR_ERR(bdev)); + } + + ns = kmalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL); + if (!ns) + goto bdput; + + err = 
-EIO; + if (read_nvdimm_meta_super(bdev, ns)) { + pr_info("%s read nvdimm meta super block failed.\n", + bdevname(bdev, buf)); + goto free_ns; + } + + if (memcmp(ns->sb.magic, bch_nvm_pages_magic, 16)) { + pr_info("invalid bch_nvm_pages_magic\n"); + goto free_ns; + } + + if (ns->sb.sb_offset != BCH_NVM_PAGES_SB_OFFSET) { + pr_info("invalid superblock offset\n"); + goto free_ns; + } + + if (ns->sb.total_namespaces_nr != 1) { + pr_info("only one nvm device is supported\n"); + goto free_ns; + } + + expected_csum = csum_set(&ns->sb); + if (expected_csum != ns->sb.csum) { + pr_info("csum does not match expected value\n"); + goto free_ns; + } + + err = -EOPNOTSUPP; + if (!bdev_dax_supported(bdev, ns->sb.page_size)) { + pr_info("%s doesn't support DAX\n", bdevname(bdev, buf)); + goto free_ns; + } + + err = -EINVAL; + if (bdev_dax_pgoff(bdev, 0, ns->sb.page_size, &pgoff)) { + pr_info("invalid offset of %s\n", bdevname(bdev, buf)); + goto free_ns; + } + + err = -ENOMEM; + ns->dax_dev = fs_dax_get_by_bdev(bdev); + if (!ns->dax_dev) { + pr_info("can't get dax device by %s\n", bdevname(bdev, buf)); + goto free_ns; + } + + err = -EINVAL; + id = dax_read_lock(); + if (dax_direct_access(ns->dax_dev, pgoff, ns->sb.pages_total, + &ns->kaddr, &ns->start_pfn) <= 0) { + pr_info("dax_direct_access error\n"); + dax_read_unlock(id); + goto free_ns; + } + dax_read_unlock(id); + + + err = -EEXIST; + if (!attach_nvm_set(ns)) + goto free_ns; + + ns->page_size = ns->sb.page_size; + ns->pages_offset = ns->sb.pages_offset; + ns->pages_total = ns->sb.pages_total; + ns->free = 0; + ns->bdev = bdev; + ns->nvm_set = only_set; + + mutex_init(&ns->lock); + + if (ns->sb.this_namespace_nr == 0) { + pr_info("only the first namespace contains owner info\n"); + err = init_owner_info(ns); + if (err < 0) { + pr_info("init_owner_info met error %d\n", err); + goto free_ns; + } + } + + kfree(path); + return ns; +free_ns: + kfree(ns); +bdput: + blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC); + kfree(path); + return
ERR_PTR(err); +} +EXPORT_SYMBOL_GPL(bch_register_namespace); + +int __init bch_nvm_init(void) +{ + only_set = kzalloc(sizeof(*only_set), GFP_KERNEL); + if (!only_set) + return -ENOMEM; + + only_set->total_namespaces_nr = 0; + only_set->owner_lists = NULL; + only_set->nss = NULL; + + mutex_init(&only_set->lock); + + pr_info("bcache nvm init\n"); + return 0; +} + +void bch_nvm_exit(void) +{ + release_nvm_set(only_set); + pr_info("bcache nvm exit\n"); +} + +#endif diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h new file mode 100644 index 000000000000..1b10b4b6db0f --- /dev/null +++ b/drivers/md/bcache/nvm-pages.h @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _BCACHE_NVM_PAGES_H +#define _BCACHE_NVM_PAGES_H + +#include + +/* + * Bcache NVDIMM in memory data structures + */ + +/* + * The following three structures in memory records which page(s) allocated + * to which owner. After reboot from power failure, they will be initialized + * based on nvm pages superblock in NVDIMM device. + */ +struct bch_extent { + void *kaddr; + u32 nr; + struct list_head list; +}; + +struct bch_nvm_alloced_recs { + u32 nr; + struct bch_nvm_namespace *ns; + struct list_head extent_head; +}; + +struct bch_owner_list { + u8 owner_uuid[16]; + char label[BCH_NVM_PAGES_LABEL_SIZE]; + + struct bch_nvm_set *nvm_set; + struct bch_nvm_alloced_recs **alloced_recs; +}; + +struct bch_nvm_namespace { + struct bch_nvm_pages_sb sb; + void *kaddr; + + u8 uuid[16]; + u64 free; + u32 page_size; + u64 pages_offset; + u64 pages_total; + pfn_t start_pfn; + + struct dax_device *dax_dev; + struct block_device *bdev; + struct bch_nvm_set *nvm_set; + + struct mutex lock; +}; + +/* + * A set of namespaces. Currently only one set can be supported. 
+ */ +struct bch_nvm_set { + u8 set_uuid[16]; + u32 total_namespaces_nr; + + u32 owner_list_size; + u32 owner_list_used; + struct bch_owner_list **owner_lists; + + struct bch_nvm_namespace **nss; + + struct mutex lock; +}; +extern struct bch_nvm_set *only_set; + +#ifdef CONFIG_BCACHE_NVM_PAGES + +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path); +int bch_nvm_init(void); +void bch_nvm_exit(void); + +#else + +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) +{ + return NULL; +} +static inline int bch_nvm_init(void) +{ + return 0; +} +static inline void bch_nvm_exit(void) { } + +#endif /* CONFIG_BCACHE_NVM_PAGES */ + +#endif /* _BCACHE_NVM_PAGES_H */ diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 0228ccb293fc..915f1ea4dfd9 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -14,6 +14,7 @@ #include "request.h" #include "writeback.h" #include "features.h" +#include "nvm-pages.h" #include #include @@ -2816,6 +2817,7 @@ static void bcache_exit(void) { bch_debug_exit(); bch_request_exit(); + bch_nvm_exit(); if (bcache_kobj) kobject_put(bcache_kobj); if (bcache_wq) @@ -2914,6 +2916,7 @@ static int __init bcache_init(void) bch_debug_init(); closure_debug_init(); + bch_nvm_init(); bcache_is_reboot = false; diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h index 61108bf2a63e..0a6dc4a6e470 100644 --- a/include/uapi/linux/bcache-nvm.h +++ b/include/uapi/linux/bcache-nvm.h @@ -99,13 +99,6 @@ #define BCH_NVM_PAGES_SB_VERSION 0 #define BCH_NVM_PAGES_SB_VERSION_MAX 0 -static const char bch_nvm_pages_magic[] = { - 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83, - 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 }; -static const char bch_nvm_pages_pgalloc_magic[] = { - 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9, - 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae }; - struct bch_pgalloc_rec { __u32 pgoff; __u32 nr; From patchwork Wed Feb 10 05:07:31 
2021 X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079871 From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 09/20] bcache: initialization of the buddy Date: Wed, 10 Feb 2021 13:07:31 +0800 Message-Id: <20210210050742.31237-10-colyli@suse.de> In-Reply-To: <20210210050742.31237-1-colyli@suse.de> From: Jianpeng Ma This nvm pages allocator will implement a simple buddy allocator to manage the nvm address space.
This patch initializes this buddy for a new namespace. The unit of alloc/free in the buddy is a page. DAX devices have their own struct page (in DRAM or PMEM): struct { /* ZONE_DEVICE pages */ /** @pgmap: Points to the hosting device page map. */ struct dev_pagemap *pgmap; void *zone_device_data; /* * ZONE_DEVICE private pages are counted as being * mapped so the next 3 words hold the mapping, index, * and private fields from the source anonymous or * page cache page while the page is migrated to device * private memory. * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also * use the mapping, index, and private fields when * pmem backed DAX files are mapped. */ }; ZONE_DEVICE pages only use pgmap; the other four words (16/32 bytes) are unused. So the second and third words are reused as a 'struct list_head' which links the page into the buddy free lists. The fourth word (normally struct page::index) stores pgoff, the page offset within the dax device. The fifth word (normally struct page::private) stores the buddy order, and page_type will be used to store buddy flags.
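The free-space carving in this patch's `init_nvm_free_space()` walks each clear region of the pages bitmap and repeatedly peels off the largest buddy block that is a power-of-two number of pages, fits in what remains of the region, and is aligned to its own size. A stand-alone sketch of just that order-selection step (not part of the patch; `pick_buddy_order` is a hypothetical name mirroring the patch's inner loop):

```c
#include <stdint.h>

#define BCH_MAX_ORDER 20 /* same bound the patch adds to nvm-pages.h */

/*
 * For a free region starting at page offset pgoff_start with `pages`
 * pages remaining (pages >= 1), return the largest order i such that a
 * block of 1 << i pages starts size-aligned at pgoff_start and does not
 * overrun the region.
 */
int pick_buddy_order(uint64_t pgoff_start, int64_t pages)
{
	int i;

	for (i = BCH_MAX_ORDER - 1; i >= 0; i--) {
		if ((pgoff_start % (1ULL << i)) == 0 && pages >= (1LL << i))
			break;
	}
	return i;
}
```

For example, a free region starting at pgoff 3 with 8 pages would be carved into blocks of 1, 4, 2 and 1 pages (orders 0, 2, 1, 0), each then linked into the matching `free_area[]` list through the page's repurposed `zone_device_data` words.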
Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/nvm-pages.c | 75 ++++++++++++++++++++++++++++++++++- drivers/md/bcache/nvm-pages.h | 5 +++ 2 files changed, 78 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c index 4fa8e2764773..7efb99c0fc07 100644 --- a/drivers/md/bcache/nvm-pages.c +++ b/drivers/md/bcache/nvm-pages.c @@ -93,6 +93,7 @@ static void release_nvm_namespaces(struct bch_nvm_set *nvm_set) int i; for (i = 0; i < nvm_set->total_namespaces_nr; i++) { + kvfree(nvm_set->nss[i]->pages_bitmap); blkdev_put(nvm_set->nss[i]->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC); kfree(nvm_set->nss[i]); } @@ -112,6 +113,17 @@ static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff) return ns->kaddr + (pgoff << PAGE_SHIFT); } +static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr) +{ + return virt_to_page(addr); +} + +static inline void remove_owner_space(struct bch_nvm_namespace *ns, + pgoff_t pgoff, u32 nr) +{ + bitmap_set(ns->pages_bitmap, pgoff, nr); +} + static int init_owner_info(struct bch_nvm_namespace *ns) { struct bch_owner_list_head *owner_list_head; @@ -129,6 +141,8 @@ static int init_owner_info(struct bch_nvm_namespace *ns) only_set->owner_list_size = owner_list_head->size; only_set->owner_list_used = owner_list_head->used; + remove_owner_space(ns, 0, ns->pages_offset/ns->page_size); + for (i = 0; i < owner_list_head->used; i++) { owner_head = &owner_list_head->heads[i]; owner_list = alloc_owner_list(owner_head->uuid, owner_head->label, @@ -162,6 +176,8 @@ static int init_owner_info(struct bch_nvm_namespace *ns) do { struct bch_pgalloc_rec *rec; + int order; + struct page *page; for (k = 0; k < nvm_pgalloc_recs->used; k++) { rec = &nvm_pgalloc_recs->recs[k]; @@ -172,7 +188,17 @@ static int init_owner_info(struct bch_nvm_namespace *ns) } extent->kaddr = nvm_pgoff_to_vaddr(extents->ns, rec->pgoff); extent->nr = 
rec->nr; + WARN_ON(!is_power_of_2(extent->nr)); + + /*init struct page: index/private */ + order = ilog2(extent->nr); + page = nvm_vaddr_to_page(ns, extent->kaddr); + set_page_private(page, order); + page->index = rec->pgoff; + list_add_tail(&extent->list, &extents->extent_head); + /*remove already alloced space*/ + remove_owner_space(extents->ns, rec->pgoff, rec->nr); } extents->nr += nvm_pgalloc_recs->used; @@ -197,6 +223,36 @@ static int init_owner_info(struct bch_nvm_namespace *ns) return 0; } +static void init_nvm_free_space(struct bch_nvm_namespace *ns) +{ + unsigned int start, end, i; + struct page *page; + long long pages; + pgoff_t pgoff_start; + + bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) { + pgoff_start = start; + pages = end - start; + + while (pages) { + for (i = BCH_MAX_ORDER - 1; i >= 0 ; i--) { + if ((pgoff_start % (1 << i) == 0) && (pages >= (1 << i))) + break; + } + + page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start)); + page->index = pgoff_start; + set_page_private(page, i); + __SetPageBuddy(page); + list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]); + + pgoff_start += 1 << i; + pages -= 1 << i; + } + } + +} + static bool attach_nvm_set(struct bch_nvm_namespace *ns) { bool rc = true; @@ -261,7 +317,7 @@ static int read_nvdimm_meta_super(struct block_device *bdev, struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) { struct bch_nvm_namespace *ns; - int err; + int i, err; pgoff_t pgoff; char buf[BDEVNAME_SIZE]; struct block_device *bdev; @@ -357,6 +413,16 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) ns->bdev = bdev; ns->nvm_set = only_set; + ns->pages_bitmap = kvcalloc(BITS_TO_LONGS(ns->pages_total), + sizeof(unsigned long), GFP_KERNEL); + if (!ns->pages_bitmap) { + err = -ENOMEM; + goto free_ns; + } + + for (i = 0; i < BCH_MAX_ORDER; i++) + INIT_LIST_HEAD(&ns->free_area[i]); + mutex_init(&ns->lock); if 
(ns->sb.this_namespace_nr == 0) { @@ -364,12 +430,17 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) err = init_owner_info(ns); if (err < 0) { pr_info("init_owner_info met error %d\n", err); - goto free_ns; + goto free_bitmap; } + /* init buddy allocator */ + init_nvm_free_space(ns); } kfree(path); return ns; + +free_bitmap: + kvfree(ns->pages_bitmap); free_ns: kfree(ns); bdput: diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h index 1b10b4b6db0f..ed3431daae06 100644 --- a/drivers/md/bcache/nvm-pages.h +++ b/drivers/md/bcache/nvm-pages.h @@ -34,6 +34,7 @@ struct bch_owner_list { struct bch_nvm_alloced_recs **alloced_recs; }; +#define BCH_MAX_ORDER 20 struct bch_nvm_namespace { struct bch_nvm_pages_sb sb; void *kaddr; @@ -45,6 +46,10 @@ struct bch_nvm_namespace { u64 pages_total; pfn_t start_pfn; + unsigned long *pages_bitmap; + struct list_head free_area[BCH_MAX_ORDER]; + + struct dax_device *dax_dev; struct block_device *bdev; struct bch_nvm_set *nvm_set; From patchwork Wed Feb 10 05:07:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079869 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3B8C6C4332D for ; Wed, 10 Feb 2021 05:09:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 136F464E2A for ; Wed, 10 Feb 2021 05:09:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
S230497AbhBJFJe (ORCPT ); Wed, 10 Feb 2021 00:09:34 -0500 Received: from mx2.suse.de ([195.135.220.15]:40752 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230470AbhBJFJ3 (ORCPT ); Wed, 10 Feb 2021 00:09:29 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 620DAAFE6; Wed, 10 Feb 2021 05:08:20 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 10/20] bcache: bch_nvm_alloc_pages() of the buddy Date: Wed, 10 Feb 2021 13:07:32 +0800 Message-Id: <20210210050742.31237-11-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Jianpeng Ma This patch implements the bch_nvm_alloc_pages() of the buddy. 
Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/nvm-pages.c | 121 ++++++++++++++++++++++++++++++++++ drivers/md/bcache/nvm-pages.h | 6 ++ 2 files changed, 127 insertions(+) diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c index 7efb99c0fc07..0b992c17ce47 100644 --- a/drivers/md/bcache/nvm-pages.c +++ b/drivers/md/bcache/nvm-pages.c @@ -124,6 +124,127 @@ static inline void remove_owner_space(struct bch_nvm_namespace *ns, bitmap_set(ns->pages_bitmap, pgoff, nr); } +/* If not found, it will create if create == true */ +static struct bch_owner_list *find_owner_list(const char *owner_uuid, bool create) +{ + struct bch_owner_list *owner_list; + int i; + + for (i = 0; i < only_set->owner_list_used; i++) { + if (!memcmp(owner_uuid, only_set->owner_lists[i]->owner_uuid, 16)) + return only_set->owner_lists[i]; + } + + if (create) { + owner_list = alloc_owner_list(owner_uuid, NULL, only_set->total_namespaces_nr); + only_set->owner_lists[only_set->owner_list_used++] = owner_list; + return owner_list; + } else + return NULL; +} + +static struct bch_nvm_alloced_recs *find_nvm_alloced_recs(struct bch_owner_list *owner_list, + struct bch_nvm_namespace *ns, bool create) +{ + int position = ns->sb.this_namespace_nr; + + if (create && !owner_list->alloced_recs[position]) { + struct bch_nvm_alloced_recs *alloced_recs = + kzalloc(sizeof(*alloced_recs), GFP_KERNEL|__GFP_NOFAIL); + + alloced_recs->ns = ns; + INIT_LIST_HEAD(&alloced_recs->extent_head); + owner_list->alloced_recs[position] = alloced_recs; + return alloced_recs; + } else + return owner_list->alloced_recs[position]; +} + +static inline void *extent_end_addr(struct bch_extent *extent) +{ + return extent->kaddr + ((u64)(extent->nr) << PAGE_SHIFT); +} + +static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order) +{ + struct list_head *list = alloced_recs->extent_head.next; + struct bch_extent *extent, *tmp; + void 
*end_addr = addr + (((u64)1 << order) << PAGE_SHIFT); + + while (list != &alloced_recs->extent_head) { + extent = container_of(list, struct bch_extent, list); + if (addr > extent->kaddr) { + list = list->next; + continue; + } + break; + } + + extent = kzalloc(sizeof(*extent), GFP_KERNEL); + extent->kaddr = addr; + extent->nr = 1 << order; + list_add_tail(&extent->list, list); + alloced_recs->nr++; +} + +void *bch_nvm_alloc_pages(int order, const char *owner_uuid) +{ + void *kaddr = NULL; + struct bch_owner_list *owner_list; + struct bch_nvm_alloced_recs *alloced_recs; + int i, j; + + mutex_lock(&only_set->lock); + owner_list = find_owner_list(owner_uuid, true); + + for (j = 0; j < only_set->total_namespaces_nr; j++) { + struct bch_nvm_namespace *ns = only_set->nss[j]; + + if (!ns || (ns->free < (1 << order))) + continue; + + for (i = order; i < BCH_MAX_ORDER; i++) { + struct list_head *list; + struct page *page, *buddy_page; + + if (list_empty(&ns->free_area[i])) + continue; + + list = ns->free_area[i].next; + page = container_of((void *)list, struct page, zone_device_data); + + list_del(list); + + while (i != order) { + buddy_page = nvm_vaddr_to_page(ns, + nvm_pgoff_to_vaddr(ns, page->index + (1 << (i - 1)))); + set_page_private(buddy_page, i - 1); + buddy_page->index = page->index + (1 << (i - 1)); + __SetPageBuddy(buddy_page); + list_add((struct list_head *)&buddy_page->zone_device_data, + &ns->free_area[i - 1]); + i--; + } + + set_page_private(page, order); + __ClearPageBuddy(page); + ns->free -= 1 << order; + kaddr = nvm_pgoff_to_vaddr(ns, page->index); + break; + } + + if (i != BCH_MAX_ORDER) { + alloced_recs = find_nvm_alloced_recs(owner_list, ns, true); + add_extent(alloced_recs, kaddr, order); + break; + } + } + + mutex_unlock(&only_set->lock); + return kaddr; +} +EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages); + static int init_owner_info(struct bch_nvm_namespace *ns) { struct bch_owner_list_head *owner_list_head; diff --git a/drivers/md/bcache/nvm-pages.h 
b/drivers/md/bcache/nvm-pages.h index ed3431daae06..10157d993126 100644 --- a/drivers/md/bcache/nvm-pages.h +++ b/drivers/md/bcache/nvm-pages.h @@ -79,6 +79,7 @@ extern struct bch_nvm_set *only_set; struct bch_nvm_namespace *bch_register_namespace(const char *dev_path); int bch_nvm_init(void); void bch_nvm_exit(void); +void *bch_nvm_alloc_pages(int order, const char *owner_uuid); #else @@ -92,6 +93,11 @@ static inline int bch_nvm_init(void) } static inline void bch_nvm_exit(void) { } +static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid) +{ + return NULL; +} + #endif /* CONFIG_BCACHE_NVM_PAGES */ #endif /* _BCACHE_NVM_PAGES_H */ From patchwork Wed Feb 10 05:07:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079877 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1CD48C433E9 for ; Wed, 10 Feb 2021 05:09:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E5FB764E42 for ; Wed, 10 Feb 2021 05:09:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230303AbhBJFJk (ORCPT ); Wed, 10 Feb 2021 00:09:40 -0500 Received: from mx2.suse.de ([195.135.220.15]:40762 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230479AbhBJFJa (ORCPT ); Wed, 10 Feb 2021 00:09:30 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de 
(Postfix) with ESMTP id 68B07B087; Wed, 10 Feb 2021 05:08:22 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 11/20] bcache: bch_nvm_free_pages() of the buddy Date: Wed, 10 Feb 2021 13:07:33 +0800 Message-Id: <20210210050742.31237-12-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Jianpeng Ma This patch implements the bch_nvm_free_pages() of the buddy. Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/nvm-pages.c | 143 ++++++++++++++++++++++++++++++++-- drivers/md/bcache/nvm-pages.h | 3 + 2 files changed, 138 insertions(+), 8 deletions(-) diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c index 0b992c17ce47..b40bdbac873f 100644 --- a/drivers/md/bcache/nvm-pages.c +++ b/drivers/md/bcache/nvm-pages.c @@ -168,8 +168,7 @@ static inline void *extent_end_addr(struct bch_extent *extent) static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order) { struct list_head *list = alloced_recs->extent_head.next; - struct bch_extent *extent, *tmp; - void *end_addr = addr + (((u64)1 << order) << PAGE_SHIFT); + struct bch_extent *extent; while (list != &alloced_recs->extent_head) { extent = container_of(list, struct bch_extent, list); @@ -187,6 +186,136 @@ static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, in alloced_recs->nr++; } +static inline void *nvm_end_addr(struct bch_nvm_namespace *ns) +{ + return ns->kaddr + (ns->pages_total << PAGE_SHIFT); +} + +static inline bool in_nvm_range(struct bch_nvm_namespace *ns, + void *start_addr, void *end_addr) +{ + return (start_addr >= ns->kaddr) && (end_addr <= nvm_end_addr(ns)); +} + +static struct 
bch_nvm_namespace *find_nvm_by_addr(void *addr, int order) +{ + int i; + struct bch_nvm_namespace *ns; + + for (i = 0; i < only_set->total_namespaces_nr; i++) { + ns = only_set->nss[i]; + if (ns && in_nvm_range(ns, addr, addr + (1 << order))) + return ns; + } + return NULL; +} + +static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order) +{ + struct list_head *list = alloced_recs->extent_head.next; + struct bch_extent *extent; + + while (list != &alloced_recs->extent_head) { + extent = container_of(list, struct bch_extent, list); + + if (addr < extent->kaddr) + return -ENOENT; + if (addr > extent->kaddr) { + list = list->next; + continue; + } + + WARN_ON(extent->nr != (1 << order)); + list_del(list); + kfree(extent); + alloced_recs->nr--; + break; + } + return (list == &alloced_recs->extent_head) ? -ENOENT : 0; +} + +static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order) +{ + unsigned int add_pages = (1 << order); + pgoff_t pgoff; + struct page *page; + + page = nvm_vaddr_to_page(ns, addr); + WARN_ON((!page) || (page->private != order)); + pgoff = page->index; + + while (order < BCH_MAX_ORDER - 1) { + struct page *buddy_page; + + pgoff_t buddy_pgoff = pgoff ^ (1 << order); + pgoff_t parent_pgoff = pgoff & ~(1 << order); + + if ((parent_pgoff + (1 << (order + 1)) > ns->pages_total)) + break; + + buddy_page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, buddy_pgoff)); + WARN_ON(!buddy_page); + + if (PageBuddy(buddy_page) && (buddy_page->private == order)) { + list_del((struct list_head *)&buddy_page->zone_device_data); + __ClearPageBuddy(buddy_page); + pgoff = parent_pgoff; + order++; + continue; + } + break; + } + + page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff)); + WARN_ON(!page); + list_add((struct list_head *)&page->zone_device_data, &ns->free_area[order]); + page->index = pgoff; + set_page_private(page, order); + __SetPageBuddy(page); + ns->free += add_pages; +} + +void bch_nvm_free_pages(void 
*addr, int order, const char *owner_uuid) +{ + struct bch_nvm_namespace *ns; + struct bch_owner_list *owner_list; + struct bch_nvm_alloced_recs *alloced_recs; + int r; + + mutex_lock(&only_set->lock); + + ns = find_nvm_by_addr(addr, order); + if (!ns) { + pr_info("can't find nvm_dev by kaddr %p\n", addr); + goto unlock; + } + + owner_list = find_owner_list(owner_uuid, false); + if (!owner_list) { + pr_info("can't found owner(uuid=%s)\n", owner_uuid); + goto unlock; + } + + alloced_recs = find_nvm_alloced_recs(owner_list, ns, false); + if (!alloced_recs) { + pr_info("can't find alloced_recs(uuid=%s)\n", ns->uuid); + goto unlock; + } + + r = remove_extent(alloced_recs, addr, order); + if (r < 0) { + pr_info("can't find extent\n"); + goto unlock; + } + + __free_space(ns, addr, order); + +unlock: + mutex_unlock(&only_set->lock); +} +EXPORT_SYMBOL_GPL(bch_nvm_free_pages); + + void *bch_nvm_alloc_pages(int order, const char *owner_uuid) { void *kaddr = NULL; @@ -276,7 +405,6 @@ static int init_owner_info(struct bch_nvm_namespace *ns) for (j = 0; j < only_set->total_namespaces_nr; j++) { if (!only_set->nss[j] || !owner_head->recs[j]) continue; - nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *) ((long)owner_head->recs[j] + ns->kaddr); if (memcmp(nvm_pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) { @@ -348,7 +476,7 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns) { unsigned int start, end, i; struct page *page; - long long pages; + u64 pages; pgoff_t pgoff_start; bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) { @@ -364,9 +492,8 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns) page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start)); page->index = pgoff_start; set_page_private(page, i); - __SetPageBuddy(page); - list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]); - + /* in order to update ns->free */ + __free_space(ns, nvm_pgoff_to_vaddr(ns, pgoff_start), i); pgoff_start 
+= 1 << i; pages -= 1 << i; } @@ -530,7 +657,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path) ns->page_size = ns->sb.page_size; ns->pages_offset = ns->sb.pages_offset; ns->pages_total = ns->sb.pages_total; - ns->free = 0; + ns->free = 0; /* increased by __free_space() */ ns->bdev = bdev; ns->nvm_set = only_set; diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h index 10157d993126..1bc3129f2482 100644 --- a/drivers/md/bcache/nvm-pages.h +++ b/drivers/md/bcache/nvm-pages.h @@ -80,6 +80,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path); int bch_nvm_init(void); void bch_nvm_exit(void); void *bch_nvm_alloc_pages(int order, const char *owner_uuid); +void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid); #else @@ -98,6 +99,8 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid) return NULL; } +static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { } + #endif /* CONFIG_BCACHE_NVM_PAGES */ #endif /* _BCACHE_NVM_PAGES_H */ From patchwork Wed Feb 10 05:07:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079875 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86DA1C433E6 for ; Wed, 10 Feb 2021 05:09:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6113A64E4B for ; Wed, 10 Feb 2021 05:09:41 +0000 (UTC) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230226AbhBJFJi (ORCPT ); Wed, 10 Feb 2021 00:09:38 -0500 Received: from mx2.suse.de ([195.135.220.15]:40764 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230475AbhBJFJa (ORCPT ); Wed, 10 Feb 2021 00:09:30 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 731EAB0AE; Wed, 10 Feb 2021 05:08:24 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 12/20] bcache: get allocated pages from specific owner Date: Wed, 10 Feb 2021 13:07:34 +0800 Message-Id: <20210210050742.31237-13-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Jianpeng Ma This patch implements bch_get_allocated_pages() of the buddy to be used to get allocated pages from specific owner. 
Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/nvm-pages.c | 39 +++++++++++++++++++++++++++++++++++ drivers/md/bcache/nvm-pages.h | 6 ++++++ 2 files changed, 45 insertions(+) diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c index b40bdbac873f..2b079a277e88 100644 --- a/drivers/md/bcache/nvm-pages.c +++ b/drivers/md/bcache/nvm-pages.c @@ -374,6 +374,45 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid) } EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages); +struct bch_extent *bch_get_allocated_pages(const char *owner_uuid) +{ + struct bch_owner_list *owner_list = find_owner_list(owner_uuid, false); + struct bch_nvm_alloced_recs *alloced_recs; + struct bch_extent *head = NULL, *e, *tmp; + int i; + + if (!owner_list) + return NULL; + + for (i = 0; i < only_set->total_namespaces_nr; i++) { + struct list_head *l; + + alloced_recs = owner_list->alloced_recs[i]; + + if (!alloced_recs || alloced_recs->nr == 0) + continue; + + l = alloced_recs->extent_head.next; + while (l != &alloced_recs->extent_head) { + e = container_of(l, struct bch_extent, list); + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL|__GFP_NOFAIL); + + INIT_LIST_HEAD(&tmp->list); + tmp->kaddr = e->kaddr; + tmp->nr = e->nr; + + if (head) + list_add_tail(&tmp->list, &head->list); + else + head = tmp; + + l = l->next; + } + } + return head; +} +EXPORT_SYMBOL_GPL(bch_get_allocated_pages); + static int init_owner_info(struct bch_nvm_namespace *ns) { struct bch_owner_list_head *owner_list_head; diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h index 1bc3129f2482..8ffae11c7c61 100644 --- a/drivers/md/bcache/nvm-pages.h +++ b/drivers/md/bcache/nvm-pages.h @@ -81,6 +81,7 @@ int bch_nvm_init(void); void bch_nvm_exit(void); void *bch_nvm_alloc_pages(int order, const char *owner_uuid); void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid); +struct bch_extent *bch_get_allocated_pages(const char *owner_uuid); 
#else @@ -101,6 +102,11 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid) static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { } +static inline struct bch_extent *bch_get_allocated_pages(const char *owner_uuid) +{ + return NULL; +} + #endif /* CONFIG_BCACHE_NVM_PAGES */ #endif /* _BCACHE_NVM_PAGES_H */ From patchwork Wed Feb 10 05:07:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079883 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BF88C433E0 for ; Wed, 10 Feb 2021 05:09:48 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0135864E42 for ; Wed, 10 Feb 2021 05:09:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230429AbhBJFJo (ORCPT ); Wed, 10 Feb 2021 00:09:44 -0500 Received: from mx2.suse.de ([195.135.220.15]:40828 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230486AbhBJFJd (ORCPT ); Wed, 10 Feb 2021 00:09:33 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 79D38B0B6; Wed, 10 Feb 2021 05:08:26 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Jianpeng Ma , Qiaowei Ren , Coly Li Subject: [PATCH 13/20] bcache: persist owner info when alloc/free pages. 
Date: Wed, 10 Feb 2021 13:07:35 +0800 Message-Id: <20210210050742.31237-14-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org From: Jianpeng Ma This patch implement persist owner info on nvdimm device when alloc/free pages. Signed-off-by: Jianpeng Ma Co-authored-by: Qiaowei Ren Signed-off-by: Coly Li --- drivers/md/bcache/nvm-pages.c | 93 ++++++++++++++++++++++++++++++++++- 1 file changed, 92 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c index 2b079a277e88..c350dcd696dd 100644 --- a/drivers/md/bcache/nvm-pages.c +++ b/drivers/md/bcache/nvm-pages.c @@ -210,6 +210,19 @@ static struct bch_nvm_namespace *find_nvm_by_addr(void *addr, int order) return NULL; } +static void init_pgalloc_recs(struct bch_nvm_pgalloc_recs *recs, const char *owner_uuid) +{ + memset(recs, 0, sizeof(struct bch_nvm_pgalloc_recs)); + memcpy(recs->magic, bch_nvm_pages_pgalloc_magic, 16); + memcpy(recs->owner_uuid, owner_uuid, 16); + recs->size = BCH_MAX_RECS; +} + +static pgoff_t vaddr_to_nvm_pgoff(struct bch_nvm_namespace *ns, void *kaddr) +{ + return (kaddr - ns->kaddr) / PAGE_SIZE; +} + static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order) { struct list_head *list = alloced_recs->extent_head.next; @@ -234,6 +247,82 @@ static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, return (list == &alloced_recs->extent_head) ? 
-ENOENT : 0; } +#define BCH_RECS_LEN (sizeof(struct bch_nvm_pgalloc_recs)) + +static void write_owner_info(void) +{ + struct bch_owner_list *owner_list; + struct bch_nvm_pgalloc_recs *recs; + struct bch_nvm_namespace *ns = only_set->nss[0]; + struct bch_owner_list_head *owner_list_head; + struct bch_nvm_pages_owner_head *owner_head; + u64 recs_pos = BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET; + struct list_head *list; + int i, j; + + owner_list_head = kzalloc(sizeof(*owner_list_head), GFP_KERNEL); + recs = kmalloc(sizeof(*recs), GFP_KERNEL); + if (!owner_list_head || !recs) { + pr_info("can't alloc memory\n"); + goto free_resouce; + } + + owner_list_head->size = BCH_MAX_OWNER_LIST; + WARN_ON(only_set->owner_list_used > owner_list_head->size); + + // in-memory owner maybe not contain alloced-pages. + for (i = 0; i < only_set->owner_list_used; i++) { + owner_head = &owner_list_head->heads[i]; + owner_list = only_set->owner_lists[i]; + + memcpy(owner_head->uuid, owner_list->owner_uuid, 16); + + for (j = 0; j < only_set->total_namespaces_nr; j++) { + struct bch_nvm_alloced_recs *extents = owner_list->alloced_recs[j]; + + if (!extents || !extents->nr) + continue; + + init_pgalloc_recs(recs, owner_list->owner_uuid); + + BUG_ON(recs_pos >= BCH_NVM_PAGES_OFFSET); + owner_head->recs[j] = (struct bch_nvm_pgalloc_recs *)(uintptr_t)recs_pos; + + for (list = extents->extent_head.next; + list != &extents->extent_head; + list = list->next) { + struct bch_extent *extent; + + extent = container_of(list, struct bch_extent, list); + + if (recs->used == recs->size) { + BUG_ON(recs_pos >= BCH_NVM_PAGES_OFFSET); + recs->next = (struct bch_nvm_pgalloc_recs *) + (uintptr_t)(recs_pos + BCH_RECS_LEN); + memcpy_flushcache(ns->kaddr + recs_pos, recs, BCH_RECS_LEN); + init_pgalloc_recs(recs, owner_list->owner_uuid); + recs_pos += BCH_RECS_LEN; + } + + recs->recs[recs->used].pgoff = + vaddr_to_nvm_pgoff(only_set->nss[j], extent->kaddr); + recs->recs[recs->used].nr = extent->nr; + recs->used++; + } + + 
memcpy_flushcache(ns->kaddr + recs_pos, recs, BCH_RECS_LEN); + recs_pos += sizeof(struct bch_nvm_pgalloc_recs); + } + } + + owner_list_head->used = only_set->owner_list_used; + memcpy_flushcache(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET, + (void *)owner_list_head, sizeof(struct bch_owner_list_head)); +free_resouce: + kfree(owner_list_head); + kfree(recs); +} + static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order) { unsigned int add_pages = (1 << order); @@ -309,6 +398,7 @@ void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) } __free_space(ns, addr, order); + write_owner_info(); unlock: mutex_unlock(&only_set->lock); @@ -368,7 +458,8 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid) break; } } - + if (kaddr) + write_owner_info(); mutex_unlock(&only_set->lock); return kaddr; } From patchwork Wed Feb 10 05:07:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079879 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 766ACC433E0 for ; Wed, 10 Feb 2021 05:09:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 52F3064E45 for ; Wed, 10 Feb 2021 05:09:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230489AbhBJFJl (ORCPT ); Wed, 10 Feb 2021 00:09:41 -0500 Received: from mx2.suse.de ([195.135.220.15]:40830 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org 
with ESMTP id S230484AbhBJFJd (ORCPT ); Wed, 10 Feb 2021 00:09:33 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 8287CB0B3; Wed, 10 Feb 2021 05:08:28 +0000 (UTC) From: Coly Li To: axboe@kernel.dk Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li , Jianpeng Ma , Qiaowei Ren Subject: [PATCH 14/20] bcache: use bucket index for SET_GC_MARK() in bch_btree_gc_finish() Date: Wed, 10 Feb 2021 13:07:36 +0800 Message-Id: <20210210050742.31237-15-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210210050742.31237-1-colyli@suse.de> References: <20210210050742.31237-1-colyli@suse.de> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Currently the meta data bucket locations on the cache device are reserved after the meta data stored on NVDIMM pages, to keep the meta data layout consistent temporarily. So these buckets are still marked as meta data by SET_GC_MARK() in bch_btree_gc_finish(). When BCH_FEATURE_INCOMPAT_NVDIMM_META is set, sb.d[] stores linear addresses of NVDIMM pages and not bucket indexes anymore. Therefore we should avoid looking up bucket indexes from sb.d[], and directly use the bucket indexes from ca->sb.first_bucket to (ca->sb.first_bucket + ca->sb.njournal_buckets) for setting the gc mark of the journal buckets. 
Signed-off-by: Coly Li
Cc: Jianpeng Ma
Cc: Qiaowei Ren
---
 drivers/md/bcache/btree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fe6dce125aba..28edd884bd5d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1761,8 +1761,10 @@ static void bch_btree_gc_finish(struct cache_set *c)
 	ca = c->cache;
 	ca->invalidate_needs_gc = 0;
 
-	for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
-		SET_GC_MARK(ca->buckets + *k, GC_MARK_METADATA);
+	/* Range [first_bucket, first_bucket + njournal_buckets) is for journal buckets */
+	for (i = ca->sb.first_bucket;
+	     i < ca->sb.first_bucket + ca->sb.njournal_buckets; i++)
+		SET_GC_MARK(ca->buckets + i, GC_MARK_METADATA);
 
 	for (k = ca->prio_buckets;
 	     k < ca->prio_buckets + prio_buckets(ca) * 2; k++)

From patchwork Wed Feb 10 05:07:37 2021

From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 15/20] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
Date: Wed, 10 Feb 2021 13:07:37 +0800
Message-Id: <20210210050742.31237-16-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

This patch adds BCH_FEATURE_INCOMPAT_NVDIMM_META (value 0x0004) into the incompat feature set. When this bit is set by bcache-tools, it indicates that the bcache meta data should be stored on a specific NVDIMM meta device. The bcache meta data mainly includes the journal and btree nodes; when this bit is set in the incompat feature set, bcache will ask the nvm-pages allocator for NVDIMM space to store the meta data.
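A minimal sketch of how an incompat bit like this gates cache-set compatibility (the feature values mirror the patch; the helper name is ours, not bcache's):

```c
#include <assert.h>
#include <stdint.h>

#define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET		0x0001
#define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE	0x0002
#define BCH_FEATURE_INCOMPAT_NVDIMM_META		0x0004

/* A kernel refuses a cache set whose superblock carries any incompat
 * bit outside the supported mask; adding 0x0004 to the mask (only when
 * CONFIG_BCACHE_NVM_PAGES is enabled) is what makes NVDIMM meta usable. */
static int has_unknown_incompat(uint64_t sb_incompat, uint64_t supp_mask)
{
	return (sb_incompat & ~supp_mask) != 0;
}
```

This is why the patch extends BCH_FEATURE_INCOMPAT_SUPP only under CONFIG_BCACHE_NVM_PAGES: a kernel built without nvm-pages support must reject such a superblock rather than misread sb.d[].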
Signed-off-by: Coly Li
Cc: Jianpeng Ma
Cc: Qiaowei Ren
---
 drivers/md/bcache/features.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index d1c8fd3977fc..333fb5efb6bd 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -17,11 +17,19 @@
 #define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET	0x0001
 /* real bucket size is (1 << bucket_size) */
 #define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE	0x0002
+/* store bcache meta data on nvdimm */
+#define BCH_FEATURE_INCOMPAT_NVDIMM_META	0x0004
 
 #define BCH_FEATURE_COMPAT_SUPP		0
 #define BCH_FEATURE_RO_COMPAT_SUPP	0
+#ifdef CONFIG_BCACHE_NVM_PAGES
+#define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
+					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE| \
+					 BCH_FEATURE_INCOMPAT_NVDIMM_META)
+#else
 #define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
 					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
+#endif
 
 #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
 		((sb)->feature_compat & (mask))
@@ -89,6 +97,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
 
 BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
 BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
+BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
 
 static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
 {

From patchwork Wed Feb 10 05:07:38 2021

From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 16/20] bcache: initialize bcache journal for NVDIMM meta device
Date: Wed, 10 Feb 2021 13:07:38 +0800
Message-Id: <20210210050742.31237-17-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

The nvm-pages allocator may store and index the NVDIMM pages allocated for the bcache journal. This patch adds the initialization to store bcache journal space on NVDIMM pages if the BCH_FEATURE_INCOMPAT_NVDIMM_META bit is set by bcache-tools.

If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space() will return the linear address of the NVDIMM pages for the bcache journal:
- If there is previously allocated space, find it from the nvm-pages owner list and return it to bch_journal_init().
- If there is no previously allocated space, request a new NVDIMM range from the nvm-pages allocator, and return it to bch_journal_init().
And in bch_journal_init(), the corresponding linear addresses from NVDIMM are stored into sb.d[i], where 'i' is the bucket index used to iterate all journal buckets. Later, when the bcache journaling code stores a jset, the target NVDIMM linear address taken from sb.d[i] (and updated in the journal key) can be used directly in the memory copy from DRAM pages into NVDIMM pages.

Signed-off-by: Coly Li
Cc: Jianpeng Ma
Cc: Qiaowei Ren
---
 drivers/md/bcache/journal.c | 97 +++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/journal.h |  2 +-
 drivers/md/bcache/super.c   | 16 +++---
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c6613e817333..1f16d8e497cf 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -9,6 +9,8 @@
 #include "btree.h"
 #include "debug.h"
 #include "extents.h"
+#include "nvm-pages.h"
+#include "features.h"
 
 #include
@@ -982,3 +984,98 @@ int bch_journal_alloc(struct cache_set *c)
 
 	return 0;
 }
+
+static void *find_journal_nvm_base(struct bch_extent *list, struct cache *ca)
+{
+	void *ret = NULL;
+	struct bch_extent *cur, *next;
+
+	next = list;
+	do {
+		cur = next;
+		/* Match journal area's nvdimm address */
+		if (cur->kaddr == (void *)ca->sb.d[0]) {
+			ret = cur->kaddr;
+			break;
+		}
+		next = list_entry(cur->list.next, struct bch_extent, list);
+	} while (next != list);
+
+	return ret;
+}
+
+static void bch_release_nvm_extent_list(struct bch_extent *list)
+{
+	struct bch_extent *ext;
+	struct list_head *cur, *next;
+
+	list_for_each_safe(cur, next, &list->list) {
+		ext = list_entry(cur, struct bch_extent, list);
+		kfree(ext);
+	}
+}
+
+static void *get_nvdimm_journal_space(struct cache *ca)
+{
+	struct bch_extent *allocated_list = NULL;
+	void *ret = NULL;
+
+	allocated_list = bch_get_allocated_pages(ca->sb.set_uuid);
+	if (allocated_list) {
+		ret = find_journal_nvm_base(allocated_list, ca);
+		bch_release_nvm_extent_list(allocated_list);
+	}
+
+	if (!ret) {
+		int order = ilog2(ca->sb.bucket_size * ca->sb.njournal_buckets /
+				  PAGE_SECTORS);
+
+		ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
+		if (ret)
+			memset(ret, 0, (1 << order) * PAGE_SIZE);
+	}
+
+	return ret;
+}
+
+static int __bch_journal_nvdimm_init(struct cache *ca)
+{
+	int i, ret = 0;
+	void *journal_nvm_base = NULL;
+
+	journal_nvm_base = get_nvdimm_journal_space(ca);
+	if (!journal_nvm_base) {
+		pr_err("Failed to get journal space from nvdimm\n");
+		ret = -1;
+		goto out;
+	}
+
+	/* Initialized and reloaded from on-disk super block already */
+	if (ca->sb.d[0] != 0)
+		goto out;
+
+	for (i = 0; i < ca->sb.keys; i++)
+		ca->sb.d[i] =
+			(u64)(journal_nvm_base + (ca->sb.bucket_size * i));
+
+out:
+	return ret;
+}
+
+int bch_journal_init(struct cache_set *c)
+{
+	int i, ret = 0;
+	struct cache *ca = c->cache;
+
+	ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
+			      2, SB_JOURNAL_BUCKETS);
+
+	if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
+		for (i = 0; i < ca->sb.keys; i++)
+			ca->sb.d[i] = ca->sb.first_bucket + i;
+	} else {
+		ret = __bch_journal_nvdimm_init(ca);
+	}
+
+	return ret;
+}
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..e3a7fa5a8fda 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
 void bch_journal_meta(struct cache_set *c, struct closure *cl);
 int bch_journal_read(struct cache_set *c, struct list_head *list);
 int bch_journal_replay(struct cache_set *c, struct list_head *list);
-
+int bch_journal_init(struct cache_set *c);
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 915f1ea4dfd9..57c96c16ee16 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -146,10 +146,15 @@ static const char *read_super_common(struct cache_sb *sb, struct block_device *
 		goto err;
 
 	err =
"Journal buckets not sequential";
+#ifdef CONFIG_BCACHE_NVM_PAGES
+	if (!bch_has_feature_nvdimm_meta(sb)) {
+#endif
 	for (i = 0; i < sb->keys; i++)
 		if (sb->d[i] != sb->first_bucket + i)
 			goto err;
-
+#ifdef CONFIG_BCACHE_NVM_PAGES
+	} /* bch_has_feature_nvdimm_meta */
+#endif
 	err = "Too many journal buckets";
 	if (sb->first_bucket + sb->keys > sb->nbuckets)
 		goto err;
@@ -2072,14 +2077,11 @@ static int run_cache_set(struct cache_set *c)
 		if (bch_journal_replay(c, &journal))
 			goto err;
 	} else {
-		unsigned int j;
-
 		pr_notice("invalidating existing data\n");
-		ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
-				      2, SB_JOURNAL_BUCKETS);
-		for (j = 0; j < ca->sb.keys; j++)
-			ca->sb.d[j] = ca->sb.first_bucket + j;
+		err = "error initializing journal";
+		if (bch_journal_init(c))
+			goto err;
 
 		bch_initial_gc_finish(c);

From patchwork Wed Feb 10 05:07:39 2021

From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 17/20] bcache: support storing bcache journal into NVDIMM meta device
Date: Wed, 10 Feb 2021 13:07:39 +0800
Message-Id: <20210210050742.31237-18-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

This patch implements two methods to store the bcache journal:

1) __journal_write_unlocked() for a block interface device
   The legacy method: compose a bio and issue the jset bio to the cache
   device (e.g. SSD). c->journal.key.ptr[0] indicates the LBA on the
   cache device where the journal jset is stored.

2) __journal_nvdimm_write_unlocked() for a memory interface NVDIMM
   Use the memory interface to access NVDIMM pages and store the jset
   by memcpy_flushcache(). c->journal.key.ptr[0] indicates the linear
   address of the NVDIMM pages where the journal jset is stored.

For a legacy configuration without an NVDIMM meta device, journal I/O is handled by __journal_write_unlocked() with the existing code logic. If the NVDIMM meta device is used (set up by bcache-tools), journal I/O will be handled by __journal_nvdimm_write_unlocked() and go into the NVDIMM pages.

And when the NVDIMM meta device is used, sb.d[] stores the linear addresses of NVDIMM pages (no longer bucket indexes), so in journal_reclaim() the journaling location in c->journal.key.ptr[0] should also be updated with the linear address of the NVDIMM pages (no longer an LBA combined from a sector offset and a bucket index).
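The two addressing modes above can be sketched in user space as follows (a simplification under our own helper names; bucket_to_sector() is reduced to plain multiplication, and the NVDIMM cursor advance mirrors the `sectors << 9` update in the write path):

```c
#include <assert.h>
#include <stdint.h>

/* Legacy block device: the journal key holds an LBA (in 512-byte
 * sectors) derived from the bucket index stored in sb.d[]. */
static uint64_t journal_key_block(uint64_t bucket_idx,
				  uint64_t bucket_size_sectors)
{
	return bucket_idx * bucket_size_sectors;
}

/* NVDIMM meta device: sb.d[] already holds a linear address, so the
 * key takes it verbatim ... */
static uint64_t journal_key_nvdimm(uint64_t sb_d_entry)
{
	return sb_d_entry;
}

/* ... and after writing a jset of 'sectors' sectors, the write cursor
 * advances by sectors << 9 bytes, as __journal_nvdimm_write_unlocked()
 * does with c->journal.key.ptr[0]. */
static uint64_t journal_nvdimm_advance(uint64_t addr, unsigned int sectors)
{
	return addr + ((uint64_t)sectors << 9);
}
```

The key observation is that ptr[0] changes meaning per device type, which is why journal_reclaim() must branch on bch_has_feature_nvdimm_meta().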
Signed-off-by: Coly Li Cc: Jianpeng Ma Cc: Qiaowei Ren --- drivers/md/bcache/journal.c | 111 ++++++++++++++++++++++++------------ 1 file changed, 75 insertions(+), 36 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 1f16d8e497cf..b242fcb47ce2 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -596,6 +596,8 @@ static void do_journal_discard(struct cache *ca) return; } + BUG_ON(bch_has_feature_nvdimm_meta(&ca->sb)); + switch (atomic_read(&ja->discard_in_flight)) { case DISCARD_IN_FLIGHT: return; @@ -661,9 +663,13 @@ static void journal_reclaim(struct cache_set *c) goto out; ja->cur_idx = next; - k->ptr[0] = MAKE_PTR(0, - bucket_to_sector(c, ca->sb.d[ja->cur_idx]), - ca->sb.nr_this_dev); + if (!bch_has_feature_nvdimm_meta(&ca->sb)) + k->ptr[0] = MAKE_PTR(0, + bucket_to_sector(c, ca->sb.d[ja->cur_idx]), + ca->sb.nr_this_dev); + else + k->ptr[0] = ca->sb.d[ja->cur_idx]; + atomic_long_inc(&c->reclaimed_journal_buckets); bkey_init(k); @@ -729,46 +735,21 @@ static void journal_write_unlock(struct closure *cl) spin_unlock(&c->journal.lock); } -static void journal_write_unlocked(struct closure *cl) + +static void __journal_write_unlocked(struct cache_set *c) __releases(c->journal.lock) { - struct cache_set *c = container_of(cl, struct cache_set, journal.io); - struct cache *ca = c->cache; - struct journal_write *w = c->journal.cur; struct bkey *k = &c->journal.key; - unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) * - ca->sb.block_size; - + struct journal_write *w = c->journal.cur; + struct closure *cl = &c->journal.io; + struct cache *ca = c->cache; struct bio *bio; struct bio_list list; + unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) * + ca->sb.block_size; bio_list_init(&list); - if (!w->need_write) { - closure_return_with_destructor(cl, journal_write_unlock); - return; - } else if (journal_full(&c->journal)) { - journal_reclaim(c); - spin_unlock(&c->journal.lock); - - 
btree_flush_write(c); - continue_at(cl, journal_write, bch_journal_wq); - return; - } - - c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca)); - - w->data->btree_level = c->root->level; - - bkey_copy(&w->data->btree_root, &c->root->key); - bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket); - - w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0]; - w->data->magic = jset_magic(&ca->sb); - w->data->version = BCACHE_JSET_VERSION; - w->data->last_seq = last_seq(&c->journal); - w->data->csum = csum_set(w->data); - for (i = 0; i < KEY_PTRS(k); i++) { ca = PTR_CACHE(c, k, i); bio = &ca->journal.bio; @@ -793,7 +774,6 @@ static void journal_write_unlocked(struct closure *cl) ca->journal.seq[ca->journal.cur_idx] = w->data->seq; } - /* If KEY_PTRS(k) == 0, this jset gets lost in air */ BUG_ON(i == 0); @@ -805,6 +785,65 @@ static void journal_write_unlocked(struct closure *cl) while ((bio = bio_list_pop(&list))) closure_bio_submit(c, bio, cl); +} + +static void __journal_nvdimm_write_unlocked(struct cache_set *c) + __releases(c->journal.lock) +{ + struct journal_write *w = c->journal.cur; + struct cache *ca = c->cache; + unsigned int sectors; + + sectors = set_blocks(w->data, block_bytes(ca)) * ca->sb.block_size; + atomic_long_add(sectors, &ca->meta_sectors_written); + + memcpy_flushcache((void *)c->journal.key.ptr[0], w->data, sectors << 9); + + c->journal.key.ptr[0] += sectors << 9; + ca->journal.seq[ca->journal.cur_idx] = w->data->seq; + + atomic_dec_bug(&fifo_back(&c->journal.pin)); + bch_journal_next(&c->journal); + journal_reclaim(c); + + spin_unlock(&c->journal.lock); +} + +static void journal_write_unlocked(struct closure *cl) +{ + struct cache_set *c = container_of(cl, struct cache_set, journal.io); + struct cache *ca = c->cache; + struct journal_write *w = c->journal.cur; + + if (!w->need_write) { + closure_return_with_destructor(cl, journal_write_unlock); + return; + } else if (journal_full(&c->journal)) { + journal_reclaim(c); + 
spin_unlock(&c->journal.lock); + + btree_flush_write(c); + continue_at(cl, journal_write, bch_journal_wq); + return; + } + + c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca)); + + w->data->btree_level = c->root->level; + + bkey_copy(&w->data->btree_root, &c->root->key); + bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket); + + w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0]; + w->data->magic = jset_magic(&ca->sb); + w->data->version = BCACHE_JSET_VERSION; + w->data->last_seq = last_seq(&c->journal); + w->data->csum = csum_set(w->data); + + if (!bch_has_feature_nvdimm_meta(&ca->sb)) + __journal_write_unlocked(c); + else + __journal_nvdimm_write_unlocked(c); continue_at(cl, journal_write_done, NULL); } From patchwork Wed Feb 10 05:07:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 12079895 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4A47FC433E0 for ; Wed, 10 Feb 2021 05:10:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 23A4E64E2A for ; Wed, 10 Feb 2021 05:10:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231128AbhBJFKQ (ORCPT ); Wed, 10 Feb 2021 00:10:16 -0500 Received: from mx2.suse.de ([195.135.220.15]:41498 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230332AbhBJFKM (ORCPT ); Wed, 10 Feb 2021 00:10:12 -0500 X-Virus-Scanned: by amavisd-new at 
From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 18/20] bcache: read jset from NVDIMM pages for journal replay
Date: Wed, 10 Feb 2021 13:07:40 +0800
Message-Id: <20210210050742.31237-19-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

This patch implements two methods to read the jset from media for journal replay:
- __jnl_rd_bkt() for a block device
  This is the legacy method to read the jset via the block device interface.
- __jnl_rd_nvm_bkt() for NVDIMM
  This is the method to read the jset via the NVDIMM memory interface, i.e. memcpy() from NVDIMM pages to DRAM pages.

If BCH_FEATURE_INCOMPAT_NVDIMM_META is set in the incompat feature set, journal_read_bucket() of the running cache set will read the journal content from NVDIMM by __jnl_rd_nvm_bkt(). The linear addresses of the NVDIMM pages holding the jset are stored in sb.d[SB_JOURNAL_BUCKETS], which were initialized and maintained in previous runs of the cache set.

Note that when bch_journal_read() is called, the linear addresses of the NVDIMM pages are not loaded and initialized yet; it is necessary to call __bch_journal_nvdimm_init() before reading the jset from the NVDIMM pages.
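A user-space sketch of the NVDIMM read path's address math (the flat buffer standing in for NVDIMM pages and the helper name are ours, not the patch's): the jset at sector offset `offset` inside journal bucket `bkt_idx` lives at `sb.d[bkt_idx] + (offset << 9)` and is copied to DRAM with a plain memcpy:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy len_sectors * 512 bytes of jset data out of a journal bucket
 * whose NVDIMM linear base address is stored in sb_d[bkt_idx]. */
static void jnl_rd_nvm(void *dst, const uint64_t *sb_d, unsigned int bkt_idx,
		       unsigned int len_sectors, unsigned int off_sectors)
{
	const char *src = (const char *)(uintptr_t)sb_d[bkt_idx] +
			  ((size_t)off_sectors << 9);

	memcpy(dst, src, (size_t)len_sectors << 9);
}
```

Unlike the block-device path, no bio, endio callback, or closure is needed: the "I/O" is just a load from a byte-addressable mapping.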
Signed-off-by: Coly Li Cc: Jianpeng Ma Cc: Qiaowei Ren --- drivers/md/bcache/journal.c | 81 ++++++++++++++++++++++++++----------- 1 file changed, 57 insertions(+), 24 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index b242fcb47ce2..8d08627f5a89 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -34,60 +34,84 @@ static void journal_read_endio(struct bio *bio) closure_put(cl); } +static struct jset *__jnl_rd_bkt(struct cache *ca, unsigned int bkt_idx, + unsigned int len, unsigned int offset, + struct closure *cl) +{ + sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bkt_idx]); + struct bio *bio = &ca->journal.bio; + struct jset *data = ca->set->journal.w[0].data; + + bio_reset(bio); + bio->bi_iter.bi_sector = bucket + offset; + bio_set_dev(bio, ca->bdev); + bio->bi_iter.bi_size = len << 9; + bio->bi_end_io = journal_read_endio; + bio->bi_private = cl; + bio_set_op_attrs(bio, REQ_OP_READ, 0); + bch_bio_map(bio, data); + + closure_bio_submit(ca->set, bio, cl); + closure_sync(cl); + + /* Indeed journal.w[0].data */ + return data; +} + +static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx, + unsigned int len, unsigned int offset) +{ + void *jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9); + struct jset *data = ca->set->journal.w[0].data; + + memcpy(data, jset_addr, len << 9); + + /* Indeed journal.w[0].data */ + return data; +} + static int journal_read_bucket(struct cache *ca, struct list_head *list, - unsigned int bucket_index) + unsigned int bucket_idx) { struct journal_device *ja = &ca->journal; - struct bio *bio = &ja->bio; struct journal_replay *i; - struct jset *j, *data = ca->set->journal.w[0].data; + struct jset *j; struct closure cl; unsigned int len, left, offset = 0; int ret = 0; - sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bucket_index]); closure_init_stack(&cl); - pr_debug("reading %u\n", bucket_index); + pr_debug("reading %u\n", bucket_idx); while 
(offset < ca->sb.bucket_size) { reread: left = ca->sb.bucket_size - offset; len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS); - bio_reset(bio); - bio->bi_iter.bi_sector = bucket + offset; - bio_set_dev(bio, ca->bdev); - bio->bi_iter.bi_size = len << 9; - - bio->bi_end_io = journal_read_endio; - bio->bi_private = &cl; - bio_set_op_attrs(bio, REQ_OP_READ, 0); - bch_bio_map(bio, data); - - closure_bio_submit(ca->set, bio, &cl); - closure_sync(&cl); + if (!bch_has_feature_nvdimm_meta(&ca->sb)) + j = __jnl_rd_bkt(ca, bucket_idx, len, offset, &cl); + else + j = __jnl_rd_nvm_bkt(ca, bucket_idx, len, offset); /* This function could be simpler now since we no longer write * journal entries that overlap bucket boundaries; this means * the start of a bucket will always have a valid journal entry * if it has any journal entries at all. */ - - j = data; while (len) { struct list_head *where; size_t blocks, bytes = set_bytes(j); if (j->magic != jset_magic(&ca->sb)) { - pr_debug("%u: bad magic\n", bucket_index); + pr_debug("%u: bad magic\n", bucket_idx); return ret; } if (bytes > left << 9 || bytes > PAGE_SIZE << JSET_BITS) { pr_info("%u: too big, %zu bytes, offset %u\n", - bucket_index, bytes, offset); + bucket_idx, bytes, offset); return ret; } @@ -96,7 +120,7 @@ reread: left = ca->sb.bucket_size - offset; if (j->csum != csum_set(j)) { pr_info("%u: bad csum, %zu bytes, offset %u\n", - bucket_index, bytes, offset); + bucket_idx, bytes, offset); return ret; } @@ -158,8 +182,8 @@ reread: left = ca->sb.bucket_size - offset; list_add(&i->list, where); ret = 1; - if (j->seq > ja->seq[bucket_index]) - ja->seq[bucket_index] = j->seq; + if (j->seq > ja->seq[bucket_idx]) + ja->seq[bucket_idx] = j->seq; next_set: offset += blocks * ca->sb.block_size; len -= blocks * ca->sb.block_size; @@ -170,6 +194,8 @@ reread: left = ca->sb.bucket_size - offset; return ret; } +static int __bch_journal_nvdimm_init(struct cache *ca); + int bch_journal_read(struct cache_set *c, struct list_head 
*list)
 {
 #define read_bucket(b) \
@@ -188,6 +214,13 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 	unsigned int i, l, r, m;
 	uint64_t seq;
 
+	/*
+	 * Linear addresses of NVDIMM pages for journaling are not
+	 * initialized yet; do it before reading the jset from NVDIMM pages.
+	 */
+	if (bch_has_feature_nvdimm_meta(&ca->sb))
+		__bch_journal_nvdimm_init(ca);
+
 	bitmap_zero(bitmap, SB_JOURNAL_BUCKETS);
 	pr_debug("%u journal buckets\n", ca->sb.njournal_buckets);

From patchwork Wed Feb 10 05:07:41 2021

From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 19/20] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device
Date: Wed, 10 Feb 2021 13:07:41 +0800
Message-Id: <20210210050742.31237-20-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

This patch adds a sysfs interface register_nvdimm_meta to register an NVDIMM meta device. The sysfs interface file only shows up when CONFIG_BCACHE_NVM_PAGES=y. Then an NVDIMM namespace formatted by bcache-tools can be registered into bcache by, e.g.,

  echo /dev/pmem0 > /sys/fs/bcache/register_nvdimm_meta

Signed-off-by: Coly Li
Cc: Jianpeng Ma
Cc: Qiaowei Ren
---
 drivers/md/bcache/super.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 57c96c16ee16..61fd5802a627 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2415,10 +2415,18 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
 					 struct kobj_attribute *attr,
 					 const char *buffer, size_t size);
+#ifdef CONFIG_BCACHE_NVM_PAGES
+static ssize_t register_nvdimm_meta(struct kobject *k,
+				    struct kobj_attribute *attr,
+				    const char *buffer, size_t size);
+#endif
 
 kobj_attribute_write(register,		register_bcache);
 kobj_attribute_write(register_quiet,	register_bcache);
 kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
+#ifdef CONFIG_BCACHE_NVM_PAGES
+kobj_attribute_write(register_nvdimm_meta, register_nvdimm_meta);
+#endif
 
 static bool bch_is_open_backing(dev_t dev)
 {
@@ -2532,6 +2540,24 @@ static void register_device_async(struct async_reg_args *args)
 	queue_delayed_work(system_wq, &args->reg_work, 10);
 }
 
+#ifdef CONFIG_BCACHE_NVM_PAGES
+static ssize_t register_nvdimm_meta(struct
kobject *k, struct kobj_attribute *attr,
+				    const char *buffer, size_t size)
+{
+	ssize_t ret = size;
+	struct bch_nvm_namespace *ns = bch_register_namespace(buffer);
+
+	if (IS_ERR(ns)) {
+		pr_err("register nvdimm namespace %s for meta device failed.\n",
+		       buffer);
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+#endif
+
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
@@ -2867,6 +2893,9 @@ static int __init bcache_init(void)
 	static const struct attribute *files[] = {
 		&ksysfs_register.attr,
 		&ksysfs_register_quiet.attr,
+#ifdef CONFIG_BCACHE_NVM_PAGES
+		&ksysfs_register_nvdimm_meta.attr,
+#endif
 		&ksysfs_pendings_cleanup.attr,
 		NULL
 	};

From patchwork Wed Feb 10 05:07:42 2021

From: Coly Li
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, Coly Li, Jianpeng Ma, Qiaowei Ren
Subject: [PATCH 20/20] bcache: only initialize nvm-pages allocator when CONFIG_BCACHE_NVM_PAGES configured
Date: Wed, 10 Feb 2021 13:07:42 +0800
Message-Id: <20210210050742.31237-21-colyli@suse.de>
In-Reply-To: <20210210050742.31237-1-colyli@suse.de>
References: <20210210050742.31237-1-colyli@suse.de>

It is unnecessary to initialize the EXPERIMENTAL nvm-pages allocator when CONFIG_BCACHE_NVM_PAGES is not configured. This patch uses "#ifdef CONFIG_BCACHE_NVM_PAGES" to wrap bch_nvm_init() and bch_nvm_exit(), and only calls them when CONFIG_BCACHE_NVM_PAGES is configured.

Signed-off-by: Coly Li
Cc: Jianpeng Ma
Cc: Qiaowei Ren
---
 drivers/md/bcache/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 61fd5802a627..c273eeef0d38 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2845,7 +2845,9 @@ static void bcache_exit(void)
 {
 	bch_debug_exit();
 	bch_request_exit();
+#ifdef CONFIG_BCACHE_NVM_PAGES
 	bch_nvm_exit();
+#endif
 	if (bcache_kobj)
 		kobject_put(bcache_kobj);
 	if (bcache_wq)
@@ -2947,7 +2949,9 @@ static int __init bcache_init(void)
 	bch_debug_init();
 	closure_debug_init();
+#ifdef CONFIG_BCACHE_NVM_PAGES
 	bch_nvm_init();
+#endif
 
 	bcache_is_reboot = false;