From patchwork Tue Aug 21 21:54:37 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Liu Bo X-Patchwork-Id: 10572421 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 409B6921 for ; Tue, 21 Aug 2018 21:55:39 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4378D2AD22 for ; Tue, 21 Aug 2018 21:55:39 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 376972AD68; Tue, 21 Aug 2018 21:55:39 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BF8022AD22 for ; Tue, 21 Aug 2018 21:55:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727863AbeHVBRb (ORCPT ); Tue, 21 Aug 2018 21:17:31 -0400 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:47867 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727797AbeHVBRb (ORCPT ); Tue, 21 Aug 2018 21:17:31 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R121e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01429;MF=bo.liu@linux.alibaba.com;NM=1;PH=DS;RN=4;SR=0;TI=SMTPD_---0T7.-k7r_1534888531; Received: from localhost(mailfrom:bo.liu@linux.alibaba.com fp:SMTPD_---0T7.-k7r_1534888531) by smtp.aliyun-inc.com(127.0.0.1); Wed, 22 Aug 2018 05:55:31 +0800 From: Liu Bo To: Cc: Chris Mason , =?utf-8?q?_Holger_Hoffst=C3=A4tte?= , Nikolay Borisov Subject: [PATCH v2] Btrfs: kill btrfs_clear_path_blocking Date: Wed, 22 Aug 2018 05:54:37 +0800 Message-Id: <1534888477-15585-1-git-send-email-bo.liu@linux.alibaba.com> X-Mailer: git-send-email 1.8.3.1 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Btrfs's btree locking has two modes, spinning mode and blocking mode, while searching btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both of reader and writer need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. When switching to blocking mode from spinning mode, it consists of step 1) bumping up blocking readers counter and step 2) read_unlock()/write_unlock(), this has caused serious ping-pong effect if there're a great amount of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/def $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there're ~50% less write lock acquisition and the 95% max latency that write lock takes also reduces to ~100ms from >500ms. -------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 |****************************************| 2 -> 3 : 37147 | | 4 -> 7 : 20452 | | 8 -> 15 : 13131 | | 16 -> 31 : 3877 | | 32 -> 63 : 3900 | | 64 -> 127 : 2612 | | 128 -> 255 : 974 | | 256 -> 511 : 165 | | 512 -> 1023 : 13 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 |****************************************| 2 -> 3 : 2146 | | 4 -> 7 : 190 | | 8 -> 15 : 38 | | 16 -> 31 : 4 | | -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 |****************************************| 2 -> 3 : 6800 | | 4 -> 7 : 3664 | | 8 -> 15 : 2145 | | 16 -> 31 : 809 | | 32 -> 63 : 219 | | 64 -> 127 : 10 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 |****************************************| 2 -> 3 : 2383 | | 4 -> 7 : 601 | | 8 -> 15 : 92 | | 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo --- v2: Add latency histogram for btrfs_tree*_lock to show the difference before and after, the patch itself remains unchanged. fs/btrfs/ctree.c | 57 ++++-------------------------------------------- fs/btrfs/ctree.h | 2 -- fs/btrfs/delayed-inode.c | 3 --- 3 files changed, 4 insertions(+), 58 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index d436fb4c002e..8b31caa60b0a 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p) } } -/* - * reset all the locked nodes in the patch to spinning locks. - * - * held is used to keep lockdep happy, when lockdep is enabled - * we set held to a blocking lock before we go around and - * retake all the spinlocks in the path. You can safely use NULL - * for held - */ -noinline void btrfs_clear_path_blocking(struct btrfs_path *p, - struct extent_buffer *held, int held_rw) -{ - int i; - - if (held) { - btrfs_set_lock_blocking_rw(held, held_rw); - if (held_rw == BTRFS_WRITE_LOCK) - held_rw = BTRFS_WRITE_LOCK_BLOCKING; - else if (held_rw == BTRFS_READ_LOCK) - held_rw = BTRFS_READ_LOCK_BLOCKING; - } - btrfs_set_path_blocking(p); - - for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) { - if (p->nodes[i] && p->locks[i]) { - btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]); - if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING) - p->locks[i] = BTRFS_WRITE_LOCK; - else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING) - p->locks[i] = BTRFS_READ_LOCK; - } - } - - if (held) - btrfs_clear_lock_blocking_rw(held, held_rw); -} - /* this also releases the path */ void btrfs_free_path(struct btrfs_path *p) { @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root( } } - btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK); btrfs_tree_read_unlock_blocking(eb); free_extent_buffer(eb); @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level) btrfs_set_path_blocking(p); reada_for_balance(fs_info, p, level); sret = split_node(trans, root, p, level); - btrfs_clear_path_blocking(p, NULL, 0); BUG_ON(sret > 0); if (sret) { @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level) btrfs_set_path_blocking(p); reada_for_balance(fs_info, p, level); sret = balance_level(trans, root, p, level); - btrfs_clear_path_blocking(p, NULL, 0); if (sret) { ret = sret; @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root, } cow_done: p->nodes[level] = b; - btrfs_clear_path_blocking(p, NULL, 0); + /* + * Leave path with blocking locks to avoid massive + * lock context switch, this is made on purpose. + */ /* * we have a lock on b and as long as we aren't changing @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root, if (!err) { btrfs_set_path_blocking(p); btrfs_tree_lock(b); - btrfs_clear_path_blocking(p, b, - BTRFS_WRITE_LOCK); } p->locks[level] = BTRFS_WRITE_LOCK; } else { @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root, if (!err) { btrfs_set_path_blocking(p); btrfs_tree_read_lock(b); - btrfs_clear_path_blocking(p, b, - BTRFS_READ_LOCK); } p->locks[level] = BTRFS_READ_LOCK; } @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root, btrfs_set_path_blocking(p); err = split_leaf(trans, root, key, p, ins_len, ret == 0); - btrfs_clear_path_blocking(p, NULL, 0); BUG_ON(err > 0); if (err) { @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key, while (b) { level = btrfs_header_level(b); p->nodes[level] = b; - btrfs_clear_path_blocking(p, NULL, 0); /* * we have a lock on b and as long as we aren't changing @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key, if (!err) { btrfs_set_path_blocking(p); btrfs_tree_read_lock(b); - btrfs_clear_path_blocking(p, b, - BTRFS_READ_LOCK); } b = tree_mod_log_rewind(fs_info, p, b, time_seq); if (!b) { @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key, path->locks[level - 1] = BTRFS_READ_LOCK; path->nodes[level - 1] = cur; unlock_up(path, level, 1, 0, NULL); - btrfs_clear_path_blocking(path, NULL, 0); } out: path->keep_locks = keep_locks; @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path, if (!ret) { btrfs_set_path_blocking(path); btrfs_tree_read_lock(next); - btrfs_clear_path_blocking(path, next, - BTRFS_READ_LOCK); } next_rw_lock = BTRFS_READ_LOCK; } @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path, if (!ret) { btrfs_set_path_blocking(path); btrfs_tree_read_lock(next); - btrfs_clear_path_blocking(path, next, - BTRFS_READ_LOCK); } next_rw_lock = BTRFS_READ_LOCK; } diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 318be7864072..1aeed3c0e949 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, struct btrfs_path *btrfs_alloc_path(void); void btrfs_free_path(struct btrfs_path *p); void btrfs_set_path_blocking(struct btrfs_path *p); -void btrfs_clear_path_blocking(struct btrfs_path *p, - struct extent_buffer *held, int held_rw); void btrfs_unlock_up_safe(struct btrfs_path *p, int level); int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index f51b509f2d9b..db9f45082c61 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root, i++; } - /* reset all the locked nodes in the patch to spinning locks. */ - btrfs_clear_path_blocking(path, NULL, 0); - /* insert the keys of the items */ setup_items_for_insert(root, path, keys, data_size, total_data_size, total_size, nitems);