
[06/29] bcache: fix race in btree_flush_write()

Message ID 20190614131358.2771-7-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series bcache candidate patches for Linux v5.3

Commit Message

Coly Li June 14, 2019, 1:13 p.m. UTC
There is a race between mca_reap(), btree_node_free() and the journal code
btree_flush_write(), which results in a very rare and strange deadlock or
panic that is very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; the select-and-flush is a two-step operation. Between these
two steps, the following may happen inside the race window,
- The selected btree node was reaped by mca_reap() and its memory
  allocated to another requester for another btree node.
- The selected btree node was flushed and released by the mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it first
takes b->write_lock with mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree
node whose write_lock is held already, a deadlock very probably
happens here. A worse case is that the memory of the selected btree node
has been released; then any reference to this btree node (e.g.
b->write_lock) will trigger a NULL pointer dereference panic.
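
To make the race window concrete, here is a condensed, illustrative sketch
of the pre-patch select-and-flush flow. The compressed selection logic and
the omitted re-check/retry are mine; the real code being removed is
visible in the journal.c hunk further down,
	/* Illustrative sketch of the pre-patch btree_flush_write() logic */
	static void btree_flush_write(struct cache_set *c)
	{
		struct btree *b, *best = NULL;
		unsigned int i;

		/* Step 1: pick the btree node holding the oldest journal pin */
		for_each_cached_btree(b, c, i)
			if (btree_current_write(b)->journal &&
			    (!best || journal_pin_cmp(c,
					btree_current_write(best)->journal,
					btree_current_write(b)->journal)))
				best = b;

		/*
		 * Race window: before 'best' is locked below, mca_reap() may
		 * hand its memory to another btree node (possible deadlock on
		 * write_lock), or bch_mca_scan() may flush and release it
		 * (possible NULL pointer dereference).
		 */

		/* Step 2: flush the selected btree node */
		if (best) {
			mutex_lock(&best->write_lock);
			__bch_btree_node_write(best, NULL);
			mutex_unlock(&best->write_lock);
		}
	}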

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selects 128 btree nodes and flushes
them one by one over quite a long period of time.

Such a race was not easy to reproduce before. On a Lenovo SR650 server
with 48 Xeon cores, configured with 1 NVMe SSD as the cache device and an
MD raid0 device assembled from 3 NVMe SSDs as the backing device, this
race can be observed roughly once every 10,000 times btree_flush_write()
gets called. Both deadlock and kernel panic happened as aftermath of the
race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when btree nodes are selected, and cleared after the btree nodes
are flushed. Then when mca_reap() encounters a btree node with this bit
set, the node is skipped. Since mca_reap() only reaps btree nodes without
the BTREE_NODE_journal_flush flag, such a race is avoided.
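
As a condensed, illustrative view of the flag's lifecycle (the real code
is in the btree.c and journal.c hunks below),
	/* journal.c, btree_flush_write(): mark the node while it is selected */
	mutex_lock(&b->write_lock);
	set_btree_node_journal_flush(b);
	mutex_unlock(&b->write_lock);
	...
	/* after the node has been written out */
	__bch_btree_node_write(b, NULL);
	clear_bit(BTREE_NODE_journal_flush, &b->flags);

	/* btree.c, mca_reap(): refuse to reap a node being flushed by journal code */
	mutex_lock(&b->write_lock);
	if (btree_node_journal_flush(b)) {
		mutex_unlock(&b->write_lock);
		goto out_unlock;	/* skip this node */
	}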

One corner case should be noticed, and that is btree_node_free(). It
might be called in some error handling code paths. For example, the
following code piece from btree_split(),
	2149 err_free2:
	2150         bkey_put(b->c, &n2->key);
	2151         btree_node_free(n2);
	2152         rw_unlock(true, n2);
	2153 err_free1:
	2154         bkey_put(b->c, &n1->key);
	2155         btree_node_free(n1);
	2156         rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
checked here. If btree_node_free() is called directly in such an error
handling path, and the btree node has the BTREE_NODE_journal_flush bit
set, just wait for 1 jiffy and retry. In this case the btree node won't
be skipped; the retry continues until the BTREE_NODE_journal_flush bit is
cleared, and then the btree node memory is freed.
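
The corresponding wait-and-retry in btree_node_free() is short (the full
context is in the btree.c hunk below),
	retry:
		mutex_lock(&b->write_lock);
		if (btree_node_journal_flush(b)) {
			mutex_unlock(&b->write_lock);
			/* journal code is flushing this node; back off 1 jiffy */
			schedule_timeout_interruptible(1);
			goto retry;
		}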

Waiting for 1 jiffy inside btree_node_free() does not hurt performance
much here, for the following reasons,
- The error handling code path is not frequently executed, and the race
  inside this cold path should be very rare. If the very rare race does
  happen in the cold code path, waiting 1 jiffy should be acceptable.
- If btree_node_free() is called inside mca_reap(), it means the bit
  BTREE_NODE_journal_flush has already been checked, so no wait will
  happen here.

Besides the above fix, the way flushing btree nodes are selected is also
changed in this patch. Let me explain what changes in this patch.

- Use another spinlock, journal.flush_write_lock, to replace the very
  hot journal.lock. We don't have to use journal.lock here; selecting
  candidate btree nodes takes a lot of time, and holding journal.lock
  here would block other journaling threads and drop the overall I/O
  performance.
- Only select btree nodes to flush from the c->btree_cache list. When
  the machine has a large amount of system memory, the mca cache may hold
  a huge number of cached btree nodes. Iterating all the cached nodes
  takes a lot of CPU time, and most of the nodes on the
  c->btree_cache_freeable and c->btree_cache_freed lists are already
  cleared and have no need to be flushed. So only traversing the mca list
  c->btree_cache to select btree nodes to flush should be enough for most
  cases.
- Don't iterate the whole c->btree_cache list; only select the first
  BTREE_FLUSH_NR (32) btree nodes, in reverse order, to flush (a condensed
  sketch follows this list). Iterating all btree nodes in c->btree_cache
  and selecting those with the oldest journal pins consumes a huge number
  of CPU cycles if the list is large (pushing and popping a node into/out
  of a heap is expensive). The last several dirty btree nodes at the tail
  of the c->btree_cache list are the earliest allocated and cached btree
  nodes, so they relate to the oldest journal pin btree nodes. Therefore
  flushing only BTREE_FLUSH_NR btree nodes from the tail of c->btree_cache
  probably covers the oldest journal pin btree nodes.
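
A condensed sketch of the new selection loop (the complete version,
including the error checks omitted here, is in the journal.c hunk below),
	n = 0;
	mutex_lock(&c->bucket_lock);
	/* walk from the tail: the earliest allocated, likely oldest pinned nodes */
	list_for_each_entry_reverse(b, &c->btree_cache, list) {
		mutex_lock(&b->write_lock);
		if (btree_node_dirty(b) && btree_current_write(b)->journal) {
			set_btree_node_journal_flush(b);
			btree_nodes[n++] = b;
		}
		mutex_unlock(&b->write_lock);
		if (n == BTREE_FLUSH_NR)	/* at most 32 candidates */
			break;
	}
	mutex_unlock(&c->bucket_lock);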

In my testing, the above change decreases CPU consumption by 50%+ when
the journal space is full. Sometimes IOPS drops to 0 for 5-8 seconds;
compared with the previous code blocking I/O for 120+ seconds, this is
much better. Maybe there is room to improve in the future, but at this
moment the fix looks fine and performs well in my testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c   | 15 +++++++-
 drivers/md/bcache/btree.h   |  2 +
 drivers/md/bcache/journal.c | 93 ++++++++++++++++++++++++++++++++++-----------
 drivers/md/bcache/journal.h |  4 ++
 4 files changed, 90 insertions(+), 24 deletions(-)

Comments

Yaowei Bai June 27, 2019, 9:16 a.m. UTC | #1
On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
> There is a race between mca_reap(), btree_node_free() and the journal code
> btree_flush_write(), which results in a very rare and strange deadlock or
> panic that is very hard to reproduce.
> 
> Let me explain how the race happens. In btree_flush_write() the btree
> node with the oldest journal pin is selected, then it is flushed to the
> cache device; the select-and-flush is a two-step operation. Between these
> two steps, the following may happen inside the race window,
> - The selected btree node was reaped by mca_reap() and its memory
>   allocated to another requester for another btree node.
> - The selected btree node was flushed and released by the mca
>   shrink callback bch_mca_scan().
> When btree_flush_write() tries to flush the selected btree node, it first
> takes b->write_lock with mutex_lock(). If the race happens and the
> memory of the selected btree node has been allocated to another btree
> node whose write_lock is held already, a deadlock very probably
> happens here. A worse case is that the memory of the selected btree node
> has been released; then any reference to this btree node (e.g.
> b->write_lock) will trigger a NULL pointer dereference panic.
> 
> This race was introduced in commit cafe56359144 ("bcache: A block layer
> cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
> occupancy during journal"), which selects 128 btree nodes and flushes
> them one by one over quite a long period of time.
> 
> Such a race was not easy to reproduce before. On a Lenovo SR650 server
> with 48 Xeon cores, configured with 1 NVMe SSD as the cache device and an
> MD raid0 device assembled from 3 NVMe SSDs as the backing device, this
> race can be observed roughly once every 10,000 times btree_flush_write()
> gets called. Both deadlock and kernel panic happened as aftermath of the
> race.
> 
> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
> is set when btree nodes are selected, and cleared after the btree nodes
> are flushed. Then when mca_reap() encounters a btree node with this bit
> set, the node is skipped. Since mca_reap() only reaps btree nodes without
> the BTREE_NODE_journal_flush flag, such a race is avoided.
> 
> One corner case should be noticed, and that is btree_node_free(). It
> might be called in some error handling code paths. For example, the
> following code piece from btree_split(),
> 	2149 err_free2:
> 	2150         bkey_put(b->c, &n2->key);
> 	2151         btree_node_free(n2);
> 	2152         rw_unlock(true, n2);
> 	2153 err_free1:
> 	2154         bkey_put(b->c, &n1->key);
> 	2155         btree_node_free(n1);
> 	2156         rw_unlock(true, n1);
> At lines 2151 and 2155, the btree nodes n2 and n1 are released without
> going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
> checked here. If btree_node_free() is called directly in such an error
> handling path, and the btree node has the BTREE_NODE_journal_flush bit
> set, just wait for 1 jiffy and retry. In this case the btree node won't
> be skipped; the retry continues until the BTREE_NODE_journal_flush bit is
> cleared, and then the btree node memory is freed.
> 
> Waiting for 1 jiffy inside btree_node_free() does not hurt performance
> much here, for the following reasons,
> - The error handling code path is not frequently executed, and the race
>   inside this cold path should be very rare. If the very rare race does
>   happen in the cold code path, waiting 1 jiffy should be acceptable.
> - If btree_node_free() is called inside mca_reap(), it means the bit
>   BTREE_NODE_journal_flush has already been checked, so no wait will
>   happen here.
> 
> Besides the above fix, the way flushing btree nodes are selected is also
> changed in this patch. Let me explain what changes in this patch.

Then this change should be split into another patch. :)

>
Coly Li June 27, 2019, 11:47 a.m. UTC | #2
On 2019/6/27 5:16 PM, Yaowei Bai wrote:
> On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
>> There is a race between mca_reap(), btree_node_free() and the journal code
>> btree_flush_write(), which results in a very rare and strange deadlock or
>> panic that is very hard to reproduce.
>>
>> Let me explain how the race happens. In btree_flush_write() the btree
>> node with the oldest journal pin is selected, then it is flushed to the
>> cache device; the select-and-flush is a two-step operation. Between these
>> two steps, the following may happen inside the race window,
>> - The selected btree node was reaped by mca_reap() and its memory
>>   allocated to another requester for another btree node.
>> - The selected btree node was flushed and released by the mca
>>   shrink callback bch_mca_scan().
>> When btree_flush_write() tries to flush the selected btree node, it first
>> takes b->write_lock with mutex_lock(). If the race happens and the
>> memory of the selected btree node has been allocated to another btree
>> node whose write_lock is held already, a deadlock very probably
>> happens here. A worse case is that the memory of the selected btree node
>> has been released; then any reference to this btree node (e.g.
>> b->write_lock) will trigger a NULL pointer dereference panic.
>>
>> This race was introduced in commit cafe56359144 ("bcache: A block layer
>> cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
>> occupancy during journal"), which selects 128 btree nodes and flushes
>> them one by one over quite a long period of time.
>>
>> Such a race was not easy to reproduce before. On a Lenovo SR650 server
>> with 48 Xeon cores, configured with 1 NVMe SSD as the cache device and an
>> MD raid0 device assembled from 3 NVMe SSDs as the backing device, this
>> race can be observed roughly once every 10,000 times btree_flush_write()
>> gets called. Both deadlock and kernel panic happened as aftermath of the
>> race.
>>
>> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
>> is set when btree nodes are selected, and cleared after the btree nodes
>> are flushed. Then when mca_reap() encounters a btree node with this bit
>> set, the node is skipped. Since mca_reap() only reaps btree nodes without
>> the BTREE_NODE_journal_flush flag, such a race is avoided.
>>
>> One corner case should be noticed, and that is btree_node_free(). It
>> might be called in some error handling code paths. For example, the
>> following code piece from btree_split(),
>> 	2149 err_free2:
>> 	2150         bkey_put(b->c, &n2->key);
>> 	2151         btree_node_free(n2);
>> 	2152         rw_unlock(true, n2);
>> 	2153 err_free1:
>> 	2154         bkey_put(b->c, &n1->key);
>> 	2155         btree_node_free(n1);
>> 	2156         rw_unlock(true, n1);
>> At lines 2151 and 2155, the btree nodes n2 and n1 are released without
>> going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
>> checked here. If btree_node_free() is called directly in such an error
>> handling path, and the btree node has the BTREE_NODE_journal_flush bit
>> set, just wait for 1 jiffy and retry. In this case the btree node won't
>> be skipped; the retry continues until the BTREE_NODE_journal_flush bit is
>> cleared, and then the btree node memory is freed.
>>
>> Waiting for 1 jiffy inside btree_node_free() does not hurt performance
>> much here, for the following reasons,
>> - The error handling code path is not frequently executed, and the race
>>   inside this cold path should be very rare. If the very rare race does
>>   happen in the cold code path, waiting 1 jiffy should be acceptable.
>> - If btree_node_free() is called inside mca_reap(), it means the bit
>>   BTREE_NODE_journal_flush has already been checked, so no wait will
>>   happen here.
>>
>> Besides the above fix, the way flushing btree nodes are selected is also
>> changed in this patch. Let me explain what changes in this patch.
> 
> Then this change should be split into another patch. :)

Hi Bai,

Sure, it makes sense. I also realize splitting it into two patches may be
helpful for long-term kernel maintainers to backport the patches without
breaking KABI.

I will send a two-patch version in the for-5.3 submission.

Thanks.
Coly Li June 27, 2019, 12:45 p.m. UTC | #3
On 2019/6/27 7:47 PM, Coly Li wrote:
> On 2019/6/27 5:16 PM, Yaowei Bai wrote:
>> On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
>>> There is a race between mca_reap(), btree_node_free() and the journal code
>>> btree_flush_write(), which results in a very rare and strange deadlock or
>>> panic that is very hard to reproduce.
>>>
>>> Let me explain how the race happens. In btree_flush_write() the btree
>>> node with the oldest journal pin is selected, then it is flushed to the
>>> cache device; the select-and-flush is a two-step operation. Between these
>>> two steps, the following may happen inside the race window,
>>> - The selected btree node was reaped by mca_reap() and its memory
>>>   allocated to another requester for another btree node.
>>> - The selected btree node was flushed and released by the mca
>>>   shrink callback bch_mca_scan().
>>> When btree_flush_write() tries to flush the selected btree node, it first
>>> takes b->write_lock with mutex_lock(). If the race happens and the
>>> memory of the selected btree node has been allocated to another btree
>>> node whose write_lock is held already, a deadlock very probably
>>> happens here. A worse case is that the memory of the selected btree node
>>> has been released; then any reference to this btree node (e.g.
>>> b->write_lock) will trigger a NULL pointer dereference panic.
>>>
>>> This race was introduced in commit cafe56359144 ("bcache: A block layer
>>> cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
>>> occupancy during journal"), which selects 128 btree nodes and flushes
>>> them one by one over quite a long period of time.
>>>
>>> Such a race was not easy to reproduce before. On a Lenovo SR650 server
>>> with 48 Xeon cores, configured with 1 NVMe SSD as the cache device and an
>>> MD raid0 device assembled from 3 NVMe SSDs as the backing device, this
>>> race can be observed roughly once every 10,000 times btree_flush_write()
>>> gets called. Both deadlock and kernel panic happened as aftermath of the
>>> race.
>>>
>>> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
>>> is set when btree nodes are selected, and cleared after the btree nodes
>>> are flushed. Then when mca_reap() encounters a btree node with this bit
>>> set, the node is skipped. Since mca_reap() only reaps btree nodes without
>>> the BTREE_NODE_journal_flush flag, such a race is avoided.
>>>
>>> One corner case should be noticed, and that is btree_node_free(). It
>>> might be called in some error handling code paths. For example, the
>>> following code piece from btree_split(),
>>> 	2149 err_free2:
>>> 	2150         bkey_put(b->c, &n2->key);
>>> 	2151         btree_node_free(n2);
>>> 	2152         rw_unlock(true, n2);
>>> 	2153 err_free1:
>>> 	2154         bkey_put(b->c, &n1->key);
>>> 	2155         btree_node_free(n1);
>>> 	2156         rw_unlock(true, n1);
>>> At lines 2151 and 2155, the btree nodes n2 and n1 are released without
>>> going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
>>> checked here. If btree_node_free() is called directly in such an error
>>> handling path, and the btree node has the BTREE_NODE_journal_flush bit
>>> set, just wait for 1 jiffy and retry. In this case the btree node won't
>>> be skipped; the retry continues until the BTREE_NODE_journal_flush bit is
>>> cleared, and then the btree node memory is freed.
>>>
>>> Waiting for 1 jiffy inside btree_node_free() does not hurt performance
>>> much here, for the following reasons,
>>> - The error handling code path is not frequently executed, and the race
>>>   inside this cold path should be very rare. If the very rare race does
>>>   happen in the cold code path, waiting 1 jiffy should be acceptable.
>>> - If btree_node_free() is called inside mca_reap(), it means the bit
>>>   BTREE_NODE_journal_flush has already been checked, so no wait will
>>>   happen here.
>>>
>>> Besides the above fix, the way flushing btree nodes are selected is also
>>> changed in this patch. Let me explain what changes in this patch.
>>
>> Then this change should be split into another patch. :)
> 
> Hi Bai,
> 
> Sure, it makes sense. I also realize splitting it into two patches may be
> helpful for long-term kernel maintainers to backport the patches without
> breaking KABI.
> 
> I will send a two-patch version in the for-5.3 submission.

I just realized that breaking KABI is unavoidable, but splitting this
patch into two still makes sense; the performance optimization should not
go into the race fix.

Thanks.

Patch

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 773f5fdad25f..881dc238c7cb 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -656,6 +656,13 @@  static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
 	}
 
 	mutex_lock(&b->write_lock);
+	/* don't reap btree node handling in btree_flush_write() */
+	if (btree_node_journal_flush(b)) {
+		pr_debug("bnode %p is flushing by journal, ignore", b);
+		mutex_unlock(&b->write_lock);
+		goto out_unlock;
+	}
+
 	if (btree_node_dirty(b))
 		__bch_btree_node_write(b, &cl);
 	mutex_unlock(&b->write_lock);
@@ -1067,8 +1074,14 @@  static void btree_node_free(struct btree *b)
 
 	BUG_ON(b == b->c->root);
 
+retry:
 	mutex_lock(&b->write_lock);
-
+	if (btree_node_journal_flush(b)) {
+		mutex_unlock(&b->write_lock);
+		pr_debug("bnode %p journal_flush set, retry", b);
+		schedule_timeout_interruptible(1);
+		goto retry;
+	}
 	if (btree_node_dirty(b))
 		btree_complete_write(b, btree_current_write(b));
 	clear_bit(BTREE_NODE_dirty, &b->flags);
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d1c72ef64edf..76cfd121a486 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -158,11 +158,13 @@  enum btree_flags {
 	BTREE_NODE_io_error,
 	BTREE_NODE_dirty,
 	BTREE_NODE_write_idx,
+	BTREE_NODE_journal_flush,
 };
 
 BTREE_FLAG(io_error);
 BTREE_FLAG(dirty);
 BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
 
 static inline struct btree_write *btree_current_write(struct btree *b)
 {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a9f1703a8514..303ef3d1fbc6 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -419,41 +419,87 @@  int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 static void btree_flush_write(struct cache_set *c)
 {
-	/*
-	 * Try to find the btree node with that references the oldest journal
-	 * entry, best is our current candidate and is locked if non NULL:
-	 */
-	struct btree *b, *best;
-	unsigned int i;
+	struct btree *b, *btree_nodes[BTREE_FLUSH_NR];
+	unsigned int i, n;
+
+	if (c->journal.btree_flushing)
+		return;
+
+	spin_lock(&c->journal.flush_write_lock);
+	if (c->journal.btree_flushing) {
+		spin_unlock(&c->journal.flush_write_lock);
+		return;
+	}
+	c->journal.btree_flushing = true;
+	spin_unlock(&c->journal.flush_write_lock);
 
 	atomic_long_inc(&c->flush_write);
-retry:
-	best = NULL;
-
-	for_each_cached_btree(b, c, i)
-		if (btree_current_write(b)->journal) {
-			if (!best)
-				best = b;
-			else if (journal_pin_cmp(c,
-					btree_current_write(best)->journal,
-					btree_current_write(b)->journal)) {
-				best = b;
-			}
+	memset(btree_nodes, 0, sizeof(btree_nodes));
+	n = 0;
+
+	mutex_lock(&c->bucket_lock);
+	list_for_each_entry_reverse(b, &c->btree_cache, list) {
+		if (btree_node_journal_flush(b))
+			pr_err("BUG: flush_write bit should not be set here!");
+
+		mutex_lock(&b->write_lock);
+
+		if (!btree_node_dirty(b)) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
+		if (!btree_current_write(b)->journal) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
+		set_btree_node_journal_flush(b);
+
+		mutex_unlock(&b->write_lock);
+
+		btree_nodes[n++] = b;
+		if (n == BTREE_FLUSH_NR)
+			break;
+	}
+	mutex_unlock(&c->bucket_lock);
+
+	for (i = 0; i < n; i++) {
+		b = btree_nodes[i];
+		if (!b) {
+			pr_err("BUG: btree_nodes[%d] is NULL", i);
+			continue;
+		}
+
+		/* safe to check without holding b->write_lock */
+		if (!btree_node_journal_flush(b)) {
+			pr_err("BUG: bnode %p: journal_flush bit cleaned", b);
+			continue;
 		}
 
-	b = best;
-	if (b) {
 		mutex_lock(&b->write_lock);
 		if (!btree_current_write(b)->journal) {
 			mutex_unlock(&b->write_lock);
-			/* We raced */
-			atomic_long_inc(&c->retry_flush_write);
-			goto retry;
+			pr_debug("bnode %p: written by others", b);
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
+			continue;
+		}
+
+		if (!btree_node_dirty(b)) {
+			pr_debug("bnode %p: dirty bit cleaned by others", b);
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
+			mutex_unlock(&b->write_lock);
+			continue;
 		}
 
 		__bch_btree_node_write(b, NULL);
+		clear_bit(BTREE_NODE_journal_flush, &b->flags);
 		mutex_unlock(&b->write_lock);
 	}
+
+	spin_lock(&c->journal.flush_write_lock);
+	c->journal.btree_flushing = false;
+	spin_unlock(&c->journal.flush_write_lock);
 }
 
 #define last_seq(j)	((j)->seq - fifo_used(&(j)->pin) + 1)
@@ -871,6 +917,7 @@  int bch_journal_alloc(struct cache_set *c)
 	struct journal *j = &c->journal;
 
 	spin_lock_init(&j->lock);
+	spin_lock_init(&j->flush_write_lock);
 	INIT_DELAYED_WORK(&j->work, journal_write_work);
 
 	c->journal_delay_ms = 100;
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index 66f0facff84b..aeed791f05e7 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -103,6 +103,8 @@  struct journal_write {
 /* Embedded in struct cache_set */
 struct journal {
 	spinlock_t		lock;
+	spinlock_t		flush_write_lock;
+	bool			btree_flushing;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -154,6 +156,8 @@  struct journal_device {
 	struct bio_vec		bv[8];
 };
 
+#define BTREE_FLUSH_NR	32
+
 #define journal_pin_cmp(c, l, r)				\
 	(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))