[35/37] bcache: fix race in btree_flush_write()

There is a race between mca_reap(), btree_node_free() and the journal
code in btree_flush_write(), which results in very rare and strange
deadlocks or panics that are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device. Select-and-flush is a two-step operation, and between the
two steps the following may happen inside the race window,
- The selected btree node is reaped by mca_reap() and allocated to
  another requester for a different btree node.
- The selected btree node is selected, flushed and released by the mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it
first takes b->write_lock by mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree
node whose write_lock is already held, a deadlock will very probably
happen here. A worse case is that the memory of the selected btree node
has been released; then any reference to this btree node (e.g.
b->write_lock) will trigger a NULL pointer dereference panic.
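
The window described above can be modelled in user space. The following
is only a minimal, single-threaded sketch and not bcache code: struct
node_slot, select_oldest(), reap_and_reuse() and flush_slot() are
hypothetical names made up for illustration. It shows that a pointer
obtained in the select step may refer to a different logical node by
the time the flush step runs.

```c
#include <stddef.h>

/* Hypothetical model of a btree node's memory; not a bcache structure. */
struct node_slot {
	int owner_id;     /* which logical btree node lives in this memory */
	int journal_seq;  /* stand-in for the oldest journal pin */
	int dirty;
};

/* Step 1: select the dirty slot with the oldest journal sequence. */
static struct node_slot *select_oldest(struct node_slot *pool, size_t n)
{
	struct node_slot *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!pool[i].dirty)
			continue;
		if (!best || pool[i].journal_seq < best->journal_seq)
			best = &pool[i];
	}
	return best;
}

/* What a reaper may do inside the window: give the memory to another node. */
static void reap_and_reuse(struct node_slot *slot, int new_owner)
{
	slot->owner_id = new_owner;
	slot->journal_seq = 0;
	slot->dirty = 0;
}

/* Step 2: flush. Returns the owner that was actually flushed. */
static int flush_slot(struct node_slot *slot)
{
	slot->dirty = 0;
	return slot->owner_id;
}
```

If reap_and_reuse() runs between select_oldest() and flush_slot(), the
flush operates on the new owner of the memory rather than on the node
that was selected.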

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one over a quite long time period.

This race was not easy to reproduce before. On a Lenovo SR650 server
with 48 Xeon cores, configured with 1 NVMe SSD as cache device and a MD
raid0 device assembled from 3 NVMe SSDs as backing device, this race can
be observed about once in every 10,000 calls to btree_flush_write().
Both deadlock and kernel panic happened as the aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after the btree nodes are
flushed. When mca_reap() selects a btree node with this bit set, the
btree node is skipped. Since mca_reap() only reaps btree nodes without
the BTREE_NODE_journal_flush flag, the race is avoided.
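
The flag protocol can be sketched as follows. This is a simplified
user-space model under stated assumptions, not the code of this patch:
struct model_node, select_for_flush(), flush_done() and try_reap() are
hypothetical names, NODE_JOURNAL_FLUSH only stands in for
BTREE_NODE_journal_flush, and the model is single-threaded, so it leaves
out the locking a real implementation needs around the flag.

```c
#include <stdbool.h>
#include <stddef.h>

#define NODE_DIRTY         (1u << 0)
#define NODE_JOURNAL_FLUSH (1u << 1) /* models BTREE_NODE_journal_flush */

/* Hypothetical model node; not struct btree. */
struct model_node {
	unsigned int flags;
	int journal_seq;
};

/* Select the dirty node with the oldest sequence and mark it as selected. */
static struct model_node *select_for_flush(struct model_node *pool, size_t n)
{
	struct model_node *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(pool[i].flags & NODE_DIRTY))
			continue;
		if (!best || pool[i].journal_seq < best->journal_seq)
			best = &pool[i];
	}
	if (best)
		best->flags |= NODE_JOURNAL_FLUSH;
	return best;
}

/* Called after the flush: the node is clean and no longer selected. */
static void flush_done(struct model_node *b)
{
	b->flags &= ~(NODE_DIRTY | NODE_JOURNAL_FLUSH);
}

/* The reaper skips any node that the flusher has selected. */
static bool try_reap(struct model_node *b)
{
	if (b->flags & NODE_JOURNAL_FLUSH)
		return false; /* skipped, the flusher still owns it */
	b->flags = 0;     /* memory may now be reused */
	return true;
}
```

In this model a node returned by select_for_flush() cannot be reaped
until flush_done() has been called on it, which closes the window
between the select and flush steps.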

One corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code paths. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path
and the selected btree node has the BTREE_NODE_journal_flush bit set,
just delay for 1 us and retry. In this case the btree node won't be
skipped; btree_node_free() retries until the BTREE_NODE_journal_flush
bit is cleared, and then frees the btree node memory.
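
The retry behaviour can be sketched like this. Again this is only a
hypothetical user-space model and not the code of this patch:
struct free_model_node, model_delay() and model_node_free() are made-up
names, and model_delay() stands in for the 1 us delay by simulating the
flusher making progress, so that the example runs without threads.

```c
#include <stdbool.h>

#define NODE_JOURNAL_FLUSH (1u << 1) /* models BTREE_NODE_journal_flush */

/* Hypothetical model node; not struct btree. */
struct free_model_node {
	unsigned int flags;
	bool freed;
	int flush_ticks_left; /* delays until the simulated flush finishes */
};

/*
 * Stand-in for "delay 1 us": each call is one tick of simulated flusher
 * progress, and the flag is cleared when the countdown reaches zero.
 */
static void model_delay(struct free_model_node *b)
{
	if (b->flush_ticks_left > 0 && --b->flush_ticks_left == 0)
		b->flags &= ~NODE_JOURNAL_FLUSH;
}

/*
 * Unlike the reaper, the free path must not skip the node: it waits
 * until the flag is cleared and then frees. Returns the retry count.
 * Note: this loops forever if nothing ever clears the flag.
 */
static unsigned int model_node_free(struct free_model_node *b)
{
	unsigned int retries = 0;

	while (b->flags & NODE_JOURNAL_FLUSH) {
		model_delay(b);
		retries++;
	}
	b->freed = true;
	return retries;
}
```

A node without the flag is freed immediately, while a flagged node is
freed only after the simulated flush has cleared the bit.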

Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/btree.c   | 28 +++++++++++++++++++++++++++-
 drivers/md/bcache/btree.h   |  2 ++
 drivers/md/bcache/journal.c |  7 +++++++
 3 files changed, 36 insertions(+), 1 deletion(-)

Message-Id: <20190628120000.40753-36-colyli@suse.de>
From: Coly Li <colyli@suse.de>
To: axboe@kernel.dk
Cc: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org,
    Coly Li <colyli@suse.de>, stable@vger.kernel.org
Subject: [PATCH 35/37] bcache: fix race in btree_flush_write()
Date: Fri, 28 Jun 2019 19:59:58 +0800
In-Reply-To: <20190628120000.40753-1-colyli@suse.de>
References: <20190628120000.40753-1-colyli@suse.de>