From patchwork Fri Apr 15 17:26:04 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ilya Dryomov <idryomov@gmail.com>
X-Patchwork-Id: 8853561
Return-Path: <ceph-devel-owner@kernel.org>
X-Original-To: patchwork-ceph-devel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
	by patchwork2.web.kernel.org (Postfix) with ESMTP id EF0DFBF29F
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Fri, 15 Apr 2016 17:26:41 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id CF7BD200E9
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Fri, 15 Apr 2016 17:26:40 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 898562017E
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Fri, 15 Apr 2016 17:26:39 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752671AbcDOR0h (ORCPT
	<rfc822;patchwork-ceph-devel@patchwork.kernel.org>);
	Fri, 15 Apr 2016 13:26:37 -0400
Received: from mail-qg0-f50.google.com ([209.85.192.50]:35667 "EHLO
	mail-qg0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752122AbcDOR0g (ORCPT
	<rfc822; ceph-devel@vger.kernel.org>); Fri, 15 Apr 2016 13:26:36 -0400
Received: by mail-qg0-f50.google.com with SMTP id f105so84986829qge.2
	for <ceph-devel@vger.kernel.org>;
	Fri, 15 Apr 2016 10:26:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gmail.com; s=20120113;
	h=from:to:cc:subject:date:message-id;
	bh=DyIFSuG8lbSqlBGmghrMuMh4XjZZvkHpifVSZ5akeq0=;
	b=mFolhlGnO5GYY1QdleorM3UGXdQausDcBOVWN1M8sxsgdP1GvLS8mRQFc8Mz2pqeB8
	R3a7n7MRkHbM4DT50Bqvrz3boCQr7FUE4w4OxtEb2HQrCFfUPnXCbfTLHg1e8Zk8EDN7
	Bn3XY1ogEFUdjqfEDydtua6P8SgX5kau5qOvTZ0FLSK8DvnZrdM1xtUCCxGF/qs+JylS
	3RQgc/D9ZdPv8q34LRGraGqIB2l7O894yMKxY6X+OCM5SAVnhF1PooP3/Oqlgl2CNrlE
	0Fi5dxXyIE2RL+Y6cnoW1anNiJPwmiFzZ2OGJz3zw/JtmwRZ9Xl34PLkYmQoz/LZN+O6
	t9OA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:from:to:cc:subject:date:message-id;
	bh=DyIFSuG8lbSqlBGmghrMuMh4XjZZvkHpifVSZ5akeq0=;
	b=O+Sa8AIfHh0fTrZWrl+yRG01jiE9pGrNQbVSrmgpTOapzeVtLAppuw30k/143HIIzR
	XtjVxWdpHxWf3q4gS0iHw4lP09yjj4exNQmmio6j9U1267s1dIRqR8JxklxtTcTlljJ6
	zlgwWDtN+hb9Oqhce4gRAQTmVkG0aW5YkUMNjIROc1Xm4Yt3UCPFD/vLP721F3kpzzZs
	y2UY2mQXCTulSZQHxKNvYUWUmauf/LsI9J343sFJ7HKR9bwD2Fmn2xWbXSoRGzXHynbX
	yZIebUyRLDwmXmtGopQIRvK2B8OO5P6ehsGRoU5VWJ/aAmmnJbEvxFVJb7E9QP6Xy98D
	M7Dw==
X-Gm-Message-State: 
 AOPr4FX5GPeMpSn6clV+UdrcWbWmSUEHZ5MSWyMmpIzwMP3eWxrZXE8oVg113nDrgAIVZw==
X-Received: by 10.140.253.197 with SMTP id
	y188mr26808981qhc.67.1460741195522;
	Fri, 15 Apr 2016 10:26:35 -0700 (PDT)
Received: from dhcp-1-141.brq.redhat.com (nat-pool-brq-t.redhat.com.
	[213.175.37.10]) by smtp.gmail.com with ESMTPSA id
	d64sm20750618qgd.9.2016.04.15.10.26.33
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Fri, 15 Apr 2016 10:26:34 -0700 (PDT)
From: Ilya Dryomov <idryomov@gmail.com>
To: ceph-devel@vger.kernel.org
Cc: Josh Durgin <jdurgin@redhat.com>
Subject: [PATCH] rbd: fix rbd map vs notify races
Date: Fri, 15 Apr 2016 19:26:04 +0200
Message-Id: <1460741164-22850-1-git-send-email-idryomov@gmail.com>
X-Mailer: git-send-email 2.4.3
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID: <ceph-devel.vger.kernel.org>
X-Mailing-List: ceph-devel@vger.kernel.org
X-Spam-Status: No, score=-7.8 required=5.0 tests=BAYES_00,
	DKIM_ADSP_CUSTOM_MED,
	DKIM_SIGNED, FREEMAIL_FROM, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD,
	T_DKIM_INVALID, UNPARSEABLE_RELAY autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

A while ago, commit 9875201e1049 ("rbd: fix use-after free of
rbd_dev->disk") fixed rbd unmap vs notify race by introducing
an exported wrapper for flushing notifies and sticking it into
do_rbd_remove().

A similar problem exists on the rbd map path, though: the watch is
registered in rbd_dev_image_probe(), while the disk is set up quite
a few steps later, in rbd_dev_device_setup().  Nothing prevents
a notify from coming in and crashing on a NULL rbd_dev->disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    Call Trace:
     [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd]
     [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph]
     [<ffffffff8109d5db>] process_one_work+0x17b/0x470
     [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
     [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
     [<ffffffff810a5acf>] kthread+0xcf/0xe0
     [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
     [<ffffffff81645dd8>] ret_from_fork+0x58/0x90
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
    RIP  [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd]

If an error occurs during rbd map, we have to error out, potentially
tearing down a watch.  Just like on rbd unmap, notifies have to be
flushed, otherwise rbd_watch_cb() may end up trying to read in the
image header after rbd_dev_image_release() has run:

    Assertion failure in rbd_dev_header_info() at line 4722:

     rbd_assert(rbd_image_format_valid(rbd_dev->image_format));

    Call Trace:
     [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150
     [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390
     [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290
     [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0
     [<ffffffff81107799>] process_one_work+0x689/0x1a80
     [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80
     [<ffffffff81132065>] ? finish_task_switch+0x225/0x640
     [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
     [<ffffffff81108c69>] worker_thread+0xd9/0x1320
     [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80
     [<ffffffff8111b02d>] kthread+0x21d/0x2e0
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
     [<ffffffff82022802>] ret_from_fork+0x22/0x40
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
    RIP  [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30

To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
revalidate_disk(), b) move ceph_osdc_flush_notifies() call into
rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
header read-in into a critical section.  The latter also happens to
take care of rbd map foo@bar vs rbd snap rm foo@bar race.

Fixes: http://tracker.ceph.com/issues/15490

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
---
 drivers/block/rbd.c | 43 +++++++++++++++++++------------------------
 1 file changed, 19 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 644940377954..9f1eb00b5e64 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -538,7 +538,6 @@ static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
 				u8 *order, u64 *snap_size);
 static int _rbd_dev_v2_snap_features(struct rbd_device *rbd_dev, u64 snap_id,
 		u64 *snap_features);
-static u64 rbd_snap_id_by_name(struct rbd_device *rbd_dev, const char *name);
 
 static int rbd_open(struct block_device *bdev, fmode_t mode)
 {
@@ -3127,9 +3126,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 	struct rbd_device *rbd_dev = (struct rbd_device *)data;
 	int ret;
 
-	if (!rbd_dev)
-		return;
-
 	dout("%s: \"%s\" notify_id %llu opcode %u\n", __func__,
 		rbd_dev->header_name, (unsigned long long)notify_id,
 		(unsigned int)opcode);
@@ -3263,6 +3259,9 @@ static void rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev)
 
 	ceph_osdc_cancel_event(rbd_dev->watch_event);
 	rbd_dev->watch_event = NULL;
+
+	dout("%s flushing notifies\n", __func__);
+	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
 }
 
 /*
@@ -3642,21 +3641,14 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
 static void rbd_dev_update_size(struct rbd_device *rbd_dev)
 {
 	sector_t size;
-	bool removing;
 
 	/*
-	 * Don't hold the lock while doing disk operations,
-	 * or lock ordering will conflict with the bdev mutex via:
-	 * rbd_add() -> blkdev_get() -> rbd_open()
+	 * If EXISTS is not set, rbd_dev->disk may be NULL, so don't
+	 * try to update its size.  If REMOVING is set, updating size
+	 * is just useless work since the device can't be opened.
 	 */
-	spin_lock_irq(&rbd_dev->lock);
-	removing = test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags);
-	spin_unlock_irq(&rbd_dev->lock);
-	/*
-	 * If the device is being removed, rbd_dev->disk has
-	 * been destroyed, so don't try to update its size
-	 */
-	if (!removing) {
+	if (test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags) &&
+	    !test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags)) {
 		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
 		dout("setting size to %llu sectors", (unsigned long long)size);
 		set_capacity(rbd_dev->disk, size);
@@ -5188,6 +5180,10 @@ out_err:
 	return ret;
 }
 
+/*
+ * rbd_dev->header_rwsem must be locked for write and will be unlocked
+ * upon return.
+ */
 static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
 {
 	int ret;
@@ -5196,7 +5192,7 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
 
 	ret = rbd_dev_id_get(rbd_dev);
 	if (ret)
-		return ret;
+		goto err_out_unlock;
 
 	BUILD_BUG_ON(DEV_NAME_LEN
 			< sizeof (RBD_DRV_NAME) + MAX_INT_FORMAT_WIDTH);
@@ -5237,8 +5233,9 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
 	/* Everything's ready.  Announce the disk to the world. */
 
 	set_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
-	add_disk(rbd_dev->disk);
+	up_write(&rbd_dev->header_rwsem);
 
+	add_disk(rbd_dev->disk);
 	pr_info("%s: added with size 0x%llx\n", rbd_dev->disk->disk_name,
 		(unsigned long long) rbd_dev->mapping.size);
 
@@ -5253,6 +5250,8 @@ err_out_blkdev:
 		unregister_blkdev(rbd_dev->major, rbd_dev->name);
 err_out_id:
 	rbd_dev_id_put(rbd_dev);
+err_out_unlock:
+	up_write(&rbd_dev->header_rwsem);
 	return ret;
 }
 
@@ -5443,6 +5442,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	spec = NULL;		/* rbd_dev now owns this */
 	rbd_opts = NULL;	/* rbd_dev now owns this */
 
+	down_write(&rbd_dev->header_rwsem);
 	rc = rbd_dev_image_probe(rbd_dev, 0);
 	if (rc < 0)
 		goto err_out_rbd_dev;
@@ -5472,6 +5472,7 @@ out:
 	return rc;
 
 err_out_rbd_dev:
+	up_write(&rbd_dev->header_rwsem);
 	rbd_dev_destroy(rbd_dev);
 err_out_client:
 	rbd_put_client(rbdc);
@@ -5578,12 +5579,6 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 		return ret;
 
 	rbd_dev_header_unwatch_sync(rbd_dev);
-	/*
-	 * flush remaining watch callbacks - these must be complete
-	 * before the osd_client is shutdown
-	 */
-	dout("%s: flushing notifies", __func__);
-	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
 
 	/*
 	 * Don't free anything from rbd_dev->disk until after all