From patchwork Wed Aug 1 14:56:08 2018
X-Patchwork-Submitter: Guilherme Piccoli
X-Patchwork-Id: 10553347
X-Patchwork-Delegate: snitzer@redhat.com
From: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
To: linux-raid@vger.kernel.org, shli@kernel.org
Cc: kernel@gpiccoli.net, gpiccoli@canonical.com, neilb@suse.com, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, jay.vosburgh@canonical.com
Date: Wed, 1 Aug 2018 11:56:08 -0300
Message-Id: <20180801145608.19880-2-gpiccoli@canonical.com>
In-Reply-To: <20180801145608.19880-1-gpiccoli@canonical.com>
References: <20180801145608.19880-1-gpiccoli@canonical.com>
Subject: [dm-devel] [RFC] [PATCH 1/1] md/raid0: Introduce emergency stop for raid0 arrays
List-Id: device-mapper development <dm-devel@redhat.com>
Currently the raid0 driver has no health-checking mechanism to verify that its members are fine. If a member is suddenly removed, a STOP_ARRAY ioctl must be triggered from userspace; in other words, all the logic for stopping an array relies on userspace tools like mdadm/udev. In particular, if a raid0 array is mounted, this stop procedure fails, because mdadm tries to open the md block device with the O_EXCL flag, which isn't allowed while md is mounted.

That leads to the following situation: if a raid0 member is removed while the array is mounted, a user writing to the array won't notice the errors unless they check the kernel log. No -EIO is returned, and writes (except direct I/O) appear to succeed. The user might believe the written data is safely stored in the array, when in fact garbage was written: raid0 stripes data across its members and requires all of them to be working in order not to corrupt data.

This patch proposes a change in that behavior: emergency-stop a raid0 array when one of its members is gone. The check happens when I/O is queued to the raid0 driver, which verifies that the block device it plans to read/write has a healthy queue. If the queue is not fine (dying or dead), the raid0 driver invokes an emergency removal routine that marks the md device as broken and triggers a delayed stop procedure; from that point on, raid0 refuses new BIOs, returning -EIO. The emergency stop routine also marks the md request queue as dying, as a "flag" to indicate the failure in a nested raid0 configuration (a raid0 composed of raid0 devices).
The delayed stop procedure then performs the basic stop of the md device and takes care of the case where the device holds mounted filesystems, allowing a mounted raid0 array to be stopped - which is already common for regular block devices like NVMe and SCSI. This emergency stop mechanism affects only raid0 arrays.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 drivers/md/md.c    | 126 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h    |   6 +++
 drivers/md/raid0.c |  14 +++++
 3 files changed, 136 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 994aed2f9dff..bf725ba50ff2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -94,6 +94,7 @@ EXPORT_SYMBOL(md_cluster_mod);
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
+static struct workqueue_struct *md_stop_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
@@ -455,15 +456,24 @@ static void md_end_flush(struct bio *fbio)
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&fi->flush_pending)) {
-		if (bio->bi_iter.bi_size == 0)
+		if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+			goto bail_flush;
+		} else if (bio->bi_iter.bi_size == 0) {
 			/* an empty barrier - all done */
 			bio_endio(bio);
-		else {
+		} else {
 			INIT_WORK(&fi->flush_work, submit_flushes);
 			queue_work(md_wq, &fi->flush_work);
 		}
 	}
+bail_flush:
+	/* Prevents dangling bios to crash after raid0 emergency stop */
+	if (test_bit(MD_BROKEN, &mddev->flags) && !mddev->raid_disks) {
+		bio_uninit(fbio);
+		return;
+	}
+
 	mempool_free(fb, mddev->flush_bio_pool);
 	bio_put(fbio);
 }
 
@@ -473,6 +483,11 @@ void md_flush_request(struct mddev *mddev, struct bio *bio)
 	struct md_rdev *rdev;
 	struct flush_info *fi;
 
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return;
+	}
+
 	fi = mempool_alloc(mddev->flush_pool, GFP_NOIO);
 	fi->bio = bio;
 
@@ -5211,6 +5226,17 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	return rv;
 }
 
+static void __md_free(struct mddev *mddev)
+{
+	if (mddev->gendisk)
+		del_gendisk(mddev->gendisk);
+	if (mddev->queue)
+		blk_cleanup_queue(mddev->queue);
+	if (mddev->gendisk)
+		put_disk(mddev->gendisk);
+	percpu_ref_exit(&mddev->writes_pending);
+}
+
 static void md_free(struct kobject *ko)
 {
 	struct mddev *mddev = container_of(ko, struct mddev, kobj);
@@ -5218,13 +5244,8 @@ static void md_free(struct kobject *ko)
 	if (mddev->sysfs_state)
 		sysfs_put(mddev->sysfs_state);
 
-	if (mddev->gendisk)
-		del_gendisk(mddev->gendisk);
-	if (mddev->queue)
-		blk_cleanup_queue(mddev->queue);
-	if (mddev->gendisk)
-		put_disk(mddev->gendisk);
-	percpu_ref_exit(&mddev->writes_pending);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		__md_free(mddev);
 
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
@@ -5247,7 +5268,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
-	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
 }
@@ -5987,6 +6010,75 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	return err;
 }
 
+static void mddev_delayed_stop(struct work_struct *ws)
+{
+	struct mddev *mddev = container_of(ws, struct mddev, stop_work);
+	struct gendisk *disk = mddev->gendisk;
+	struct block_device *bdev;
+	struct md_rdev *rdev;
+	int i = 0;
+
+	mutex_lock(&mddev->open_mutex);
+	del_timer_sync(&mddev->safemode_timer);
+	__md_stop(mddev);
+	rdev_for_each(rdev, mddev)
+		if (rdev->raid_disk >= 0)
+			sysfs_unlink_rdev(mddev, rdev);
+	set_capacity(disk, 0);
+	mutex_unlock(&mddev->open_mutex);
+
+	mddev->changed = 1;
+	revalidate_disk(disk);
+	export_array(mddev);
+	mddev->hold_active = 0;
+
+	/* Make sure unbind_rdev_from_array() jobs are done */
+	flush_workqueue(md_misc_wq);
+
+	do {
+		bdev = bdget_disk(disk, i++);
+		if (bdev) {
+			struct super_block *sb = bdev->bd_super;
+
+			if (sb) {
+				sb->s_flags |= SB_RDONLY;
+				sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+			}
+			bdput(bdev);
+		}
+	} while (bdev);
+
+	md_clean(mddev);
+	set_bit(MD_BROKEN, &mddev->flags); /* md_clean() will clear this too */
+	md_new_event(mddev);
+	sysfs_notify_dirent_safe(mddev->sysfs_state);
+
+	__md_free(mddev);
+	mddev->gendisk = NULL;
+}
+
+void mddev_emergency_stop(struct mddev *mddev)
+{
+	/* Prevents races if raid0 driver detects errors in multiple requests
+	 * at the same time.
+	 */
+	mutex_lock(&mddev->open_mutex);
+	if (test_and_set_bit(MD_BROKEN, &mddev->flags)) {
+		mutex_unlock(&mddev->open_mutex);
+		return;
+	}
+	mutex_unlock(&mddev->open_mutex);
+
+	/* Necessary to set md queue as dying here in case this md device is
+	 * a member of some other md array - nested removal will proceed then.
+	 */
+	blk_set_queue_dying(mddev->queue);
+
+	INIT_WORK(&mddev->stop_work, mddev_delayed_stop);
+	queue_work(md_stop_wq, &mddev->stop_work);
+}
+EXPORT_SYMBOL_GPL(mddev_emergency_stop);
+
 /* mode:
  *   0 - completely stop and dis-assemble array
  *   2 - stop but do not disassemble array
@@ -5998,6 +6090,13 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	struct md_rdev *rdev;
 	int did_freeze = 0;
 
+	/* If we already started the emergency removal procedure,
+	 * the STOP_ARRAY ioctl can race with the removal, leading to
+	 * dangling devices. Below check is to prevent this corner case.
+	 */
+	if (test_bit(MD_BROKEN, &mddev->flags))
+		return -EBUSY;
+
 	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
 		did_freeze = 1;
 		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
@@ -9122,6 +9221,10 @@ static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
+	md_stop_wq = alloc_ordered_workqueue("md_stop", 0);
+	if (!md_stop_wq)
+		goto err_stop_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -9143,6 +9246,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_stop_wq);
+err_stop_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -9402,6 +9507,7 @@ static __exit void md_exit(void)
 		 * destroy_workqueue() below will wait for that to complete.
 		 */
 	}
+	destroy_workqueue(md_stop_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2d148bdaba74..e5e7e3977b46 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -243,6 +243,9 @@ enum mddev_flags {
 	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
 				 * without explicitly holding reconfig_mutex.
 				 */
+	MD_BROKEN,		/* This is used in the emergency RAID-0 stop
+				 * in case one of its members gets removed.
+				 */
 };
 
 enum mddev_sb_flags {
@@ -411,6 +414,8 @@ struct mddev {
 	struct work_struct del_work;	/* used for delayed sysfs removal */
 
+	struct work_struct stop_work;	/* used for delayed emergency stop */
+
 	/* "lock" protects:
 	 *   flush_bio transition from NULL to !NULL
 	 *   rdev superblocks, events
@@ -713,6 +718,7 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 extern void md_handle_request(struct mddev *mddev, struct bio *bio);
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
+extern void mddev_emergency_stop(struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 			struct mddev *mddev);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ac1cffd2a09b..4f282087aba2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -555,6 +555,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
+	struct block_device *bd;
 	struct md_rdev *tmp_dev;
 	sector_t bio_sector;
 	sector_t sector;
@@ -593,6 +594,19 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 	zone = find_zone(mddev->private, &sector);
 	tmp_dev = map_sector(mddev, zone, sector, &sector);
+	bd = tmp_dev->bdev;
+
+	if (unlikely((blk_queue_dead(bd->bd_queue) ||
+		      blk_queue_dying(bd->bd_queue)) &&
+		     !test_bit(MD_BROKEN, &mddev->flags))) {
+		mddev_emergency_stop(mddev);
+	}
+
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return true;
+	}
+
 	bio_set_dev(bio, tmp_dev->bdev);
 	bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;