diff mbox

[v4,1/7] md: Make md resync and reshape threads freezable

Message ID 1506481915.2822.9.camel@wdc.com (mailing list archive)
State New, archived
Headers show

Commit Message

Bart Van Assche Sept. 27, 2017, 3:12 a.m. UTC
On Tue, 2017-09-26 at 22:59 +0800, Ming Lei wrote:
> On Tue, Sep 26, 2017 at 02:42:07PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> > > Just test this patch a bit and the following failure of freezing task
> > > is triggered during suspend: [ ... ]
> > 
> > What kernel version did you start from and which patches were applied on top of
> > that kernel? Only patch 1/7 or all seven patches? What storage configuration did
> 
> It is v4.14-rc1+, and top commit is 8d93c7a43157, with all your 7 patches
> applied.
> 
> > you use in your test and what command(s) did you use to trigger suspend?
> 
> Follows my pm test script:
> 
> 	#!/bin/sh
> 	
> 	echo check > /sys/block/md127/md/sync_action
> 	
> 	mkfs.ext4 -F /dev/md127
> 	
> 	mount /dev/md0 /mnt/data
> 	
> 	dd if=/dev/zero of=/mnt/data/d1.img bs=4k count=128k&
> 	
> 	echo 9 > /proc/sys/kernel/printk
> 	echo devices > /sys/power/pm_test
> 	echo mem > /sys/power/state
> 	
> 	wait
> 	umount /mnt/data
> 
> Storage setting:
> 
> 	sudo mdadm --create /dev/md/test /dev/sda /dev/sdb --level=1 --raid-devices=2
> 	both /dev/sda and /dev/sdb are virtio-scsi.

Thanks for the detailed reply. I have been able to reproduce the freeze failure
you reported. The output of SysRq-t learned me that the md reboot notifier was
waiting for the frozen md sync thread and that this caused the freeze failure. So
I have started testing the patch below instead of the patch at the start of this
e-mail thread:


Subject: [PATCH] md: Stop resync and reshape upon system freeze

Some people use the md driver on laptops and use the suspend and
resume functionality. Since it is essential that submitting of
new I/O requests stops before a hibernation image is created,
interrupt the md resync and reshape actions if the system is
being frozen. Note: the resync and reshape will restart after
the system is resumed and a message similar to the following
will appear in the system log:

md: md0: data-check interrupted.

---
 drivers/md/md.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

Comments

Ming Lei Sept. 27, 2017, 11 a.m. UTC | #1
On Wed, Sep 27, 2017 at 03:12:47AM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 22:59 +0800, Ming Lei wrote:
> > On Tue, Sep 26, 2017 at 02:42:07PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> > > > Just test this patch a bit and the following failure of freezing task
> > > > is triggered during suspend: [ ... ]
> > > 
> > > What kernel version did you start from and which patches were applied on top of
> > > that kernel? Only patch 1/7 or all seven patches? What storage configuration did
> > 
> > It is v4.14-rc1+, and top commit is 8d93c7a43157, with all your 7 patches
> > applied.
> > 
> > > you use in your test and what command(s) did you use to trigger suspend?
> > 
> > Follows my pm test script:
> > 
> > 	#!/bin/sh
> > 	
> > 	echo check > /sys/block/md127/md/sync_action
> > 	
> > 	mkfs.ext4 -F /dev/md127
> > 	
> > 	mount /dev/md0 /mnt/data
> > 	
> > 	dd if=/dev/zero of=/mnt/data/d1.img bs=4k count=128k&
> > 	
> > 	echo 9 > /proc/sys/kernel/printk
> > 	echo devices > /sys/power/pm_test
> > 	echo mem > /sys/power/state
> > 	
> > 	wait
> > 	umount /mnt/data
> > 
> > Storage setting:
> > 
> > 	sudo mdadm --create /dev/md/test /dev/sda /dev/sdb --level=1 --raid-devices=2
> > 	both /dev/sda and /dev/sdb are virtio-scsi.
> 
> Thanks for the detailed reply. I have been able to reproduce the freeze failure
> you reported. The output of SysRq-t learned me that the md reboot notifier was
> waiting for the frozen md sync thread and that this caused the freeze failure. So
> I have started testing the patch below instead of the patch at the start of this
> e-mail thread:

OK, thanks for figuring out the reason.

With current linus tree, SCSI I/O is prevented from being dispatch to
device during suspend by SCSI quiesce, and will be dispatched again
in resume. With Safe SCSI quiesce[1], any underlying IO request will
stop at blk_queue_enter() during suspend, and restart from there during
resume.

For other non-SCSI driver, their .suspend/.resume often
handles I/O in similar way, for example, NVMe queue will
be frozen in .suspend, and unfreeze in .resume.

So could you explain a bit which kind of bug this patch fixes?

[1] https://marc.info/?l=linux-block&m=150649136519281&w=2
Bart Van Assche Oct. 2, 2017, 3:39 p.m. UTC | #2
On Wed, 2017-09-27 at 19:00 +0800, Ming Lei wrote:
> With current linus tree, SCSI I/O is prevented from being dispatch to

> device during suspend by SCSI quiesce, and will be dispatched again

> in resume. With Safe SCSI quiesce[1], any underlying IO request will

> stop at blk_queue_enter() during suspend, and restart from there during

> resume.

> 

> For other non-SCSI driver, their .suspend/.resume often

> handles I/O in similar way, for example, NVMe queue will

> be frozen in .suspend, and unfreeze in .resume.

> 

> So could you explain a bit which kind of bug this patch fixes?


It seems like you do not fully understand the motivation behind quiescing and
resuming I/O from inside the SCSI and NVMe freeze and thaw power management
callbacks. Hibernation can only work reliably if no I/O is submitted after
creation of the hibernation image has been started. If any I/O is submitted
that is not related to the hibernation process itself by any driver after
user space processes and kernel threads have been frozen then that's a *bug*
in the component that submitted the I/O. The freezing and suspending of I/O
from inside the SCSI and NVMe drivers is a *workaround* for these bugs but is
not a full solution. Before the hibernation image is written to disk I/O is
resumed and the I/O requests that got deferred will be executed at that time.
In other words, suspending and resuming I/O from inside the SCSI and NVMe
drivers is a workaround and not a full solution. The following quote from
Documentation/power/swsusp.txt illustrates this:

 * BIG FAT WARNING *********************************************************
 *
 * If you touch anything on disk between suspend and resume...
 *				...kiss your data goodbye.

Bart.
diff mbox

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 08fcaebc61bd..1e9d50f7345e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,7 @@ 
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
+#include <linux/freezer.h>
 
 #include <trace/events/block.h>
 #include "md.h"
@@ -8103,6 +8104,12 @@  void md_allow_write(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_allow_write);
 
+static bool md_sync_interrupted(struct mddev *mddev)
+{
+	return test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
+		freezing(current);
+}
+
 #define SYNC_MARKS	10
 #define	SYNC_MARK_STEP	(3*HZ)
 #define UPDATE_FREQUENCY (5*60*HZ)
@@ -8133,6 +8140,8 @@  void md_do_sync(struct md_thread *thread)
 		return;
 	}
 
+	set_freezable();
+
 	if (mddev_is_clustered(mddev)) {
 		ret = md_cluster_ops->resync_start(mddev);
 		if (ret)
@@ -8184,7 +8193,7 @@  void md_do_sync(struct md_thread *thread)
 		mddev->curr_resync = 2;
 
 	try_again:
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			goto skip;
 		for_each_mddev(mddev2, tmp) {
 			if (mddev2 == mddev)
@@ -8208,7 +8217,7 @@  void md_do_sync(struct md_thread *thread)
 				 * be caught by 'softlockup'
 				 */
 				prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
-				if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+				if (!md_sync_interrupted(mddev) &&
 				    mddev2->curr_resync >= mddev->curr_resync) {
 					if (mddev2_minor != mddev2->md_minor) {
 						mddev2_minor = mddev2->md_minor;
@@ -8335,8 +8344,7 @@  void md_do_sync(struct md_thread *thread)
 			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 		}
 
-		while (j >= mddev->resync_max &&
-		       !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+		while (j >= mddev->resync_max && !md_sync_interrupted(mddev)) {
 			/* As this condition is controlled by user-space,
 			 * we can block indefinitely, so use '_interruptible'
 			 * to avoid triggering warnings.
@@ -8348,7 +8356,7 @@  void md_do_sync(struct md_thread *thread)
 							     &mddev->recovery));
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		sectors = mddev->pers->sync_request(mddev, j, &skipped);
@@ -8362,7 +8370,7 @@  void md_do_sync(struct md_thread *thread)
 			atomic_add(sectors, &mddev->recovery_active);
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		j += sectors;
@@ -8394,7 +8402,7 @@  void md_do_sync(struct md_thread *thread)
 			last_mark = next;
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		/*
@@ -8427,8 +8435,7 @@  void md_do_sync(struct md_thread *thread)
 		}
 	}
 	pr_info("md: %s: %s %s.\n",mdname(mddev), desc,
-		test_bit(MD_RECOVERY_INTR, &mddev->recovery)
-		? "interrupted" : "done");
+		md_sync_interrupted(mddev) ? "interrupted" : "done");
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
@@ -8436,8 +8443,7 @@  void md_do_sync(struct md_thread *thread)
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
-	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
-	    mddev->curr_resync > 3) {
+	    !md_sync_interrupted(mddev) && mddev->curr_resync > 3) {
 		mddev->curr_resync_completed = mddev->curr_resync;
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
@@ -8446,7 +8452,7 @@  void md_do_sync(struct md_thread *thread)
 	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
 	    mddev->curr_resync > 3) {
 		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			if (md_sync_interrupted(mddev)) {
 				if (mddev->curr_resync >= mddev->recovery_cp) {
 					pr_debug("md: checkpointing %s of %s.\n",
 						 desc, mdname(mddev));
@@ -8461,7 +8467,7 @@  void md_do_sync(struct md_thread *thread)
 			} else
 				mddev->recovery_cp = MaxSector;
 		} else {
-			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+			if (!md_sync_interrupted(mddev))
 				mddev->curr_resync = MaxSector;
 			rcu_read_lock();
 			rdev_for_each_rcu(rdev, mddev)
@@ -8483,7 +8489,7 @@  void md_do_sync(struct md_thread *thread)
 		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
 
 	spin_lock(&mddev->lock);
-	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+	if (!md_sync_interrupted(mddev)) {
 		/* We completed so min/max setting can be forgotten if used. */
 		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
 			mddev->resync_min = 0;