| Message ID | 1506481915.2822.9.camel@wdc.com (mailing list archive) |
| --- | --- |
| State | New, archived |
On Wed, Sep 27, 2017 at 03:12:47AM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 22:59 +0800, Ming Lei wrote:
> > On Tue, Sep 26, 2017 at 02:42:07PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> > > > I just tested this patch a bit, and the following task-freezing
> > > > failure is triggered during suspend: [ ... ]
> > >
> > > What kernel version did you start from and which patches were applied
> > > on top of that kernel? Only patch 1/7 or all seven patches? What
> > > storage configuration did
> >
> > It is v4.14-rc1+, and the top commit is 8d93c7a43157, with all your 7
> > patches applied.
> >
> > > you use in your test and what command(s) did you use to trigger suspend?
> >
> > My pm test script follows:
> >
> > #!/bin/sh
> >
> > echo check > /sys/block/md127/md/sync_action
> >
> > mkfs.ext4 -F /dev/md127
> >
> > mount /dev/md127 /mnt/data
> >
> > dd if=/dev/zero of=/mnt/data/d1.img bs=4k count=128k &
> >
> > echo 9 > /proc/sys/kernel/printk
> > echo devices > /sys/power/pm_test
> > echo mem > /sys/power/state
> >
> > wait
> > umount /mnt/data
> >
> > Storage setting:
> >
> > sudo mdadm --create /dev/md/test /dev/sda /dev/sdb --level=1 --raid-devices=2
> >
> > Both /dev/sda and /dev/sdb are virtio-scsi.
>
> Thanks for the detailed reply. I have been able to reproduce the freeze
> failure you reported. The output of SysRq-t showed me that the md reboot
> notifier was waiting for the frozen md sync thread and that this caused
> the freeze failure. So I have started testing the patch below instead of
> the patch at the start of this e-mail thread:

OK, thanks for figuring out the reason.

With the current Linus tree, SCSI I/O is prevented from being dispatched
to the device during suspend by SCSI quiesce, and is dispatched again
during resume. With safe SCSI quiesce [1], any underlying I/O request
will stop at blk_queue_enter() during suspend and restart from there
during resume.

Other, non-SCSI drivers often handle I/O in a similar way in their
.suspend/.resume callbacks; for example, NVMe queues are frozen in
.suspend and unfrozen in .resume.

So could you explain a bit which kind of bug this patch fixes?

[1] https://marc.info/?l=linux-block&m=150649136519281&w=2
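For context on the mechanism Ming describes: the gate at blk_queue_enter() lets submitters through while the queue is live and parks them while it is frozen, so suspended I/O simply restarts from the gate on resume. Below is a minimal userspace analogue of that idea, assuming only POSIX threads; `struct queue_gate` and the `gate_*` functions are invented for illustration and are not kernel API.

```c
#include <pthread.h>
#include <stdbool.h>

/* A toy stand-in for the percpu-refcount gate behind blk_queue_enter(). */
struct queue_gate {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int             in_flight;	/* submitters past the gate */
	bool            frozen;		/* true for the suspend window */
};

#define QUEUE_GATE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Submission path: park here while the gate is frozen. */
static void gate_enter(struct queue_gate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->frozen)
		pthread_cond_wait(&g->cond, &g->lock);
	g->in_flight++;
	pthread_mutex_unlock(&g->lock);
}

/* Completion path: last completion wakes a waiting freeze. */
static void gate_exit(struct queue_gate *g)
{
	pthread_mutex_lock(&g->lock);
	if (--g->in_flight == 0)
		pthread_cond_broadcast(&g->cond);
	pthread_mutex_unlock(&g->lock);
}

/* .suspend: stop new submitters, then wait for in-flight I/O to drain. */
static void gate_freeze(struct queue_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->frozen = true;
	while (g->in_flight > 0)
		pthread_cond_wait(&g->cond, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

/* .resume: let parked submitters restart from the gate. */
static void gate_thaw(struct queue_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->frozen = false;
	pthread_cond_broadcast(&g->cond);
	pthread_mutex_unlock(&g->lock);
}

int main(void)
{
	struct queue_gate g = QUEUE_GATE_INIT;

	gate_enter(&g);		/* submit one request */
	gate_exit(&g);		/* complete it */
	gate_freeze(&g);	/* suspend: later submitters would park */
	gate_thaw(&g);		/* resume: parked submitters restart */
	return 0;
}
```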
On Wed, 2017-09-27 at 19:00 +0800, Ming Lei wrote:
> With the current Linus tree, SCSI I/O is prevented from being dispatched
> to the device during suspend by SCSI quiesce, and is dispatched again
> during resume. With safe SCSI quiesce [1], any underlying I/O request
> will stop at blk_queue_enter() during suspend and restart from there
> during resume.
>
> Other, non-SCSI drivers often handle I/O in a similar way in their
> .suspend/.resume callbacks; for example, NVMe queues are frozen in
> .suspend and unfrozen in .resume.
>
> So could you explain a bit which kind of bug this patch fixes?

It seems like you do not fully understand the motivation behind quiescing
and resuming I/O from inside the SCSI and NVMe freeze and thaw power
management callbacks. Hibernation can only work reliably if no I/O is
submitted after creation of the hibernation image has been started. If
any driver submits I/O that is not related to the hibernation process
itself after user space processes and kernel threads have been frozen,
then that is a *bug* in the component that submitted the I/O. Freezing
and suspending I/O from inside the SCSI and NVMe drivers is a
*workaround* for such bugs, not a full solution: before the hibernation
image is written to disk, I/O is resumed, and the I/O requests that got
deferred are executed at that time. The following quote from
Documentation/power/swsusp.txt illustrates this:

* BIG FAT WARNING *********************************************************
*
* If you touch anything on disk between suspend and resume...
* ...kiss your data goodbye.

Bart.
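A note on the freezer contract Bart invokes for kernel threads: a kthread is skipped by the freezer unless it opts in with set_freezable(), and once opted in it must periodically reach a try_to_freeze() call, or task freezing times out and the suspend attempt fails. Below is a minimal sketch of a conforming kthread, written as an out-of-tree module; the demo_* names are made up for the example, and this is not code from the thread.

```c
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *demo_task;

static int demo_thread(void *data)
{
	set_freezable();	/* opt in: the freezer may now freeze us */

	while (!kthread_should_stop()) {
		ssleep(1);	/* stand-in for one unit of work */
		/*
		 * Freeze point: during suspend the thread parks here
		 * until thaw. A freezable thread that never reaches
		 * such a point is exactly what makes task freezing
		 * time out and suspend fail.
		 */
		try_to_freeze();
	}
	return 0;
}

static int __init demo_init(void)
{
	demo_task = kthread_run(demo_thread, NULL, "freeze-demo");
	return PTR_ERR_OR_ZERO(demo_task);
}

static void __exit demo_exit(void)
{
	kthread_stop(demo_task);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```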
```diff
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 08fcaebc61bd..1e9d50f7345e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,7 @@
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
+#include <linux/freezer.h>
 
 #include <trace/events/block.h>
 #include "md.h"
@@ -8103,6 +8104,12 @@ void md_allow_write(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_allow_write);
 
+static bool md_sync_interrupted(struct mddev *mddev)
+{
+	return test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
+		freezing(current);
+}
+
 #define SYNC_MARKS	10
 #define SYNC_MARK_STEP	(3*HZ)
 #define UPDATE_FREQUENCY (5*60*HZ)
@@ -8133,6 +8140,8 @@ void md_do_sync(struct md_thread *thread)
 		return;
 	}
 
+	set_freezable();
+
 	if (mddev_is_clustered(mddev)) {
 		ret = md_cluster_ops->resync_start(mddev);
 		if (ret)
@@ -8184,7 +8193,7 @@ void md_do_sync(struct md_thread *thread)
 	mddev->curr_resync = 2;
 
  try_again:
-	if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+	if (md_sync_interrupted(mddev))
 		goto skip;
 	for_each_mddev(mddev2, tmp) {
 		if (mddev2 == mddev)
@@ -8208,7 +8217,7 @@ void md_do_sync(struct md_thread *thread)
 				 * be caught by 'softlockup'
 				 */
 				prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
-				if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+				if (!md_sync_interrupted(mddev) &&
 				    mddev2->curr_resync >= mddev->curr_resync) {
 					if (mddev2_minor != mddev2->md_minor) {
 						mddev2_minor = mddev2->md_minor;
@@ -8335,8 +8344,7 @@ void md_do_sync(struct md_thread *thread)
 			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 		}
 
-		while (j >= mddev->resync_max &&
-		       !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+		while (j >= mddev->resync_max && !md_sync_interrupted(mddev)) {
 			/* As this condition is controlled by user-space,
 			 * we can block indefinitely, so use '_interruptible'
 			 * to avoid triggering warnings.
@@ -8348,7 +8356,7 @@ void md_do_sync(struct md_thread *thread)
 							     &mddev->recovery));
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		sectors = mddev->pers->sync_request(mddev, j, &skipped);
@@ -8362,7 +8370,7 @@ void md_do_sync(struct md_thread *thread)
 			atomic_add(sectors, &mddev->recovery_active);
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		j += sectors;
@@ -8394,7 +8402,7 @@ void md_do_sync(struct md_thread *thread)
 			last_mark = next;
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (md_sync_interrupted(mddev))
 			break;
 
 		/*
@@ -8427,8 +8435,7 @@ void md_do_sync(struct md_thread *thread)
 		}
 	}
 	pr_info("md: %s: %s %s.\n",mdname(mddev), desc,
-		test_bit(MD_RECOVERY_INTR, &mddev->recovery)
-		? "interrupted" : "done");
+		md_sync_interrupted(mddev) ? "interrupted" : "done");
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
@@ -8436,8 +8443,7 @@ void md_do_sync(struct md_thread *thread)
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
-	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
-	    mddev->curr_resync > 3) {
+	    !md_sync_interrupted(mddev) && mddev->curr_resync > 3) {
 		mddev->curr_resync_completed = mddev->curr_resync;
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
@@ -8446,7 +8452,7 @@ void md_do_sync(struct md_thread *thread)
 	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
 	    mddev->curr_resync > 3) {
 		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			if (md_sync_interrupted(mddev)) {
 				if (mddev->curr_resync >= mddev->recovery_cp) {
 					pr_debug("md: checkpointing %s of %s.\n",
 						 desc, mdname(mddev));
@@ -8461,7 +8467,7 @@ void md_do_sync(struct md_thread *thread)
 		} else
 			mddev->recovery_cp = MaxSector;
 	} else {
-		if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		if (!md_sync_interrupted(mddev))
 			mddev->curr_resync = MaxSector;
 		rcu_read_lock();
 		rdev_for_each_rcu(rdev, mddev)
@@ -8483,7 +8489,7 @@ void md_do_sync(struct md_thread *thread)
 			BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
 
 	spin_lock(&mddev->lock);
-	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+	if (!md_sync_interrupted(mddev)) {
 		/* We completed so min/max setting can be forgotten if used. */
 		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
 			mddev->resync_min = 0;
```