[RFC] PM / Freezer: Freeze filesystems along with freezing processes (was: Re: PM / hibernate xfs lock up / xfs_reclaim_inodes_ag)

Message ID 201108032315.06012.rjw@sisk.pl (mailing list archive)
State Superseded, archived

Commit Message

Rafael Wysocki Aug. 3, 2011, 9:15 p.m. UTC
On Wednesday, July 27, 2011, Nigel Cunningham wrote:
> Hi.
> 
> On 27/07/11 20:33, Christoph Hellwig wrote:
> > On Wed, Jul 27, 2011 at 11:35:13AM +0200, Rafael J. Wysocki wrote:
> >> The Pavel's objection, if I remember it correctly, was that some
> >> (or the majority of?) filesystems didn't implement the freezing operation,
> >> so they would be more vulnerable to data loss in case of a failing hibernation
> >> after this change.  However, that's better than actively causing pain to XFS
> >> users.
> > 
> > The objection never made sense and only means he never read the code.
> > freeze_super (or freeze_bdev back then) always does a sync_filesystem
> > before even checking if we have a freeze method, and sync_filesystem is
> > what we iterate over for each superblock in sync().
> 
> I've had freezing supers in TOI for a couple of years now and it has
> only ever helped. To be honest, if you have a ton of dirty pages, it
> does result in a big delay, but that's the worst of it.

OK, so below is the revived patch.

To be precise, we don't call sys_sync() from the freezer any more
(evidently, I'd removed it myself, but later forgot about that), so
the patch only adds freeze_filesystems() and thaw_filesystems().

Comments welcome.

Thanks,
Rafael

---

Freeze all filesystems during the freezing of tasks by calling
freeze_bdev() for each of them and thaw them during the thawing
of tasks with the help of thaw_bdev().

This is needed by hibernation, because some filesystems (e.g. XFS)
deadlock with the memory preallocation carried out by hibernation if
the memory pressure it causes is too heavy.

The additional benefit of this change is that, if something goes
wrong after filesystems have been frozen, they will stay in a
consistent state and journal replays won't be necessary (e.g. after
a failing suspend or resume).  In particular, this should help to
solve a long-standing issue that in some cases during resume from
hibernation the boot loader causes the journal to be replayed for the
filesystem containing the kernel image and initrd causing it to
become inconsistent with the information stored in the hibernation
image.

This change is based on earlier work by Nigel Cunningham.

Not-really-signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 fs/block_dev.c         |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h     |    6 ++++++
 kernel/power/process.c |    7 ++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

Comments

Pavel Machek Aug. 3, 2011, 5:29 p.m. UTC | #1
Hi!

> Freeze all filesystems during the freezing of tasks by calling
> freeze_bdev() for each of them and thaw them during the thawing
> of tasks with the help of thaw_bdev().
> 
> This is needed by hibernation, because some filesystems (e.g. XFS)
> deadlock with the preallocation of memory used by it if the memory
> pressure caused by it is too heavy.
> 
> The additional benefit of this change is that, if something goes
> wrong after filesystems have been frozen, they will stay in a
> consistent state and journal replays won't be necessary (e.g. after
> a failing suspend or resume).  In particular, this should help to
> solve a long-standing issue that in some cases during resume from
> hibernation the boot loader causes the journal to be replayed for the
> filesystem containing the kernel image and initrd causing it to
> become inconsistent with the information stored in the hibernation
> image.

> +/**
> + * freeze_filesystems - Force all filesystems into a consistent state.
> + */
> +void freeze_filesystems(void)
> +{
> +	struct super_block *sb;
> +
> +	lockdep_off();

Ouch. So... why do we need to silence this?

> +	/*
> +	 * Freeze in reverse order so filesystems dependant upon others are
> +	 * frozen in the right order (eg. loopback on ext3).
> +	 */
> +	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> +		if (!sb->s_root || !sb->s_bdev ||
> +		    (sb->s_frozen == SB_FREEZE_TRANS) ||
> +		    (sb->s_flags & MS_RDONLY) ||
> +		    (sb->s_flags & MS_FROZEN))
> +			continue;

Should we stop NFS from modifying remote server, too?

Plus... ext3 writes to read-only filesystems on mount; not sure if it
does it later. But RDONLY means 'user can't write to it', not 'bdev will
not be modified'. Should we freeze all?

How can 'already frozen' happen?

> +	list_for_each_entry(sb, &super_blocks, s_list)
> +		if (sb->s_flags & MS_FROZEN) {
> +			sb->s_flags &= ~MS_FROZEN;
> +			thaw_bdev(sb->s_bdev, sb);
> +		}

...because we'll unfreeze it even if we did not freeze it...

									Pavel
Rafael Wysocki Aug. 4, 2011, 9:27 a.m. UTC | #2
On Wednesday, August 03, 2011, Pavel Machek wrote:
> Hi!
> 
> > Freeze all filesystems during the freezing of tasks by calling
> > freeze_bdev() for each of them and thaw them during the thawing
> > of tasks with the help of thaw_bdev().
> > 
> > This is needed by hibernation, because some filesystems (e.g. XFS)
> > deadlock with the preallocation of memory used by it if the memory
> > pressure caused by it is too heavy.
> > 
> > The additional benefit of this change is that, if something goes
> > wrong after filesystems have been frozen, they will stay in a
> > consistent state and journal replays won't be necessary (e.g. after
> > a failing suspend or resume).  In particular, this should help to
> > solve a long-standing issue that in some cases during resume from
> > hibernation the boot loader causes the journal to be replayed for the
> > filesystem containing the kernel image and initrd causing it to
> > become inconsistent with the information stored in the hibernation
> > image.
> 
> > +/**
> > + * freeze_filesystems - Force all filesystems into a consistent state.
> > + */
> > +void freeze_filesystems(void)
> > +{
> > +	struct super_block *sb;
> > +
> > +	lockdep_off();
> 
> Ouch. So... why do we need to silence this?

So that it doesn't complain? :-)

I'll need some time to get the exact details here.

> > +	/*
> > +	 * Freeze in reverse order so filesystems dependant upon others are
> > +	 * frozen in the right order (eg. loopback on ext3).
> > +	 */
> > +	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> > +		if (!sb->s_root || !sb->s_bdev ||
> > +		    (sb->s_frozen == SB_FREEZE_TRANS) ||
> > +		    (sb->s_flags & MS_RDONLY) ||
> > +		    (sb->s_flags & MS_FROZEN))
> > +			continue;
> 
> Should we stop NFS from modifying remote server, too?

What do you mean exactly?

> Plus... ext3 writes to read-only filesystems on mount; not sure if it
> does it later. But RDONLY means 'user cant write to it' not 'bdev will
> not be modified'. Should we freeze all?
> 
> How can 'already frozen' happen?
> 
> > +	list_for_each_entry(sb, &super_blocks, s_list)
> > +		if (sb->s_flags & MS_FROZEN) {
> > +			sb->s_flags &= ~MS_FROZEN;
> > +			thaw_bdev(sb->s_bdev, sb);
> > +		}
> 
> ...because we'll unfreeze it even if we did not freeze it...

So we need not check MS_FROZEN in freeze_filesystems().  OK

Thanks,
Rafael
Rafael Wysocki Aug. 4, 2011, 10:25 p.m. UTC | #3
On Thursday, August 04, 2011, Rafael J. Wysocki wrote:
> On Wednesday, August 03, 2011, Pavel Machek wrote:
> > Hi!
> > 
> > > Freeze all filesystems during the freezing of tasks by calling
> > > freeze_bdev() for each of them and thaw them during the thawing
> > > of tasks with the help of thaw_bdev().
> > > 
> > > This is needed by hibernation, because some filesystems (e.g. XFS)
> > > deadlock with the preallocation of memory used by it if the memory
> > > pressure caused by it is too heavy.
> > > 
> > > The additional benefit of this change is that, if something goes
> > > wrong after filesystems have been frozen, they will stay in a
> > > consistent state and journal replays won't be necessary (e.g. after
> > > a failing suspend or resume).  In particular, this should help to
> > > solve a long-standing issue that in some cases during resume from
> > > hibernation the boot loader causes the journal to be replayed for the
> > > filesystem containing the kernel image and initrd causing it to
> > > become inconsistent with the information stored in the hibernation
> > > image.
> > 
> > > +/**
> > > + * freeze_filesystems - Force all filesystems into a consistent state.
> > > + */
> > > +void freeze_filesystems(void)
> > > +{
> > > +	struct super_block *sb;
> > > +
> > > +	lockdep_off();
> > 
> > Ouch. So... why do we need to silence this?
> 
> So that it doesn't complain? :-)
> 
> I'll need some time to get the exact details here.

So, this is because ext3_freeze() doesn't call
journal_unlock_updates() on success, which quite frankly looks like
a bug in ext3 to me.  At least that's different from what ext4 does
in exactly the same situation (which looks correct).

If ext3_freeze() called journal_unlock_updates() on success too and
the call to journal_unlock_updates() is removed from ext3_unfreeze(),
we wouldn't need that lockdep_off()/lockdep_on() around the loop.

I need someone with ext3/ext4 knowledge to comment here, though.

Moreover, I'm not sure if other filesystems don't do such things.

Anyway, this is just a false-positive, even with the ext3 code as is.

> > > +	/*
> > > +	 * Freeze in reverse order so filesystems dependant upon others are
> > > +	 * frozen in the right order (eg. loopback on ext3).
> > > +	 */
> > > +	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> > > +		if (!sb->s_root || !sb->s_bdev ||
> > > +		    (sb->s_frozen == SB_FREEZE_TRANS) ||
> > > +		    (sb->s_flags & MS_RDONLY) ||
> > > +		    (sb->s_flags & MS_FROZEN))
> > > +			continue;
> > 
> > Should we stop NFS from modifying remote server, too?
> 
> What do you mean exactly?
> 
> > Plus... ext3 writes to read-only filesystems on mount; not sure if it
> > does it later. But RDONLY means 'user cant write to it' not 'bdev will
> > not be modified'. Should we freeze all?
> > 
> > How can 'already frozen' happen?
> > 
> > > +	list_for_each_entry(sb, &super_blocks, s_list)
> > > +		if (sb->s_flags & MS_FROZEN) {
> > > +			sb->s_flags &= ~MS_FROZEN;
> > > +			thaw_bdev(sb->s_bdev, sb);
> > > +		}
> > 
> > ...because we'll unfreeze it even if we did not freeze it...
> 
> So we need not check MS_FROZEN in freeze_filesystems().  OK

Thanks,
Rafael
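
The ext3/ext4 difference described in this comment can be illustrated
with a toy user-space model.  This is only a sketch, not kernel code:
struct toy_journal and all toy_* helpers are hypothetical names, and
the failure path of the ext3-style variant is an assumption inferred
from the phrase "on success" above.  A non-zero barrier count stands in
for "journal updates are locked out".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a journal; barrier_count > 0 means updates are locked. */
struct toy_journal {
	int barrier_count;
};

static void toy_lock_updates(struct toy_journal *j)
{
	j->barrier_count++;
}

static void toy_unlock_updates(struct toy_journal *j)
{
	assert(j->barrier_count > 0);	/* an unbalanced unlock is a bug */
	j->barrier_count--;
}

/* ext3-style, as described above: on success the barrier stays held. */
static int toy_freeze_ext3_style(struct toy_journal *j, bool flush_ok)
{
	toy_lock_updates(j);
	if (!flush_ok) {
		toy_unlock_updates(j);	/* assumed: dropped on failure only */
		return -1;
	}
	return 0;			/* returns with the barrier held */
}

/* The matching unfreeze is what finally drops the barrier. */
static void toy_unfreeze_ext3_style(struct toy_journal *j)
{
	toy_unlock_updates(j);
}

/* ext4-style, as described above: barrier dropped before returning. */
static int toy_freeze_ext4_style(struct toy_journal *j, bool flush_ok)
{
	int err;

	toy_lock_updates(j);
	err = flush_ok ? 0 : -1;
	toy_unlock_updates(j);
	return err;
}

/* True if a freeze call returned to its caller with the barrier held. */
static bool toy_held_on_return(const struct toy_journal *j)
{
	return j->barrier_count != 0;
}
```

In this model only the ext3-style freeze returns to its caller with the
barrier still held, which is the pattern that the
lockdep_off()/lockdep_on() pair in the patch is there to hide.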

Patch

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -211,6 +211,7 @@  struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_FROZEN	(1<<25) /* bdev has been frozen */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
@@ -2047,6 +2048,8 @@  extern struct super_block *freeze_bdev(s
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
+extern void freeze_filesystems(void);
+extern void thaw_filesystems(void);
 #else
 static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
@@ -2061,6 +2064,9 @@  static inline int thaw_bdev(struct block
 {
 	return 0;
 }
+
+static inline void freeze_filesystems(void) {}
+static inline void thaw_filesystems(void) {}
 #endif
 extern int sync_filesystem(struct super_block *);
 extern const struct file_operations def_blk_fops;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -314,6 +314,49 @@  out:
 }
 EXPORT_SYMBOL(thaw_bdev);
 
+/**
+ * freeze_filesystems - Force all filesystems into a consistent state.
+ */
+void freeze_filesystems(void)
+{
+	struct super_block *sb;
+
+	lockdep_off();
+	/*
+	 * Freeze in reverse order so filesystems dependent upon others are
+	 * frozen in the right order (e.g. loopback on ext3).
+	 */
+	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+		if (!sb->s_root || !sb->s_bdev ||
+		    (sb->s_frozen == SB_FREEZE_TRANS) ||
+		    (sb->s_flags & MS_RDONLY) ||
+		    (sb->s_flags & MS_FROZEN))
+			continue;
+
+		freeze_bdev(sb->s_bdev);
+		sb->s_flags |= MS_FROZEN;
+	}
+	lockdep_on();
+}
+
+/**
+ * thaw_filesystems - Make all filesystems active again.
+ */
+void thaw_filesystems(void)
+{
+	struct super_block *sb;
+
+	lockdep_off();
+
+	list_for_each_entry(sb, &super_blocks, s_list)
+		if (sb->s_flags & MS_FROZEN) {
+			sb->s_flags &= ~MS_FROZEN;
+			thaw_bdev(sb->s_bdev, sb);
+		}
+
+	lockdep_on();
+}
+
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, blkdev_get_block, wbc);
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -12,10 +12,10 @@ 
 #include <linux/oom.h>
 #include <linux/suspend.h>
 #include <linux/module.h>
-#include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/fs.h>
 
 /* 
  * Timeout for stopping processes
@@ -147,6 +147,10 @@  int freeze_processes(void)
 		goto Exit;
 	printk("done.\n");
 
+	pr_info("Freezing filesystems ... ");
+	freeze_filesystems();
+	pr_cont("done.\n");
+
 	printk("Freezing remaining freezable tasks ... ");
 	error = try_to_freeze_tasks(false);
 	if (error)
@@ -188,6 +192,7 @@  void thaw_processes(void)
 	printk("Restarting tasks ... ");
 	thaw_workqueues();
 	thaw_tasks(true);
+	thaw_filesystems();
 	thaw_tasks(false);
 	schedule();
 	printk("done.\n");
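
The ordering used by freeze_filesystems() and thaw_filesystems() in the
patch above can be sketched in user space as follows.  This is a
simplified model under stated assumptions, not the kernel
implementation: struct toy_sb, the TOY_* flags and the toy_* functions
are hypothetical, an array stands in for the super_blocks list, and a
counter stands in for freeze_bdev()/thaw_bdev().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RDONLY 0x1u	/* models MS_RDONLY */
#define TOY_FROZEN 0x2u	/* models MS_FROZEN: set only by toy_freeze_all() */

struct toy_sb {
	const char *name;
	unsigned int flags;
	int freeze_count;	/* stands in for the real freeze/thaw calls */
};

/*
 * Walk the array backwards so that entries depending on earlier ones
 * (e.g. a loop mount on top of ext3) are frozen first.  The name of
 * each frozen entry is stored in 'order'; the return value is the
 * number of entries frozen.
 */
static size_t toy_freeze_all(struct toy_sb *sbs, size_t n, const char **order)
{
	size_t frozen = 0;

	for (size_t i = n; i-- > 0; ) {
		struct toy_sb *sb = &sbs[i];

		if (sb->flags & (TOY_RDONLY | TOY_FROZEN))
			continue;	/* skipped, as in the patch */
		sb->freeze_count++;
		sb->flags |= TOY_FROZEN;
		order[frozen++] = sb->name;
	}
	return frozen;
}

/* Walk forwards and thaw only the entries this code marked as frozen. */
static size_t toy_thaw_all(struct toy_sb *sbs, size_t n)
{
	size_t thawed = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_sb *sb = &sbs[i];

		if (!(sb->flags & TOY_FROZEN))
			continue;
		sb->flags &= ~TOY_FROZEN;
		assert(sb->freeze_count > 0);
		sb->freeze_count--;
		thawed++;
	}
	return thawed;
}
```

With entries { "ext3-root", "ro-media" (read-only), "loop-on-ext3" },
the model freezes "loop-on-ext3" before "ext3-root", never touches the
read-only entry, and thaws exactly the two entries it froze.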