diff mbox series

[RFC,4/4] vfs: add filesystem freeze/thaw callbacks for power management

Message ID 20250327140613.25178-5-James.Bottomley@HansenPartnership.com (mailing list archive)
State New
Headers show
Series vfs freeze/thaw on suspend/resume | expand

Commit Message

James Bottomley March 27, 2025, 2:06 p.m. UTC
Introduce a freeze function, which iterates superblocks in reverse
order freezing filesystems.  The indicator a filesystem is freezable
is either possessing a s_bdev or a freeze_super method.  So this can
be used in efivarfs, whether the freeze is for hibernate is also
passed in via the new FREEZE_FOR_HIBERNATE flag.

Thawing is done opposite to freezing (so superblock traversal in
regular order) and the whole thing is plumbed into power management.
The original ksys_sync() is preserved so the whole freezing step is
optional (if it fails we're no worse off than we are today) so it
doesn't inhibit suspend/hibernate if there's a failure.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h       |  5 ++++
 kernel/power/hibernate.c | 12 ++++++++
 kernel/power/suspend.c   |  4 +++
 4 files changed, 82 insertions(+)

Comments

Jan Kara March 27, 2025, 6:20 p.m. UTC | #1
On Thu 27-03-25 10:06:13, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I didn't
find anything using that flag. What do you plan to use it for? Does you
efivars usecase need it? I find passing down this detail about the caller
down to all filesystems a bit awkward. Isn't it possible to extract the
information "hibernate is ongoing" from PM subsystem?

> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;

Frankly, the global variable to propagate flags is pretty ugly... If we
really have to propagate some context into the iterator callback, rather do
it explicitly like iterate_supers() does it.

> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);

Style nit - braces around above blocks would be IMHO appropriate.

> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);
> +	pr_info("done.");
> +}

Generally this callback is not safe because it can race with filesystem
unmount and calling ->freeze_super() after the filesystem's ->put_super()
was called may have all sorts of interesting effects (freeze_super() itself
will just bail with a warning, which is better but not great either).

The cleanest way I see how to make the iteration safe is to grab active sb
reference (like grab_super() does it) for the duration of freeze_super()
calls. Another possibility would be to grab sb->s_umount rwsem exclusively
as Luis does it in his series but that requires a bit of locking surgery
and ->freeze_super() handlers make this particularly nasty these days so I
think active sb reference is going to be nicer these days.

								Honza
Christian Brauner March 28, 2025, 10:08 a.m. UTC | #2
On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h       |  5 ++++
>  kernel/power/hibernate.c | 12 ++++++++
>  kernel/power/suspend.c   |  4 +++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 76785509d906..b4b0986414b0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev)
>  	return sb;
>  }
>  
> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;

Ugh, please don't use a global flag for this.

> +
> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);
> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);
> +	pr_info("done.");
> +}
> +
> +/**
> + * filesystems_freeze - freeze callback for power management
> + *
> + * Freeze all active filesystems (in reverse superblock order)
> + */
> +void filesystems_freeze(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers_rev(filesystems_freeze_callback);
> +}
> +
> +static void filesystems_thaw_callback(struct super_block *sb)
> +{
> +	if (sb->s_op->thaw_super)
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> +				     | FREEZE_HOLDER_KERNEL
> +				     | freeze_flags);
> +	else if (sb->s_bdev)
> +		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			   | freeze_flags);
> +}
> +
> +/**
> + * filesystems_thaw - thaw callback for power management
> + *
> + * Thaw all active filesystems (in forward superblock order)
> + */
> +void filesystems_thaw(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers(filesystems_thaw_callback);

This doesn't work and I've explained in my reply to Luis how this
doesn't work and what the alternative are:

A concurrent umount() can wipe the filesystem behind your back. So you
either need an active superblock reference or you need to communicate
that the superblock is locked through the new flag I proposed (naming
irrelevant for now).

> +}
> +
>  /**
>   * fs_bdev_freeze - freeze owning filesystem of block device
>   * @bdev: block device
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cbbb704eff74..de154e9379ec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2285,6 +2286,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_FOR_HIBERNATE	= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
>  
>  int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
>  
> +void filesystems_freeze(bool for_hibernate);
> +void filesystems_thaw(bool for_hibernate);
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..fc2106e6685a 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -778,7 +778,12 @@ int hibernate(void)
>  
>  	ksys_sync_helper();
>  
> +	pr_info("about to freeze filesystems\n");
> +	filesystems_freeze(true);
> +	pr_info("filesystem freeze done\n");
> +
>  	error = freeze_processes();
> +	pr_info("process freeze done\n");
>  	if (error)
>  		goto Exit;
>  
> @@ -788,7 +793,9 @@ int hibernate(void)
>  	if (error)
>  		goto Thaw;
>  
> +	pr_info("About to create snapshot\n");
>  	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
> +	pr_info("snapshot done\n");
>  	if (error || freezer_test_done)
>  		goto Free_bitmaps;
>  
> @@ -842,6 +849,8 @@ int hibernate(void)
>  	}
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  exit:
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
> @@ -1041,6 +1052,7 @@ static int software_resume(void)
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw(true);
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 09f8397bae15..34cc5b0c408c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw(false);
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state)
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
> +		filesystems_freeze(false);
>  	}
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
> @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	if (sync_on_suspend_enabled)
> +		filesystems_thaw(false);
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> -- 
> 2.43.0
>
Christian Brauner March 28, 2025, 12:01 p.m. UTC | #3
On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h       |  5 ++++
>  kernel/power/hibernate.c | 12 ++++++++
>  kernel/power/suspend.c   |  4 +++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 76785509d906..b4b0986414b0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev)
>  	return sb;
>  }
>  
> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;
> +
> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);
> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);

Unnecessary, either the filesystem is responsible for this if it
provides its own ->freeze_super() or freeze_super() does it in
sync_filesystem.

> +	pr_info("done.");
> +}
> +
> +/**
> + * filesystems_freeze - freeze callback for power management
> + *
> + * Freeze all active filesystems (in reverse superblock order)
> + */
> +void filesystems_freeze(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers_rev(filesystems_freeze_callback);
> +}
> +
> +static void filesystems_thaw_callback(struct super_block *sb)
> +{
> +	if (sb->s_op->thaw_super)
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> +				     | FREEZE_HOLDER_KERNEL
> +				     | freeze_flags);
> +	else if (sb->s_bdev)
> +		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			   | freeze_flags);
> +}
> +
> +/**
> + * filesystems_thaw - thaw callback for power management
> + *
> + * Thaw all active filesystems (in forward superblock order)
> + */
> +void filesystems_thaw(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers(filesystems_thaw_callback);
> +}
> +
>  /**
>   * fs_bdev_freeze - freeze owning filesystem of block device
>   * @bdev: block device
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cbbb704eff74..de154e9379ec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2285,6 +2286,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_FOR_HIBERNATE	= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
>  
>  int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
>  
> +void filesystems_freeze(bool for_hibernate);
> +void filesystems_thaw(bool for_hibernate);
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..fc2106e6685a 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -778,7 +778,12 @@ int hibernate(void)
>  
>  	ksys_sync_helper();
>  
> +	pr_info("about to freeze filesystems\n");
> +	filesystems_freeze(true);
> +	pr_info("filesystem freeze done\n");
> +
>  	error = freeze_processes();
> +	pr_info("process freeze done\n");
>  	if (error)
>  		goto Exit;
>  
> @@ -788,7 +793,9 @@ int hibernate(void)
>  	if (error)
>  		goto Thaw;
>  
> +	pr_info("About to create snapshot\n");
>  	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
> +	pr_info("snapshot done\n");
>  	if (error || freezer_test_done)
>  		goto Free_bitmaps;
>  
> @@ -842,6 +849,8 @@ int hibernate(void)
>  	}
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  exit:
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
> @@ -1041,6 +1052,7 @@ static int software_resume(void)
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw(true);
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 09f8397bae15..34cc5b0c408c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw(false);
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state)
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
> +		filesystems_freeze(false);
>  	}
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
> @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	if (sync_on_suspend_enabled)
> +		filesystems_thaw(false);
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> -- 
> 2.43.0
>
James Bottomley March 28, 2025, 2:14 p.m. UTC | #4
On Fri, 2025-03-28 at 11:08 +0100, Christian Brauner wrote:
> On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
[...]
> > +
> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> > +	pr_info("done.");
> > +}
> > +
> > +/**
> > + * filesystems_freeze - freeze callback for power management
> > + *
> > + * Freeze all active filesystems (in reverse superblock order)
> > + */
> > +void filesystems_freeze(bool for_hibernate)
> > +{
> > +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> > +	__iterate_supers_rev(filesystems_freeze_callback);
> > +}
> > +
> > +static void filesystems_thaw_callback(struct super_block *sb)
> > +{
> > +	if (sb->s_op->thaw_super)
> > +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> > +				     | FREEZE_HOLDER_KERNEL
> > +				     | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		thaw_super(sb,	FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			   | freeze_flags);
> > +}
> > +
> > +/**
> > + * filesystems_thaw - thaw callback for power management
> > + *
> > + * Thaw all active filesystems (in forward superblock order)
> > + */
> > +void filesystems_thaw(bool for_hibernate)
> > +{
> > +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> > +	__iterate_supers(filesystems_thaw_callback);
> 
> This doesn't work and I've explained in my reply to Luis how this
> doesn't work and what the alternative are:
> 
> A concurrent umount() can wipe the filesystem behind your back. So
> you either need an active superblock reference or you need to
> communicate that the superblock is locked through the new flag I
> proposed (naming irrelevant for now).

Since this is a hybrid thread between power management and VFS, could I
just summarize what I think the various superblock locks are before
discussing the actual problem (important because the previous threads
always gave the impression of petering out for fear of vfs locking).

s_count: outermost of the superblock locks refcounting the superblock
structure itself, making no guarantee that any of the underlying
filesystem superblock structures are attached (i.e. kill_sb() may have
been called).  Taken by incrementing under the global sb_lock and
decremented using a put_super() variant.

s_active: an atomic reference counting the underlying filesystem
specific superblock structures.  if you hold s_active, kill_sb cannot
be called.  Acquired by atomic_inc_not_zero() with a possible failure
if it is zero and released by deactivate_super() and its variants.

s_umount: rwsem and innermost of the superblock locks. Used to protect
various operations from races.  Taken exclusively with down_write and
shared with down_read. Private functions internal to super.c wrap this
with grab_super and super_lock_shared/excl() wrappers.

The explicit freeze/thaw_super() functions require the s_umount rwsem
in down_write or exclusive mode and take it as the first step in their
operation.  Looking at the locking in fs_bdev_freeze/thaw() implies
that the super_operations freeze_super/thaw_super *don't* need this
taken (presumably they handle it internally).

Regards,

James
James Bottomley March 28, 2025, 2:21 p.m. UTC | #5
On Thu, 2025-03-27 at 19:20 +0100, Jan Kara wrote:
> On Thu 27-03-25 10:06:13, James Bottomley wrote:
> > Introduce a freeze function, which iterates superblocks in reverse
> > order freezing filesystems.  The indicator a filesystem is
> > freezable is either possessing a s_bdev or a freeze_super method. 
> > So this can be used in efivarfs, whether the freeze is for
> > hibernate is also passed in via the new FREEZE_FOR_HIBERNATE flag.
> > 
> > Thawing is done opposite to freezing (so superblock traversal in
> > regular order) and the whole thing is plumbed into power
> > management.
> > The original ksys_sync() is preserved so the whole freezing step is
> > optional (if it fails we're no worse off than we are today) so it
> > doesn't inhibit suspend/hibernate if there's a failure.
> > 
> > Signed-off-by: James Bottomley
> > <James.Bottomley@HansenPartnership.com>
> 
> OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I
> didn't find anything using that flag. What do you plan to use it for?
> Does you efivars usecase need it? I find passing down this detail
> about the caller down to all filesystems a bit awkward. Isn't it
> possible to extract the information "hibernate is ongoing" from PM
> subsystem?

That's right.  I'm happy to post my patch below, but it depends on Al
accepting the simple_next_child() proposal, so it doesn't apply to any
tree.

> > +/*
> > + * Kernel freezing and thawing is only done in the power
> > management
> > + * subsystem and is thus single threaded (so we don't have to
> > worry
> > + * here about multiple calls to filesystems_freeze/thaw().
> > + */
> > +
> > +static int freeze_flags;
> 
> Frankly, the global variable to propagate flags is pretty ugly... If
> we really have to propagate some context into the iterator callback,
> rather do it explicitly like iterate_supers() does it.

Christian said the same thing.  I can do it, but if you look in the
power management subsystem, it relies on single threading and has a lot
of global variables like this, so I thought of this as a
simplification.

> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> 
> Style nit - braces around above blocks would be IMHO appropriate.
> 
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> > +	pr_info("done.");
> > +}
> 
> Generally this callback is not safe because it can race with
> filesystem unmount and calling ->freeze_super() after the
> filesystem's ->put_super() was called may have all sorts of
> interesting effects (freeze_super() itself will just bail with a
> warning, which is better but not great either).
> 
> The cleanest way I see how to make the iteration safe is to grab
> active sb reference (like grab_super() does it) for the duration of
> freeze_super() calls. Another possibility would be to grab sb-
> >s_umount rwsem exclusively as Luis does it in his series but that
> requires a bit of locking surgery and ->freeze_super() handlers make
> this particularly nasty these days so I think active sb reference is
> going to be nicer these days.

Before getting into the how of this, could you just confirm my
understanding of what the various locks do:

https://lore.kernel.org/cd5c3d8aab9c5fb37fa018cb3302ecf7d2bdb140.camel@HansenPartnership.com

Is correct?

Regards,

James
James Bottomley March 28, 2025, 2:36 p.m. UTC | #6
On Fri, 2025-03-28 at 10:21 -0400, James Bottomley wrote:
> [...]
> That's right.  I'm happy to post my patch below, but it depends on Al
> accepting the simple_next_child() proposal, so it doesn't apply to
> any
> tree.

And here's the patch I forgot to attach.

Regards,

James

---

diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h
index ac6a1dd0a6a5..f913b6824289 100644
--- a/fs/efivarfs/internal.h
+++ b/fs/efivarfs/internal.h
@@ -17,7 +17,6 @@ struct efivarfs_fs_info {
 	struct efivarfs_mount_opts mount_opts;
 	struct super_block *sb;
 	struct notifier_block nb;
-	struct notifier_block pm_nb;
 };
 
 struct efi_variable {
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 7d47f8d7ad1d..3398ec5c60b3 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -119,12 +119,15 @@ static int efivarfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return 0;
 }
+
+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who);
 static const struct super_operations efivarfs_ops = {
 	.statfs = efivarfs_statfs,
 	.drop_inode = generic_delete_inode,
 	.alloc_inode = efivarfs_alloc_inode,
 	.free_inode = efivarfs_free_inode,
 	.show_options = efivarfs_show_options,
+	.thaw_super = efivarfs_thaw,
 };
 
 /*
@@ -367,8 +370,6 @@ static int efivarfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (err)
 		return err;
 
-	register_pm_notifier(&sfi->pm_nb);
-
 	return efivar_init(efivarfs_callback, sb, true);
 }
 
@@ -432,24 +433,17 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
-static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
-			      void *ptr)
+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who)
 {
-	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
-						    pm_nb);
 	static bool rescan_done = true;
-	struct dentry *parent = sfi->sb->s_root;
+	struct dentry *parent = sb->s_root;
 	struct dentry *child = NULL;
 
-	if (action == PM_HIBERNATION_PREPARE) {
-		rescan_done = false;
-		return NOTIFY_OK;
-	} else if (action != PM_POST_HIBERNATION) {
-		return NOTIFY_DONE;
-	}
+	if ((who & FREEZE_FOR_HIBERNATE) == 0)
+		return 0;
 
 	if (rescan_done)
-		return NOTIFY_DONE;
+		return 0;
 
 	pr_info("efivarfs: resyncing variable state\n");
 
@@ -488,9 +482,9 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 	 * then loop over variables, creating them if there's no matching
 	 * dentry
 	 */
-	efivar_init(efivarfs_check_missing, sfi->sb, false);
+	efivar_init(efivarfs_check_missing, sb, false);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
 static int efivarfs_init_fs_context(struct fs_context *fc)
@@ -510,9 +504,6 @@ static int efivarfs_init_fs_context(struct fs_context *fc)
 	fc->s_fs_info = sfi;
 	fc->ops = &efivarfs_context_ops;
 
-	sfi->pm_nb.notifier_call = efivarfs_pm_notify;
-	sfi->pm_nb.priority = 0;
-
 	return 0;
 }
 
@@ -521,7 +512,6 @@ static void efivarfs_kill_sb(struct super_block *sb)
 	struct efivarfs_fs_info *sfi = sb->s_fs_info;
 
 	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb);
-	unregister_pm_notifier(&sfi->pm_nb);
 	kill_litter_super(sb);
 
 	kfree(sfi);
James Bottomley March 28, 2025, 2:40 p.m. UTC | #7
On Fri, 2025-03-28 at 13:01 +0100, Christian Brauner wrote:
> On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
[...]
> > diff --git a/fs/super.c b/fs/super.c
> > index 76785509d906..b4b0986414b0 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1461,6 +1461,67 @@ static struct super_block
> > *get_bdev_super(struct block_device *bdev)
> >  	return sb;
> >  }
> >  
> > +/*
> > + * Kernel freezing and thawing is only done in the power
> > management
> > + * subsystem and is thus single threaded (so we don't have to
> > worry
> > + * here about multiple calls to filesystems_freeze/thaw().
> > + */
> > +
> > +static int freeze_flags;
> > +
> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> 
> Unnecessary, either the filesystem is responsible for this if it
> provides its own ->freeze_super() or freeze_super() does it in
> sync_filesystem.

I simply copied it from super.c:fs_bdev_freeze(), so is the
sync_blockdev() in there unnecessary as well?

Regards,

James
Christian Brauner March 28, 2025, 3:52 p.m. UTC | #8
> Since this is a hybrid thread between power management and VFS, could I
> just summarize what I think the various superblock locks are before
> discussing the actual problem (important because the previous threads
> always gave the impression of petering out for fear of vfs locking).
> 
> s_count: outermost of the superblock locks refcounting the superblock
> structure itself, making no guarantee that any of the underlying
> filesystem superblock structures are attached (i.e. kill_sb() may have
> been called).  Taken by incrementing under the global sb_lock and
> decremented using a put_super() variant.

and protects the presence of the superblock on the global super lists.

> 
> s_active: an atomic reference counting the underlying filesystem
> specific superblock structures.  if you hold s_active, kill_sb cannot
> be called.  Acquired by atomic_inc_not_zero() with a possible failure
> if it is zero and released by deactivate_super() and its variants.

or deactivate_locked_super() depending on whether s_umount is held or
not.

> 
> s_umount: rwsem and innermost of the superblock locks. Used to protect

No, it's not innermost. super_lock is a spinlock and obviously doesn't
nest with the semaphore. It's almost always the outmost lock for what
we're discussing here. Even is the outermost lock with most block device
locks.

It's also intimately tied into mount code and has implications for the
dcache and icache. That's all orthogonal to this thread.

> various operations from races.  Taken exclusively with down_write and
> shared with down_read. Private functions internal to super.c wrap this
> with grab_super and super_lock_shared/excl() wrappers.

See also the Documentation/filesystems/lock I added.

> 
> The explicit freeze/thaw_super() functions require the s_umount rwsem
> in down_write or exclusive mode and take it as the first step in their
> operation.  Looking at the locking in fs_bdev_freeze/thaw() implies
> that the super_operations freeze_super/thaw_super *don't* need this
> taken (presumably they handle it internally).

Block device locking cannot acquire the s_umount as that would cause
lock inversion with the block device open_mutex. The locking scheme
using sb_lock and the holder mutex allow safely acquiring the
superblock. It's orthogonal to what you're doing though.
James Bottomley March 28, 2025, 4:15 p.m. UTC | #9
On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote:
[...]
> 
> > various operations from races.  Taken exclusively with down_write
> > and shared with down_read. Private functions internal to super.c
> > wrap this with grab_super and super_lock_shared/excl() wrappers.
> 
> See also the Documentation/filesystems/lock I added.

you mean locking.rst which covers s_umount?  It would be nice to add
the others as well.

> > The explicit freeze/thaw_super() functions require the s_umount
> > rwsem in down_write or exclusive mode and take it as the first step
> > in their operation.  Looking at the locking in
> > fs_bdev_freeze/thaw() implies that the super_operations
> > freeze_super/thaw_super *don't* need this taken (presumably they
> > handle it internally).
> 
> Block device locking cannot acquire the s_umount as that would cause
> lock inversion with the block device open_mutex. The locking scheme
> using sb_lock and the holder mutex allow safely acquiring the
> superblock. It's orthogonal to what you're doing though.

OK, but based on the above and the fact that the code has to call
either the super op freeze/thaw_super or the global call, I think this
can be handled in the callback as something like rather than trying to
thread an exclusive s_umount:

static void filesystems_thaw_callback(struct super_block *sb)
{
	if (unlikely(!atomic_inc_not_zero(&sb->s_active)))
		return;

	if (sb->s_op->thaw_super)
		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
				     | FREEZE_HOLDER_KERNEL
				     | freeze_flags);
	else if (sb->s_bdev)
		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
			   | freeze_flags);

	deactivate_super(sb);
}


Regards,

James
Christian Brauner March 29, 2025, 8:23 a.m. UTC | #10
On Fri, Mar 28, 2025 at 12:15:55PM -0400, James Bottomley wrote:
> On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote:
> [...]
> > 
> > > various operations from races.  Taken exclusively with down_write
> > > and shared with down_read. Private functions internal to super.c
> > > wrap this with grab_super and super_lock_shared/excl() wrappers.
> > 
> > See also the Documentation/filesystems/lock I added.
> 
> you mean locking.rst which covers s_umount?  It would be nice to add
> the others as well.
> 
> > > The explicit freeze/thaw_super() functions require the s_umount
> > > rwsem in down_write or exclusive mode and take it as the first step
> > > in their operation.  Looking at the locking in
> > > fs_bdev_freeze/thaw() implies that the super_operations
> > > freeze_super/thaw_super *don't* need this taken (presumably they
> > > handle it internally).
> > 
> > Block device locking cannot acquire the s_umount as that would cause
> > lock inversion with the block device open_mutex. The locking scheme
> > using sb_lock and the holder mutex allow safely acquiring the
> > superblock. It's orthogonal to what you're doing though.
> 
> OK, but based on the above and the fact that the code has to call
> either the super op freeze/thaw_super or the global call, I think this
> can be handled in the callback as something like rather than trying to
> thread an exclusive s_umount:

Eww, no. We're not going to open-code that in two different places.

> static void filesystems_thaw_callback(struct super_block *sb)
> {
> 	if (unlikely(!atomic_inc_not_zero(&sb->s_active)))
> 		return;
> 
> 	if (sb->s_op->thaw_super)
> 		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> 				     | FREEZE_HOLDER_KERNEL
> 				     | freeze_flags);
> 	else if (sb->s_bdev)
> 		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> 			   | freeze_flags);
> 
> 	deactivate_super(sb);
> }

This is broken. The freeze/thaw functions cannot be called with s_umount
held otherwise they deadlock. And not holding s_umount while taking an
active reference count isn't supported as we're optimistically dropping
reference counts. We're not introducing exceptions to that scheme for no
good reason.

The other option is to move everything into the caller and bring back
get_active_super() and then add SUPER_ITER_UNLOCKED instead of
SUPER_ITER_GRAB. That's what I've done now.
diff mbox series

Patch

diff --git a/fs/super.c b/fs/super.c
index 76785509d906..b4b0986414b0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1461,6 +1461,67 @@  static struct super_block *get_bdev_super(struct block_device *bdev)
 	return sb;
 }
 
+/*
+ * Kernel freezing and thawing is only done in the power management
+ * subsystem and is thus single threaded (so we don't have to worry
+ * here about multiple calls to filesystems_freeze/thaw().
+ */
+
+static int freeze_flags;
+
+static void filesystems_freeze_callback(struct super_block *sb)
+{
+	/* errors don't fail suspend so ignore them */
+	if (sb->s_op->freeze_super)
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
+				       | FREEZE_HOLDER_KERNEL
+				       | freeze_flags);
+	else if (sb->s_bdev)
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
+			     | freeze_flags);
+	else {
+		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
+		return;
+	}
+
+	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
+	sync_blockdev(sb->s_bdev);
+	pr_info("done.");
+}
+
+/**
+ * filesystems_freeze - freeze callback for power management
+ *
+ * Freeze all active filesystems (in reverse superblock order)
+ */
+void filesystems_freeze(bool for_hibernate)
+{
+	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
+	__iterate_supers_rev(filesystems_freeze_callback);
+}
+
+static void filesystems_thaw_callback(struct super_block *sb)
+{
+	if (sb->s_op->thaw_super)
+		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
+				     | FREEZE_HOLDER_KERNEL
+				     | freeze_flags);
+	else if (sb->s_bdev)
+		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
+			   | freeze_flags);
+}
+
+/**
+ * filesystems_thaw - thaw callback for power management
+ *
+ * Thaw all active filesystems (in forward superblock order)
+ */
+void filesystems_thaw(bool for_hibernate)
+{
+	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
+	__iterate_supers(filesystems_thaw_callback);
+}
+
 /**
  * fs_bdev_freeze - freeze owning filesystem of block device
  * @bdev: block device
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cbbb704eff74..de154e9379ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2272,6 +2272,7 @@  extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.
@@ -2285,6 +2286,7 @@  enum freeze_holder {
 	FREEZE_HOLDER_KERNEL	= (1U << 0),
 	FREEZE_HOLDER_USERSPACE	= (1U << 1),
 	FREEZE_MAY_NEST		= (1U << 2),
+	FREEZE_FOR_HIBERNATE	= (1U << 3),
 };
 
 struct super_operations {
@@ -3919,4 +3921,7 @@  static inline bool vfs_empty_path(int dfd, const char __user *path)
 
 int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
 
+void filesystems_freeze(bool for_hibernate);
+void filesystems_thaw(bool for_hibernate);
+
 #endif /* _LINUX_FS_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 10a01af63a80..fc2106e6685a 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -778,7 +778,12 @@  int hibernate(void)
 
 	ksys_sync_helper();
 
+	pr_info("about to freeze filesystems\n");
+	filesystems_freeze(true);
+	pr_info("filesystem freeze done\n");
+
 	error = freeze_processes();
+	pr_info("process freeze done\n");
 	if (error)
 		goto Exit;
 
@@ -788,7 +793,9 @@  int hibernate(void)
 	if (error)
 		goto Thaw;
 
+	pr_info("About to create snapshot\n");
 	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
+	pr_info("snapshot done\n");
 	if (error || freezer_test_done)
 		goto Free_bitmaps;
 
@@ -842,6 +849,8 @@  int hibernate(void)
 	}
 	thaw_processes();
 
+	filesystems_thaw(true);
+
 	/* Don't bother checking whether freezer_test_done is true */
 	freezer_test_done = false;
  Exit:
@@ -939,6 +948,8 @@  int hibernate_quiet_exec(int (*func)(void *data), void *data)
 
 	thaw_processes();
 
+	filesystems_thaw(true);
+
 exit:
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
 
@@ -1041,6 +1052,7 @@  static int software_resume(void)
 
 	error = load_image_and_restore();
 	thaw_processes();
+	filesystems_thaw(true);
  Finish:
 	pm_notifier_call_chain(PM_POST_RESTORE);
  Restore:
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 09f8397bae15..34cc5b0c408c 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -544,6 +544,7 @@  int suspend_devices_and_enter(suspend_state_t state)
 static void suspend_finish(void)
 {
 	suspend_thaw_processes();
+	filesystems_thaw(false);
 	pm_notifier_call_chain(PM_POST_SUSPEND);
 	pm_restore_console();
 }
@@ -581,6 +582,7 @@  static int enter_state(suspend_state_t state)
 		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
 		ksys_sync_helper();
 		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
+		filesystems_freeze(false);
 	}
 
 	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
@@ -603,6 +605,8 @@  static int enter_state(suspend_state_t state)
 	pm_pr_dbg("Finishing wakeup.\n");
 	suspend_finish();
  Unlock:
+	if (sync_on_suspend_enabled)
+		filesystems_thaw(false);
 	mutex_unlock(&system_transition_mutex);
 	return error;
 }