Message ID | 20250327140613.25178-5-James.Bottomley@HansenPartnership.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | vfs freeze/thaw on suspend/resume | expand |
On Thu 27-03-25 10:06:13, James Bottomley wrote: > Introduce a freeze function, which iterates superblocks in reverse > order freezing filesystems. The indicator a filesystem is freezable > is either possessing a s_bdev or a freeze_super method. So this can > be used in efivarfs, whether the freeze is for hibernate is also > passed in via the new FREEZE_FOR_HIBERNATE flag. > > Thawing is done opposite to freezing (so superblock traversal in > regular order) and the whole thing is plumbed into power management. > The original ksys_sync() is preserved so the whole freezing step is > optional (if it fails we're no worse off than we are today) so it > doesn't inhibit suspend/hibernate if there's a failure. > > Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I didn't find anything using that flag. What do you plan to use it for? Does you efivars usecase need it? I find passing down this detail about the caller down to all filesystems a bit awkward. Isn't it possible to extract the information "hibernate is ongoing" from PM subsystem? > +/* > + * Kernel freezing and thawing is only done in the power management > + * subsystem and is thus single threaded (so we don't have to worry > + * here about multiple calls to filesystems_freeze/thaw(). > + */ > + > +static int freeze_flags; Frankly, the global variable to propagate flags is pretty ugly... If we really have to propagate some context into the iterator callback, rather do it explicitly like iterate_supers() does it. > +static void filesystems_freeze_callback(struct super_block *sb) > +{ > + /* errors don't fail suspend so ignore them */ > + if (sb->s_op->freeze_super) > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > + | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else if (sb->s_bdev) > + freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > + | freeze_flags); Style nit - braces around above blocks would be IMHO appropriate. > + else { > + pr_info("Ignoring filesystem %s\n", sb->s_type->name); > + return; > + } > + > + pr_info("frozen %s, now syncing block ...", sb->s_type->name); > + sync_blockdev(sb->s_bdev); > + pr_info("done."); > +} Generally this callback is not safe because it can race with filesystem unmount and calling ->freeze_super() after the filesystem's ->put_super() was called may have all sorts of interesting effects (freeze_super() itself will just bail with a warning, which is better but not great either). The cleanest way I see how to make the iteration safe is to grab active sb reference (like grab_super() does it) for the duration of freeze_super() calls. Another possibility would be to grab sb->s_umount rwsem exclusively as Luis does it in his series but that requires a bit of locking surgery and ->freeze_super() handlers make this particularly nasty these days so I think active sb reference is going to be nicer these days. Honza
On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote: > Introduce a freeze function, which iterates superblocks in reverse > order freezing filesystems. The indicator a filesystem is freezable > is either possessing a s_bdev or a freeze_super method. So this can > be used in efivarfs, whether the freeze is for hibernate is also > passed in via the new FREEZE_FOR_HIBERNATE flag. > > Thawing is done opposite to freezing (so superblock traversal in > regular order) and the whole thing is plumbed into power management. > The original ksys_sync() is preserved so the whole freezing step is > optional (if it fails we're no worse off than we are today) so it > doesn't inhibit suspend/hibernate if there's a failure. > > Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> > --- > fs/super.c | 61 ++++++++++++++++++++++++++++++++++++++++ > include/linux/fs.h | 5 ++++ > kernel/power/hibernate.c | 12 ++++++++ > kernel/power/suspend.c | 4 +++ > 4 files changed, 82 insertions(+) > > diff --git a/fs/super.c b/fs/super.c > index 76785509d906..b4b0986414b0 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev) > return sb; > } > > +/* > + * Kernel freezing and thawing is only done in the power management > + * subsystem and is thus single threaded (so we don't have to worry > + * here about multiple calls to filesystems_freeze/thaw(). > + */ > + > +static int freeze_flags; Ugh, please don't use a global flag for this. > + > +static void filesystems_freeze_callback(struct super_block *sb) > +{ > + /* errors don't fail suspend so ignore them */ > + if (sb->s_op->freeze_super) > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > + | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else if (sb->s_bdev) > + freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else { > + pr_info("Ignoring filesystem %s\n", sb->s_type->name); > + return; > + } > + > + pr_info("frozen %s, now syncing block ...", sb->s_type->name); > + sync_blockdev(sb->s_bdev); > + pr_info("done."); > +} > + > +/** > + * filesystems_freeze - freeze callback for power management > + * > + * Freeze all active filesystems (in reverse superblock order) > + */ > +void filesystems_freeze(bool for_hibernate) > +{ > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > + __iterate_supers_rev(filesystems_freeze_callback); > +} > + > +static void filesystems_thaw_callback(struct super_block *sb) > +{ > + if (sb->s_op->thaw_super) > + sb->s_op->thaw_super(sb, FREEZE_MAY_NEST > + | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else if (sb->s_bdev) > + thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > + | freeze_flags); > +} > + > +/** > + * filesystems_thaw - thaw callback for power management > + * > + * Thaw all active filesystems (in forward superblock order) > + */ > +void filesystems_thaw(bool for_hibernate) > +{ > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > + __iterate_supers(filesystems_thaw_callback); This doesn't work and I've explained in my reply to Luis how this doesn't work and what the alternative are: A concurrent umount() can wipe the filesystem behind your back. So you either need an active superblock reference or you need to communicate that the superblock is locked through the new flag I proposed (naming irrelevant for now). > +} > + > /** > * fs_bdev_freeze - freeze owning filesystem of block device > * @bdev: block device > diff --git a/include/linux/fs.h b/include/linux/fs.h > index cbbb704eff74..de154e9379ec 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem > * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem > * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed > + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate > * > * Indicate who the owner of the freeze or thaw request is and whether > * the freeze needs to be exclusive or can nest. > @@ -2285,6 +2286,7 @@ enum freeze_holder { > FREEZE_HOLDER_KERNEL = (1U << 0), > FREEZE_HOLDER_USERSPACE = (1U << 1), > FREEZE_MAY_NEST = (1U << 2), > + FREEZE_FOR_HIBERNATE = (1U << 3), > }; > > struct super_operations { > @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path) > > int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); > > +void filesystems_freeze(bool for_hibernate); > +void filesystems_thaw(bool for_hibernate); > + > #endif /* _LINUX_FS_H */ > diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c > index 10a01af63a80..fc2106e6685a 100644 > --- a/kernel/power/hibernate.c > +++ b/kernel/power/hibernate.c > @@ -778,7 +778,12 @@ int hibernate(void) > > ksys_sync_helper(); > > + pr_info("about to freeze filesystems\n"); > + filesystems_freeze(true); > + pr_info("filesystem freeze done\n"); > + > error = freeze_processes(); > + pr_info("process freeze done\n"); > if (error) > goto Exit; > > @@ -788,7 +793,9 @@ int hibernate(void) > if (error) > goto Thaw; > > + pr_info("About to create snapshot\n"); > error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM); > + pr_info("snapshot done\n"); > if (error || freezer_test_done) > goto Free_bitmaps; > > @@ -842,6 +849,8 @@ int hibernate(void) > } > thaw_processes(); > > + filesystems_thaw(true); > + > /* Don't bother checking whether freezer_test_done is true */ > freezer_test_done = false; > Exit: > @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data) > > thaw_processes(); > > + filesystems_thaw(true); > + > exit: > pm_notifier_call_chain(PM_POST_HIBERNATION); > > @@ -1041,6 +1052,7 @@ static int software_resume(void) > > error = load_image_and_restore(); > thaw_processes(); > + filesystems_thaw(true); > Finish: > pm_notifier_call_chain(PM_POST_RESTORE); > Restore: > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c > index 09f8397bae15..34cc5b0c408c 100644 > --- a/kernel/power/suspend.c > +++ b/kernel/power/suspend.c > @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state) > static void suspend_finish(void) > { > suspend_thaw_processes(); > + filesystems_thaw(false); > pm_notifier_call_chain(PM_POST_SUSPEND); > pm_restore_console(); > } > @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state) > trace_suspend_resume(TPS("sync_filesystems"), 0, true); > ksys_sync_helper(); > trace_suspend_resume(TPS("sync_filesystems"), 0, false); > + filesystems_freeze(false); > } > > pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]); > @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state) > pm_pr_dbg("Finishing wakeup.\n"); > suspend_finish(); > Unlock: > + if (sync_on_suspend_enabled) > + filesystems_thaw(false); > mutex_unlock(&system_transition_mutex); > return error; > } > -- > 2.43.0 >
On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote: > Introduce a freeze function, which iterates superblocks in reverse > order freezing filesystems. The indicator a filesystem is freezable > is either possessing a s_bdev or a freeze_super method. So this can > be used in efivarfs, whether the freeze is for hibernate is also > passed in via the new FREEZE_FOR_HIBERNATE flag. > > Thawing is done opposite to freezing (so superblock traversal in > regular order) and the whole thing is plumbed into power management. > The original ksys_sync() is preserved so the whole freezing step is > optional (if it fails we're no worse off than we are today) so it > doesn't inhibit suspend/hibernate if there's a failure. > > Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> > --- > fs/super.c | 61 ++++++++++++++++++++++++++++++++++++++++ > include/linux/fs.h | 5 ++++ > kernel/power/hibernate.c | 12 ++++++++ > kernel/power/suspend.c | 4 +++ > 4 files changed, 82 insertions(+) > > diff --git a/fs/super.c b/fs/super.c > index 76785509d906..b4b0986414b0 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev) > return sb; > } > > +/* > + * Kernel freezing and thawing is only done in the power management > + * subsystem and is thus single threaded (so we don't have to worry > + * here about multiple calls to filesystems_freeze/thaw(). > + */ > + > +static int freeze_flags; > + > +static void filesystems_freeze_callback(struct super_block *sb) > +{ > + /* errors don't fail suspend so ignore them */ > + if (sb->s_op->freeze_super) > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > + | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else if (sb->s_bdev) > + freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else { > + pr_info("Ignoring filesystem %s\n", sb->s_type->name); > + return; > + } > + > + pr_info("frozen %s, now syncing block ...", sb->s_type->name); > + sync_blockdev(sb->s_bdev); Unnecessary, either the filesystem is responsible for this if it provides its own ->freeze_super() or freeze_super() does it in sync_filesystem. > + pr_info("done."); > +} > + > +/** > + * filesystems_freeze - freeze callback for power management > + * > + * Freeze all active filesystems (in reverse superblock order) > + */ > +void filesystems_freeze(bool for_hibernate) > +{ > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > + __iterate_supers_rev(filesystems_freeze_callback); > +} > + > +static void filesystems_thaw_callback(struct super_block *sb) > +{ > + if (sb->s_op->thaw_super) > + sb->s_op->thaw_super(sb, FREEZE_MAY_NEST > + | FREEZE_HOLDER_KERNEL > + | freeze_flags); > + else if (sb->s_bdev) > + thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > + | freeze_flags); > +} > + > +/** > + * filesystems_thaw - thaw callback for power management > + * > + * Thaw all active filesystems (in forward superblock order) > + */ > +void filesystems_thaw(bool for_hibernate) > +{ > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > + __iterate_supers(filesystems_thaw_callback); > +} > + > /** > * fs_bdev_freeze - freeze owning filesystem of block device > * @bdev: block device > diff --git a/include/linux/fs.h b/include/linux/fs.h > index cbbb704eff74..de154e9379ec 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem > * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem > * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed > + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate > * > * Indicate who the owner of the freeze or thaw request is and whether > * the freeze needs to be exclusive or can nest. > @@ -2285,6 +2286,7 @@ enum freeze_holder { > FREEZE_HOLDER_KERNEL = (1U << 0), > FREEZE_HOLDER_USERSPACE = (1U << 1), > FREEZE_MAY_NEST = (1U << 2), > + FREEZE_FOR_HIBERNATE = (1U << 3), > }; > > struct super_operations { > @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path) > > int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); > > +void filesystems_freeze(bool for_hibernate); > +void filesystems_thaw(bool for_hibernate); > + > #endif /* _LINUX_FS_H */ > diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c > index 10a01af63a80..fc2106e6685a 100644 > --- a/kernel/power/hibernate.c > +++ b/kernel/power/hibernate.c > @@ -778,7 +778,12 @@ int hibernate(void) > > ksys_sync_helper(); > > + pr_info("about to freeze filesystems\n"); > + filesystems_freeze(true); > + pr_info("filesystem freeze done\n"); > + > error = freeze_processes(); > + pr_info("process freeze done\n"); > if (error) > goto Exit; > > @@ -788,7 +793,9 @@ int hibernate(void) > if (error) > goto Thaw; > > + pr_info("About to create snapshot\n"); > error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM); > + pr_info("snapshot done\n"); > if (error || freezer_test_done) > goto Free_bitmaps; > > @@ -842,6 +849,8 @@ int hibernate(void) > } > thaw_processes(); > > + filesystems_thaw(true); > + > /* Don't bother checking whether freezer_test_done is true */ > freezer_test_done = false; > Exit: > @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data) > > thaw_processes(); > > + filesystems_thaw(true); > + > exit: > pm_notifier_call_chain(PM_POST_HIBERNATION); > > @@ -1041,6 +1052,7 @@ static int software_resume(void) > > error = load_image_and_restore(); > thaw_processes(); > + filesystems_thaw(true); > Finish: > pm_notifier_call_chain(PM_POST_RESTORE); > Restore: > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c > index 09f8397bae15..34cc5b0c408c 100644 > --- a/kernel/power/suspend.c > +++ b/kernel/power/suspend.c > @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state) > static void suspend_finish(void) > { > suspend_thaw_processes(); > + filesystems_thaw(false); > pm_notifier_call_chain(PM_POST_SUSPEND); > pm_restore_console(); > } > @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state) > trace_suspend_resume(TPS("sync_filesystems"), 0, true); > ksys_sync_helper(); > trace_suspend_resume(TPS("sync_filesystems"), 0, false); > + filesystems_freeze(false); > } > > pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]); > @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state) > pm_pr_dbg("Finishing wakeup.\n"); > suspend_finish(); > Unlock: > + if (sync_on_suspend_enabled) > + filesystems_thaw(false); > mutex_unlock(&system_transition_mutex); > return error; > } > -- > 2.43.0 >
On Fri, 2025-03-28 at 11:08 +0100, Christian Brauner wrote: > On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote: [...] > > + > > +static void filesystems_freeze_callback(struct super_block *sb) > > +{ > > + /* errors don't fail suspend so ignore them */ > > + if (sb->s_op->freeze_super) > > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > > + | FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else if (sb->s_bdev) > > + freeze_super(sb, FREEZE_MAY_NEST | > > FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else { > > + pr_info("Ignoring filesystem %s\n", sb->s_type- > > >name); > > + return; > > + } > > + > > + pr_info("frozen %s, now syncing block ...", sb->s_type- > > >name); > > + sync_blockdev(sb->s_bdev); > > + pr_info("done."); > > +} > > + > > +/** > > + * filesystems_freeze - freeze callback for power management > > + * > > + * Freeze all active filesystems (in reverse superblock order) > > + */ > > +void filesystems_freeze(bool for_hibernate) > > +{ > > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > > + __iterate_supers_rev(filesystems_freeze_callback); > > +} > > + > > +static void filesystems_thaw_callback(struct super_block *sb) > > +{ > > + if (sb->s_op->thaw_super) > > + sb->s_op->thaw_super(sb, FREEZE_MAY_NEST > > + | FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else if (sb->s_bdev) > > + thaw_super(sb, FREEZE_MAY_NEST | > > FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > +} > > + > > +/** > > + * filesystems_thaw - thaw callback for power management > > + * > > + * Thaw all active filesystems (in forward superblock order) > > + */ > > +void filesystems_thaw(bool for_hibernate) > > +{ > > + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; > > + __iterate_supers(filesystems_thaw_callback); > > This doesn't work and I've explained in my reply to Luis how this > doesn't work and what the alternative are: > > A concurrent umount() can wipe the filesystem behind your back. So > you either need an active superblock reference or you need to > communicate that the superblock is locked through the new flag I > proposed (naming irrelevant for now). Since this is a hybrid thread between power management and VFS, could I just summarize what I think the various superblock locks are before discussing the actual problem (important because the previous threads always gave the impression of petering out for fear of vfs locking). s_count: outermost of the superblock locks refcounting the superblock structure itself, making no guarantee that any of the underlying filesystem superblock structures are attached (i.e. kill_sb() may have been called). Taken by incrementing under the global sb_lock and decremented using a put_super() variant. s_active: an atomic reference counting the underlying filesystem specific superblock structures. if you hold s_active, kill_sb cannot be called. Acquired by atomic_inc_not_zero() with a possible failure if it is zero and released by deactivate_super() and its variants. s_umount: rwsem and innermost of the superblock locks. Used to protect various operations from races. Taken exclusively with down_write and shared with down_read. Private functions internal to super.c wrap this with grab_super and super_lock_shared/excl() wrappers. The explicit freeze/thaw_super() functions require the s_umount rwsem in down_write or exclusive mode and take it as the first step in their operation. Looking at the locking in fs_bdev_freeze/thaw() implies that the super_operations freeze_super/thaw_super *don't* need this taken (presumably they handle it internally). Regards, James
On Thu, 2025-03-27 at 19:20 +0100, Jan Kara wrote: > On Thu 27-03-25 10:06:13, James Bottomley wrote: > > Introduce a freeze function, which iterates superblocks in reverse > > order freezing filesystems. The indicator a filesystem is > > freezable is either possessing a s_bdev or a freeze_super method. > > So this can be used in efivarfs, whether the freeze is for > > hibernate is also passed in via the new FREEZE_FOR_HIBERNATE flag. > > > > Thawing is done opposite to freezing (so superblock traversal in > > regular order) and the whole thing is plumbed into power > > management. > > The original ksys_sync() is preserved so the whole freezing step is > > optional (if it fails we're no worse off than we are today) so it > > doesn't inhibit suspend/hibernate if there's a failure. > > > > Signed-off-by: James Bottomley > > <James.Bottomley@HansenPartnership.com> > > OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I > didn't find anything using that flag. What do you plan to use it for? > Does you efivars usecase need it? I find passing down this detail > about the caller down to all filesystems a bit awkward. Isn't it > possible to extract the information "hibernate is ongoing" from PM > subsystem? That's right. I'm happy to post my patch below, but it depends on Al accepting the simple_next_child() proposal, so it doesn't apply to any tree. > > +/* > > + * Kernel freezing and thawing is only done in the power > > management > > + * subsystem and is thus single threaded (so we don't have to > > worry > > + * here about multiple calls to filesystems_freeze/thaw(). > > + */ > > + > > +static int freeze_flags; > > Frankly, the global variable to propagate flags is pretty ugly... If > we really have to propagate some context into the iterator callback, > rather do it explicitly like iterate_supers() does it. Christian said the same thing. I can do it, but if you look in the power management subsystem, it relies on single threading and has a lot of global variables like this, so I thought of this as a simplification. > > +static void filesystems_freeze_callback(struct super_block *sb) > > +{ > > + /* errors don't fail suspend so ignore them */ > > + if (sb->s_op->freeze_super) > > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > > + | FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else if (sb->s_bdev) > > + freeze_super(sb, FREEZE_MAY_NEST | > > FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > Style nit - braces around above blocks would be IMHO appropriate. > > > + else { > > + pr_info("Ignoring filesystem %s\n", sb->s_type- > > >name); > > + return; > > + } > > + > > + pr_info("frozen %s, now syncing block ...", sb->s_type- > > >name); > > + sync_blockdev(sb->s_bdev); > > + pr_info("done."); > > +} > > Generally this callback is not safe because it can race with > filesystem unmount and calling ->freeze_super() after the > filesystem's ->put_super() was called may have all sorts of > interesting effects (freeze_super() itself will just bail with a > warning, which is better but not great either). > > The cleanest way I see how to make the iteration safe is to grab > active sb reference (like grab_super() does it) for the duration of > freeze_super() calls. Another possibility would be to grab sb- > >s_umount rwsem exclusively as Luis does it in his series but that > requires a bit of locking surgery and ->freeze_super() handlers make > this particularly nasty these days so I think active sb reference is > going to be nicer these days. Before getting into the how of this, could you just confirm my understanding of what the various locks do: https://lore.kernel.org/cd5c3d8aab9c5fb37fa018cb3302ecf7d2bdb140.camel@HansenPartnership.com Is correct? Regards, James
On Fri, 2025-03-28 at 10:21 -0400, James Bottomley wrote: > [...] > That's right. I'm happy to post my patch below, but it depends on Al > accepting the simple_next_child() proposal, so it doesn't apply to > any > tree. And here's the patch I forgot to attach. Regards, James --- diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h index ac6a1dd0a6a5..f913b6824289 100644 --- a/fs/efivarfs/internal.h +++ b/fs/efivarfs/internal.h @@ -17,7 +17,6 @@ struct efivarfs_fs_info { struct efivarfs_mount_opts mount_opts; struct super_block *sb; struct notifier_block nb; - struct notifier_block pm_nb; }; struct efi_variable { diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index 7d47f8d7ad1d..3398ec5c60b3 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -119,12 +119,15 @@ static int efivarfs_statfs(struct dentry *dentry, struct kstatfs *buf) return 0; } + +static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who); static const struct super_operations efivarfs_ops = { .statfs = efivarfs_statfs, .drop_inode = generic_delete_inode, .alloc_inode = efivarfs_alloc_inode, .free_inode = efivarfs_free_inode, .show_options = efivarfs_show_options, + .thaw_super = efivarfs_thaw, }; /* @@ -367,8 +370,6 @@ static int efivarfs_fill_super(struct super_block *sb, struct fs_context *fc) if (err) return err; - register_pm_notifier(&sfi->pm_nb); - return efivar_init(efivarfs_callback, sb, true); } @@ -432,24 +433,17 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor, return err; } -static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, - void *ptr) +static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who) { - struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info, - pm_nb); static bool rescan_done = true; - struct dentry *parent = sfi->sb->s_root; + struct dentry *parent = sb->s_root; struct dentry *child = NULL; - if (action == PM_HIBERNATION_PREPARE) { - rescan_done = false; - return NOTIFY_OK; - } else if (action != PM_POST_HIBERNATION) { - return NOTIFY_DONE; - } + if ((who & FREEZE_FOR_HIBERNATE) == 0) + return 0; if (rescan_done) - return NOTIFY_DONE; + return 0; pr_info("efivarfs: resyncing variable state\n"); @@ -488,9 +482,9 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, * then loop over variables, creating them if there's no matching * dentry */ - efivar_init(efivarfs_check_missing, sfi->sb, false); + efivar_init(efivarfs_check_missing, sb, false); - return NOTIFY_OK; + return 0; } static int efivarfs_init_fs_context(struct fs_context *fc) @@ -510,9 +504,6 @@ static int efivarfs_init_fs_context(struct fs_context *fc) fc->s_fs_info = sfi; fc->ops = &efivarfs_context_ops; - sfi->pm_nb.notifier_call = efivarfs_pm_notify; - sfi->pm_nb.priority = 0; - return 0; } @@ -521,7 +512,6 @@ static void efivarfs_kill_sb(struct super_block *sb) struct efivarfs_fs_info *sfi = sb->s_fs_info; blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb); - unregister_pm_notifier(&sfi->pm_nb); kill_litter_super(sb); kfree(sfi);
On Fri, 2025-03-28 at 13:01 +0100, Christian Brauner wrote: > On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote: [...] > > diff --git a/fs/super.c b/fs/super.c > > index 76785509d906..b4b0986414b0 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -1461,6 +1461,67 @@ static struct super_block > > *get_bdev_super(struct block_device *bdev) > > return sb; > > } > > > > +/* > > + * Kernel freezing and thawing is only done in the power > > management > > + * subsystem and is thus single threaded (so we don't have to > > worry > > + * here about multiple calls to filesystems_freeze/thaw(). > > + */ > > + > > +static int freeze_flags; > > + > > +static void filesystems_freeze_callback(struct super_block *sb) > > +{ > > + /* errors don't fail suspend so ignore them */ > > + if (sb->s_op->freeze_super) > > + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST > > + | FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else if (sb->s_bdev) > > + freeze_super(sb, FREEZE_MAY_NEST | > > FREEZE_HOLDER_KERNEL > > + | freeze_flags); > > + else { > > + pr_info("Ignoring filesystem %s\n", sb->s_type- > > >name); > > + return; > > + } > > + > > + pr_info("frozen %s, now syncing block ...", sb->s_type- > > >name); > > + sync_blockdev(sb->s_bdev); > > Unnecessary, either the filesystem is responsible for this if it > provides its own ->freeze_super() or freeze_super() does it in > sync_filesystem. I simply copied it from super.c:fs_bdev_freeze(), so is the sync_blockdev() in there unnecessary as well? Regards, James
> Since this is a hybrid thread between power management and VFS, could I > just summarize what I think the various superblock locks are before > discussing the actual problem (important because the previous threads > always gave the impression of petering out for fear of vfs locking). > > s_count: outermost of the superblock locks refcounting the superblock > structure itself, making no guarantee that any of the underlying > filesystem superblock structures are attached (i.e. kill_sb() may have > been called). Taken by incrementing under the global sb_lock and > decremented using a put_super() variant. and protects the presence of the superblock on the global super lists. > > s_active: an atomic reference counting the underlying filesystem > specific superblock structures. if you hold s_active, kill_sb cannot > be called. Acquired by atomic_inc_not_zero() with a possible failure > if it is zero and released by deactivate_super() and its variants. or deactivate_locked_super() depending on whether s_umount is held or not. > > s_umount: rwsem and innermost of the superblock locks. Used to protect No, it's not innermost. super_lock is a spinlock and obviously doesn't nest with the semaphore. It's almost always the outmost lock for what we're discussing here. Even is the outermost lock with most block device locks. It's also intimately tied into mount code and has implications for the dcache and icache. That's all orthogonal to this thread. > various operations from races. Taken exclusively with down_write and > shared with down_read. Private functions internal to super.c wrap this > with grab_super and super_lock_shared/excl() wrappers. See also the Documentation/filesystems/lock I added. > > The explicit freeze/thaw_super() functions require the s_umount rwsem > in down_write or exclusive mode and take it as the first step in their > operation. Looking at the locking in fs_bdev_freeze/thaw() implies > that the super_operations freeze_super/thaw_super *don't* need this > taken (presumably they handle it internally). Block device locking cannot acquire the s_umount as that would cause lock inversion with the block device open_mutex. The locking scheme using sb_lock and the holder mutex allow safely acquiring the superblock. It's orthogonal to what you're doing though.
On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote: [...] > > > various operations from races. Taken exclusively with down_write > > and shared with down_read. Private functions internal to super.c > > wrap this with grab_super and super_lock_shared/excl() wrappers. > > See also the Documentation/filesystems/lock I added. you mean locking.rst which covers s_umount? It would be nice to add the others as well. > > The explicit freeze/thaw_super() functions require the s_umount > > rwsem in down_write or exclusive mode and take it as the first step > > in their operation. Looking at the locking in > > fs_bdev_freeze/thaw() implies that the super_operations > > freeze_super/thaw_super *don't* need this taken (presumably they > > handle it internally). > > Block device locking cannot acquire the s_umount as that would cause > lock inversion with the block device open_mutex. The locking scheme > using sb_lock and the holder mutex allow safely acquiring the > superblock. It's orthogonal to what you're doing though. OK, but based on the above and the fact that the code has to call either the super op freeze/thaw_super or the global call, I think this can be handled in the callback as something like rather than trying to thread an exclusive s_umount: static void filesystems_thaw_callback(struct super_block *sb) { if (unlikely(!atomic_inc_not_zero(&sb->s_active))) return; if (sb->s_op->thaw_super) sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL | freeze_flags); else if (sb->s_bdev) thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL | freeze_flags); deactivate_super(sb); } Regards, James
On Fri, Mar 28, 2025 at 12:15:55PM -0400, James Bottomley wrote: > On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote: > [...] > > > > > various operations from races. Taken exclusively with down_write > > > and shared with down_read. Private functions internal to super.c > > > wrap this with grab_super and super_lock_shared/excl() wrappers. > > > > See also the Documentation/filesystems/lock I added. > > you mean locking.rst which covers s_umount? It would be nice to add > the others as well. > > > > The explicit freeze/thaw_super() functions require the s_umount > > > rwsem in down_write or exclusive mode and take it as the first step > > > in their operation. Looking at the locking in > > > fs_bdev_freeze/thaw() implies that the super_operations > > > freeze_super/thaw_super *don't* need this taken (presumably they > > > handle it internally). > > > > Block device locking cannot acquire the s_umount as that would cause > > lock inversion with the block device open_mutex. The locking scheme > > using sb_lock and the holder mutex allow safely acquiring the > > superblock. It's orthogonal to what you're doing though. > > OK, but based on the above and the fact that the code has to call > either the super op freeze/thaw_super or the global call, I think this > can be handled in the callback as something like rather than trying to > thread an exclusive s_umount: Eww, no. We're not going to open-code that in two different places. > static void filesystems_thaw_callback(struct super_block *sb) > { > if (unlikely(!atomic_inc_not_zero(&sb->s_active))) > return; > > if (sb->s_op->thaw_super) > sb->s_op->thaw_super(sb, FREEZE_MAY_NEST > | FREEZE_HOLDER_KERNEL > | freeze_flags); > else if (sb->s_bdev) > thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL > | freeze_flags); > > deactivate_super(sb); > } This is broken. The freeze/thaw functions cannot be called with s_umount held otherwise they deadlock. And not holding s_umount while taking an active reference count isn't supported as we're optimistically dropping reference counts. We're not introducing exceptions to that scheme for no good reason. The other option is to move everything into the caller and bring back get_active_super() and then add SUPER_ITER_UNLOCKED instead of SUPER_ITER_GRAB. That's what I've done now.
diff --git a/fs/super.c b/fs/super.c index 76785509d906..b4b0986414b0 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev) return sb; } +/* + * Kernel freezing and thawing is only done in the power management + * subsystem and is thus single threaded (so we don't have to worry + * here about multiple calls to filesystems_freeze/thaw(). + */ + +static int freeze_flags; + +static void filesystems_freeze_callback(struct super_block *sb) +{ + /* errors don't fail suspend so ignore them */ + if (sb->s_op->freeze_super) + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST + | FREEZE_HOLDER_KERNEL + | freeze_flags); + else if (sb->s_bdev) + freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL + | freeze_flags); + else { + pr_info("Ignoring filesystem %s\n", sb->s_type->name); + return; + } + + pr_info("frozen %s, now syncing block ...", sb->s_type->name); + sync_blockdev(sb->s_bdev); + pr_info("done."); +} + +/** + * filesystems_freeze - freeze callback for power management + * + * Freeze all active filesystems (in reverse superblock order) + */ +void filesystems_freeze(bool for_hibernate) +{ + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; + __iterate_supers_rev(filesystems_freeze_callback); +} + +static void filesystems_thaw_callback(struct super_block *sb) +{ + if (sb->s_op->thaw_super) + sb->s_op->thaw_super(sb, FREEZE_MAY_NEST + | FREEZE_HOLDER_KERNEL + | freeze_flags); + else if (sb->s_bdev) + thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL + | freeze_flags); +} + +/** + * filesystems_thaw - thaw callback for power management + * + * Thaw all active filesystems (in forward superblock order) + */ +void filesystems_thaw(bool for_hibernate) +{ + freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0; + __iterate_supers(filesystems_thaw_callback); +} + /** * fs_bdev_freeze - freeze owning filesystem of block device * @bdev: block device diff --git a/include/linux/fs.h b/include/linux/fs.h index cbbb704eff74..de154e9379ec 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate * * Indicate who the owner of the freeze or thaw request is and whether * the freeze needs to be exclusive or can nest. @@ -2285,6 +2286,7 @@ enum freeze_holder { FREEZE_HOLDER_KERNEL = (1U << 0), FREEZE_HOLDER_USERSPACE = (1U << 1), FREEZE_MAY_NEST = (1U << 2), + FREEZE_FOR_HIBERNATE = (1U << 3), }; struct super_operations { @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path) int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); +void filesystems_freeze(bool for_hibernate); +void filesystems_thaw(bool for_hibernate); + #endif /* _LINUX_FS_H */ diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 10a01af63a80..fc2106e6685a 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -778,7 +778,12 @@ int hibernate(void) ksys_sync_helper(); + pr_info("about to freeze filesystems\n"); + filesystems_freeze(true); + pr_info("filesystem freeze done\n"); + error = freeze_processes(); + pr_info("process freeze done\n"); if (error) goto Exit; @@ -788,7 +793,9 @@ int hibernate(void) if (error) goto Thaw; + pr_info("About to create snapshot\n"); error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM); + pr_info("snapshot done\n"); if (error || freezer_test_done) goto Free_bitmaps; @@ -842,6 +849,8 @@ int hibernate(void) } thaw_processes(); + filesystems_thaw(true); + /* Don't bother checking whether freezer_test_done is true */ freezer_test_done = false; Exit: @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data) thaw_processes(); + filesystems_thaw(true); + exit: pm_notifier_call_chain(PM_POST_HIBERNATION); @@ -1041,6 +1052,7 @@ static int software_resume(void) error = load_image_and_restore(); thaw_processes(); + filesystems_thaw(true); Finish: pm_notifier_call_chain(PM_POST_RESTORE); Restore: diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c index 09f8397bae15..34cc5b0c408c 100644 --- a/kernel/power/suspend.c +++ b/kernel/power/suspend.c @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state) static void suspend_finish(void) { suspend_thaw_processes(); + filesystems_thaw(false); pm_notifier_call_chain(PM_POST_SUSPEND); pm_restore_console(); } @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state) trace_suspend_resume(TPS("sync_filesystems"), 0, true); ksys_sync_helper(); trace_suspend_resume(TPS("sync_filesystems"), 0, false); + filesystems_freeze(false); } pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]); @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state) pm_pr_dbg("Finishing wakeup.\n"); suspend_finish(); Unlock: + if (sync_on_suspend_enabled) + filesystems_thaw(false); mutex_unlock(&system_transition_mutex); return error; }
Introduce a freeze function, which iterates superblocks in reverse order freezing filesystems. The indicator a filesystem is freezable is either possessing a s_bdev or a freeze_super method. So this can be used in efivarfs, whether the freeze is for hibernate is also passed in via the new FREEZE_FOR_HIBERNATE flag. Thawing is done opposite to freezing (so superblock traversal in regular order) and the whole thing is plumbed into power management. The original ksys_sync() is preserved so the whole freezing step is optional (if it fails we're no worse off than we are today) so it doesn't inhibit suspend/hibernate if there's a failure. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> --- fs/super.c | 61 ++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 5 ++++ kernel/power/hibernate.c | 12 ++++++++ kernel/power/suspend.c | 4 +++ 4 files changed, 82 insertions(+)