diff mbox series

[v2,2/4] fs: allow all writers to be frozen

Message ID 20250402-work-freeze-v2-2-6719a97b52ac@kernel.org (mailing list archive)
State New
Headers show
Series power: wire-up filesystem freeze/thaw with suspend/resume | expand

Commit Message

Christian Brauner April 2, 2025, 2:07 p.m. UTC
During freeze/thaw we need to be able to freeze all writers during
suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
file and write to it will not be frozen after we've already frozen the
filesystem.

This has some risk of not being able to freeze processes in case a
process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
filesystem is frozen, a task can block on the frozen filesystem with
e.g., mmap_sem held. If some other task then blocks on grabbing that
mmap_sem, hibernation ill fail because it is unable to hibernate a task
holding mmap_sem. This could be fixed by making a range of filesystem
related locks use freezable sleeping. That's impractical and not
warranted just for suspend/hibernate. Assume that this is an infrequent
problem and we've given userspace a way to skip filesystem freezing
through a sysfs file.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Christian Brauner April 2, 2025, 3:32 p.m. UTC | #1
On Wed, Apr 02, 2025 at 04:07:32PM +0200, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
> 
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation ill fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/linux/fs.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
>  }

Jan, one more thought about freezability here. We know that there will
can be at least one process during hibernation that ends up generating
page faults and that's systemd-journald. When systemd-sleep requests
writing a hibernation image via /sys/power/ files it will inevitably end
up freezing systemd-journald and it may be generating a page fault with
->mmap_lock held. systemd-journald is now sleeping with
SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know this can cause
hibernation to fail. That part is fine. What isn't is that we will very
likely always trigger:

#ifdef CONFIG_LOCKDEP
        /*
         * It's dangerous to freeze with locks held; there be dragons there.
         */
        if (!(state & __TASK_FREEZABLE_UNSAFE))
                WARN_ON_ONCE(debug_locks && p->lockdep_depth);
#endif

with lockdep enabled.

So we really actually need percpu_rswem_read_freezable_unsafe(), i.e.,
TASK_FREEZABLE_UNSAFE.
James Bottomley April 2, 2025, 4:03 p.m. UTC | #2
On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
[...]
> Jan, one more thought about freezability here. We know that there
> will can be at least one process during hibernation that ends up
> generating page faults and that's systemd-journald. When systemd-
> sleep requests writing a hibernation image via /sys/power/ files it
> will inevitably end up freezing systemd-journald and it may be
> generating a page fault with ->mmap_lock held. systemd-journald is
> now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> this can cause hibernation to fail. That part is fine. What isn't is
> that we will very likely always trigger:
> 
> #ifdef CONFIG_LOCKDEP
>         /*
>          * It's dangerous to freeze with locks held; there be dragons
> there.
>          */
>         if (!(state & __TASK_FREEZABLE_UNSAFE))
>                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> #endif
> 
> with lockdep enabled.
> 
> So we really actually need percpu_rswem_read_freezable_unsafe(),
> i.e., TASK_FREEZABLE_UNSAFE.

The sched people have pretty strong views about people not doing this,
expressed in the comment in sched.h and commit f5d39b020809
("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
variants got removed with prejudice.

If we do get into this situation the worst that can happen is that
another upper lock acquisition triggers a hibernate failure and we thaw
everything, thus we can never truly deadlock, which is the fear, so
perhaps they might be OK with this.  Note that Rafael's solution to
this was to disable lockdep around hibernate/suspend and resume, which
is another possibility.

Regards,

James
Christian Brauner April 2, 2025, 4:13 p.m. UTC | #3
On Wed, Apr 02, 2025 at 12:03:24PM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
> [...]
> > Jan, one more thought about freezability here. We know that there
> > will can be at least one process during hibernation that ends up
> > generating page faults and that's systemd-journald. When systemd-
> > sleep requests writing a hibernation image via /sys/power/ files it
> > will inevitably end up freezing systemd-journald and it may be
> > generating a page fault with ->mmap_lock held. systemd-journald is
> > now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> > this can cause hibernation to fail. That part is fine. What isn't is
> > that we will very likely always trigger:
> > 
> > #ifdef CONFIG_LOCKDEP
> >         /*
> >          * It's dangerous to freeze with locks held; there be dragons
> > there.
> >          */
> >         if (!(state & __TASK_FREEZABLE_UNSAFE))
> >                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > #endif
> > 
> > with lockdep enabled.
> > 
> > So we really actually need percpu_rswem_read_freezable_unsafe(),
> > i.e., TASK_FREEZABLE_UNSAFE.
> 
> The sched people have pretty strong views about people not doing this,
> expressed in the comment in sched.h and commit f5d39b020809
> ("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
> variants got removed with prejudice.
> 
> If we do get into this situation the worst that can happen is that
> another upper lock acquisition triggers a hibernate failure and we thaw
> everything, thus we can never truly deadlock, which is the fear, so

Yes, I know that it's harmless but we need to not generate misleading
lockdep splats when lockdep is turned on and confuse users.

> perhaps they might be OK with this.  Note that Rafael's solution to
> this was to disable lockdep around hibernate/suspend and resume, which
> is another possibility.

It can be done as a follow-up. I'm just saying it needs treatment.
diff mbox series

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..1edcba3cd68e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1781,8 +1781,7 @@  static inline void __sb_end_write(struct super_block *sb, int level)
 
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
-	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-				   level == SB_FREEZE_WRITE);
+	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)