[v2,0/6] Extend freeze support to suspend and hibernate

Message ID 20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org (mailing list archive)

Christian Brauner March 29, 2025, 8:42 a.m. UTC
Add the necessary infrastructure changes to support freezing for suspend
and hibernate.

Just got back from LSFMM. So still jetlagged and likelihood of bugs
increased. This should be all that's needed to wire up power.

This will be in vfs-6.16.super shortly.

---
Changes in v2:
- Don't grab a reference in the iterator; make that a requirement for the
  callers that need custom behavior.
- Link to v1: https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org

---
Christian Brauner (6):
      super: remove pointless s_root checks
      super: simplify user_get_super()
      super: skip dying superblocks early
      super: use a common iterator (Part 1)
      super: use common iterator (Part 2)
      super: add filesystem freezing helpers for suspend and hibernate

 fs/super.c         | 201 ++++++++++++++++++++++++++++++++---------------------
 include/linux/fs.h |   4 +-
 2 files changed, 126 insertions(+), 79 deletions(-)
---
base-commit: acb4f33713b9f6cadb6143f211714c343465411c
change-id: 20250328-work-freeze-0a446869cd62

Comments

James Bottomley March 29, 2025, 2:04 p.m. UTC | #1
On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> Add the necessary infrastructure changes to support freezing for
> suspend and hibernate.
> 
> Just got back from LSFMM. So still jetlagged and likelihood of bugs
> increased. This should all that's needed to wire up power.
> 
> This will be in vfs-6.16.super shortly.
> 
> ---
> Changes in v2:
> - Don't grab reference in the iterator make that a requirement for
> the callers that need custom behavior.
> - Link to v1:
> https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org

Given I've been a bit quiet on this, I thought I'd better explain
what's going on: I do have these built, but I made the mistake of doing
a dist-upgrade on my testing VM master image and it pulled in a version
of systemd (257.4-3) that has a broken hibernate.  Since I upgraded in
place I don't have the old image so I'm spending my time currently
debugging systemd ... normal service will hopefully resume shortly.

Regards,

James
James Bottomley March 29, 2025, 5:02 p.m. UTC | #2
On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > Add the necessary infrastructure changes to support freezing for
> > suspend and hibernate.
> > 
> > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > increased. This should all that's needed to wire up power.
> > 
> > This will be in vfs-6.16.super shortly.
> > 
> > ---
> > Changes in v2:
> > - Don't grab reference in the iterator make that a requirement for
> > the callers that need custom behavior.
> > - Link to v1:
> > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> 
> Given I've been a bit quiet on this, I thought I'd better explain
> what's going on: I do have these built, but I made the mistake of
> doing a dist-upgrade on my testing VM master image and it pulled in a
> version of systemd (257.4-3) that has a broken hibernate.  Since I
> upgraded in place I don't have the old image so I'm spending my time
> currently debugging systemd ... normal service will hopefully resume
> shortly.

I found the systemd bug

https://github.com/systemd/systemd/issues/36888

And hacked around it, so I can confirm a simple hibernate/resume works
provided the sb_start_write() patches are applied (and the hooks are
plumbed in to pm).

There is an oddity: the systemd-journald process that would usually
hang hibernate in D wait goes into R but seems to be hung and can't be
killed by the watchdog even with a -9.  Its stack trace says it's
still stuck in sb_start_write:

[<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
[<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
[<0>] do_page_mkwrite+0x38/0xa0
[<0>] do_wp_page+0xd5/0xba0
[<0>] __handle_mm_fault+0xa29/0xca0
[<0>] handle_mm_fault+0x16a/0x2d0
[<0>] do_user_addr_fault+0x3ab/0x810
[<0>] exc_page_fault+0x68/0x150
[<0>] asm_exc_page_fault+0x22/0x30

So I think there's something funny going on in thaw.

Regards,

James
Christian Brauner March 30, 2025, 8:33 a.m. UTC | #3
On Sat, Mar 29, 2025 at 01:02:32PM -0400, James Bottomley wrote:
> On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > Add the necessary infrastructure changes to support freezing for
> > > suspend and hibernate.
> > > 
> > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > increased. This should all that's needed to wire up power.
> > > 
> > > This will be in vfs-6.16.super shortly.
> > > 
> > > ---
> > > Changes in v2:
> > > - Don't grab reference in the iterator make that a requirement for
> > > the callers that need custom behavior.
> > > - Link to v1:
> > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > 
> > Given I've been a bit quiet on this, I thought I'd better explain
> > what's going on: I do have these built, but I made the mistake of
> > doing a dist-upgrade on my testing VM master image and it pulled in a
> > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > upgraded in place I don't have the old image so I'm spending my time
> > currently debugging systemd ... normal service will hopefully resume
> > shortly.
> 
> I found the systemd bug
> 
> https://github.com/systemd/systemd/issues/36888

I don't think that's a systemd bug.

> And hacked around it, so I can confirm a simple hibernate/resume works
> provided the sd_start_write() patches are applied (and the hooks are
> plumbed in to pm).
> 
> There is an oddity: the systemd-journald process that would usually
> hang hibernate in D wait goes into R but seems to be hung and can't be
> killed by the watchdog even with a -9.  It's stack trace says it's
> still stuck in sb_start_write:
> 
> [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> [<0>] do_page_mkwrite+0x38/0xa0
> [<0>] do_wp_page+0xd5/0xba0
> [<0>] __handle_mm_fault+0xa29/0xca0
> [<0>] handle_mm_fault+0x16a/0x2d0
> [<0>] do_user_addr_fault+0x3ab/0x810
> [<0>] exc_page_fault+0x68/0x150
> [<0>] asm_exc_page_fault+0x22/0x30
> 
> So I think there's something funny going on in thaw.

My uneducated guess is that it's probably an issue with ext4 freezing
and unfreezing. xfs stops workqueues after all writes and pagefault
writers have stopped. This is done in ->sync_fs() when it's called from
freeze_super(). They are restarted when ->unfreeze_fs is called.

But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
that should be safe to do but I'm not sure if there can't be other work
coming in on it before the actual freeze call. Jan will be able to
explain this a lot better. I don't have time today to figure out what
this does.
Christian Brauner March 30, 2025, 11:53 a.m. UTC | #4
On Sun, Mar 30, 2025 at 10:33:53AM +0200, Christian Brauner wrote:
> On Sat, Mar 29, 2025 at 01:02:32PM -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > > increased. This should all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab reference in the iterator make that a requirement for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > > upgraded in place I don't have the old image so I'm spending my time
> > > currently debugging systemd ... normal service will hopefully resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> 
> I don't think that's a systemd bug.
> 
> > And hacked around it, so I can confirm a simple hibernate/resume works
> > provided the sd_start_write() patches are applied (and the hooks are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't be
> > killed by the watchdog even with a -9.  It's stack trace says it's
> > still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> My uneducated guess is that it's probably an issue with ext4 freezing
> and unfreezing. xfs stops workqueues after all writes and pagefault
> writers have stopped. This is done in ->sync_fs() when it's called from
> freeze_super(). They are restarted when ->unfreeze_fs is called.
> 
> But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
> that should be safe to do but I'm not sure if there can't be other work
> coming in on it before the actual freeze call. Jan will be able to
> explain this a lot better. I don't have time today to figure out what
> this does.

Though I'm just looking at the patch snippet you posted for how you
hooked up efivarfs in https://lore.kernel.org/r/a7e6dee45ac11519c33a297797990fce6bb32bff.camel@HansenPartnership.com
and that looks pretty broken and is probably the root cause. You have:

+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who);
 static const struct super_operations efivarfs_ops = {
        .statfs = efivarfs_statfs,
        .drop_inode = generic_delete_inode,
        .alloc_inode = efivarfs_alloc_inode,
        .free_inode = efivarfs_free_inode,
        .show_options = efivarfs_show_options,
+       .thaw_super = efivarfs_thaw,
 };

Which adds ->thaw_super() without ->freeze_super() which means that
->thaw_super() is never called for efivarfs.

But also it's broken in other ways. You're not waiting for writers to
finish. Which is most often fine because efivarfs shouldn't be written
to that heavily but still this won't work and you need to call the
generic VFS helpers.

I'm appending a draft for how to do this with efivarfs. Note, I don't
have the means/time to test this right now. Would you please plumb in
your recursive removal into my patch and test it? I'm pushing it to
vfs-6.16.super for now (It likely will fail due to unused helpers right
now because I gutted the recursive removal.).
James Bottomley March 30, 2025, 2 p.m. UTC | #5
On Sun, 2025-03-30 at 10:33 +0200, Christian Brauner wrote:
[...]
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> 
> I don't think that's a systemd bug.

Heh, well I have zero interest in refereeing a turf war between systemd
and dracut over mismatched expectations.  The point for anyone who
wants to run hibernate tests is that until they both sort this out the
bug can be fixed by removing the system identifier check from
systemd-hibernate-resume-generator.

> > And hacked around it, so I can confirm a simple hibernate/resume
> > works provided the sd_start_write() patches are applied (and the
> > hooks are plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't
> > be killed by the watchdog even with a -9.  It's stack trace says
> > it's still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> My uneducated guess is that it's probably an issue with ext4 freezing
> and unfreezing. xfs stops workqueues after all writes and pagefault
> writers have stopped. This is done in ->sync_fs() when it's called
> from freeze_super(). They are restarted when ->unfreeze_fs is called.

It is possible, but I note that if I do

fsfreeze --freeze /

I can produce exactly the above stack trace in systemd-journald, but if
I unfreeze root it continues on normally.  Thus I think this is some
type of bad interaction with the process freezing that goes on in
hibernate.  I'm going to see if I can replicate using the cgroup
freezer.

> But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
> that should be safe to do but I'm not sure if there can't be other
> work coming in on it before the actual freeze call. Jan will be able
> to explain this a lot better. I don't have time today to figure out
> what this does.

Understood.  The above is for Jan if he'd like to think about it.

Regards,

James
Christian Brauner March 31, 2025, 9:13 a.m. UTC | #6
On Sun, Mar 30, 2025 at 10:00:56AM -0400, James Bottomley wrote:
> On Sun, 2025-03-30 at 10:33 +0200, Christian Brauner wrote:
> [...]
> > > I found the systemd bug
> > > 
> > > https://github.com/systemd/systemd/issues/36888
> > 
> > I don't think that's a systemd bug.
> 
> Heh, well I have zero interest in refereeing a turf war between systemd
> and dracut over mismatched expectations.  The point for anyone who
> wants to run hibernate tests is that until they both sort this out the
> bug can be fixed by removing the system identifier check from systemd-
> hibernate-resume-generator.
> 
> > > And hacked around it, so I can confirm a simple hibernate/resume
> > > works provided the sd_start_write() patches are applied (and the
> > > hooks are plumbed in to pm).
> > > 
> > > There is an oddity: the systemd-journald process that would usually
> > > hang hibernate in D wait goes into R but seems to be hung and can't
> > > be killed by the watchdog even with a -9.  It's stack trace says
> > > it's still stuck in sb_start_write:
> > > 
> > > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > > [<0>] do_page_mkwrite+0x38/0xa0
> > > [<0>] do_wp_page+0xd5/0xba0
> > > [<0>] __handle_mm_fault+0xa29/0xca0
> > > [<0>] handle_mm_fault+0x16a/0x2d0
> > > [<0>] do_user_addr_fault+0x3ab/0x810
> > > [<0>] exc_page_fault+0x68/0x150
> > > [<0>] asm_exc_page_fault+0x22/0x30
> > > 
> > > So I think there's something funny going on in thaw.
> > 
> > My uneducated guess is that it's probably an issue with ext4 freezing
> > and unfreezing. xfs stops workqueues after all writes and pagefault
> > writers have stopped. This is done in ->sync_fs() when it's called
> > from freeze_super(). They are restarted when ->unfreeze_fs is called.
> 
> It is possible, but I note that if I do
> 
> fsfreeze --freeze /

Freezing the root filesystem from userspace will inevitably lead to an
odd form of deadlock eventually. Either the first accidental request for
opening something as writable or even the call to fsfreeze --unfreeze /
may deadlock.

The most likely explanation for this stacktrace is that the root
filesystem isn't unfrozen. In userspace it's easy enough to trigger by
leaving the filesystem frozen without also freezing userspace processes
accessing that filesystem:

[  243.232205] INFO: task systemd-journal:539 blocked for more than 120 seconds.
[  243.239491]       Not tainted 6.14.0-g9ad3884269ca #131
[  243.243771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.248517] task:systemd-journal state:D stack:0     pid:539   tgid:539   ppid:1      task_flags:0x400100 flags:0x00000006
[  243.253480] Call Trace:
[  243.254641]  <TASK>
[  243.255663]  __schedule+0x61e/0x1080
[  243.257071]  ? percpu_rwsem_wait+0x149/0x1b0
[  243.258473]  schedule+0x3a/0x120
[  243.259533]  percpu_rwsem_wait+0x155/0x1b0
[  243.260844]  ? __pfx_percpu_rwsem_wake_function+0x10/0x10
[  243.262620]  __percpu_down_read+0x83/0x1c0
[  243.263968]  btrfs_page_mkwrite+0x45b/0x890 [btrfs]
[  243.266828]  ? find_held_lock+0x2b/0x80
[  243.267765]  do_page_mkwrite+0x4a/0xb0
[  243.268698]  do_wp_page+0x331/0xdc0
[  243.269559]  __handle_mm_fault+0xb15/0x11d0
[  243.270566]  handle_mm_fault+0xb8/0x2b0
[  243.271557]  do_user_addr_fault+0x20a/0x700
[  243.272574]  exc_page_fault+0x6a/0x200
[  243.273462]  asm_exc_page_fault+0x26/0x30

This happens because systemd-journald mmaps the journal file. It
triggers a pagefault which wants to get pagefault based write access to
the file. But it can't because pagefaults are frozen. So it hangs and as
it's not frozen it will trigger hung task warnings.

IOW, the most likely explanation is that the root filesystem wasn't
unfrozen and systemd-journald wasn't frozen.
Jan Kara March 31, 2025, 10:36 a.m. UTC | #7
On Sat 29-03-25 13:02:32, James Bottomley wrote:
> On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > Add the necessary infrastructure changes to support freezing for
> > > suspend and hibernate.
> > > 
> > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > increased. This should all that's needed to wire up power.
> > > 
> > > This will be in vfs-6.16.super shortly.
> > > 
> > > ---
> > > Changes in v2:
> > > - Don't grab reference in the iterator make that a requirement for
> > > the callers that need custom behavior.
> > > - Link to v1:
> > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > 
> > Given I've been a bit quiet on this, I thought I'd better explain
> > what's going on: I do have these built, but I made the mistake of
> > doing a dist-upgrade on my testing VM master image and it pulled in a
> > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > upgraded in place I don't have the old image so I'm spending my time
> > currently debugging systemd ... normal service will hopefully resume
> > shortly.
> 
> I found the systemd bug
> 
> https://github.com/systemd/systemd/issues/36888
> 
> And hacked around it, so I can confirm a simple hibernate/resume works
> provided the sd_start_write() patches are applied (and the hooks are
> plumbed in to pm).
> 
> There is an oddity: the systemd-journald process that would usually
> hang hibernate in D wait goes into R but seems to be hung and can't be
> killed by the watchdog even with a -9.  It's stack trace says it's
> still stuck in sb_start_write:
> 
> [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> [<0>] do_page_mkwrite+0x38/0xa0
> [<0>] do_wp_page+0xd5/0xba0
> [<0>] __handle_mm_fault+0xa29/0xca0
> [<0>] handle_mm_fault+0x16a/0x2d0
> [<0>] do_user_addr_fault+0x3ab/0x810
> [<0>] exc_page_fault+0x68/0x150
> [<0>] asm_exc_page_fault+0x22/0x30
> 
> So I think there's something funny going on in thaw.

As Christian wrote, it seems systemd-journald does a memory store to
mmapped file and gets blocked on sb_start_write() while doing the page
fault. What's strange is that R state. Is the task really executing on some
CPU, or does it only have 'R' state (i.e., got woken but never scheduled)?

								Honza
James Bottomley March 31, 2025, 2:49 p.m. UTC | #8
On Mon, 2025-03-31 at 12:36 +0200, Jan Kara wrote:
> On Sat 29-03-25 13:02:32, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing
> > > > for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of
> > > > bugs
> > > > increased. This should all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab reference in the iterator make that a requirement
> > > > for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled
> > > in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since
> > > I
> > > upgraded in place I don't have the old image so I'm spending my
> > > time
> > > currently debugging systemd ... normal service will hopefully
> > > resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> > 
> > And hacked around it, so I can confirm a simple hibernate/resume
> > works
> > provided the sd_start_write() patches are applied (and the hooks
> > are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't
> > be killed by the watchdog even with a -9.  It's stack trace says
> > it's still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> As Christian wrote, it seems systemd-journald does a memory store to
> mmapped file and gets blocked on sb_start_write() while doing the
> page fault. What's strange is that R state. Is the task really
> executing on some CPU or it only has 'R' state (i.e., got woken but
> never scheduled)?

Yes, ps shows it definitely stuck in R state.  The trace above
identifies the rwsem being at set_current_state() which seems to imply
it never returns from schedule() even though it's in state R.

I've actually managed to reproduce this now just doing filesystem
freeze and thaw without using the freezer, so I'll continue
investigating.

Regards,

James
Christian Brauner March 31, 2025, 11:33 p.m. UTC | #9
On Mon, Mar 31, 2025 at 12:36:27PM +0200, Jan Kara wrote:
> On Sat 29-03-25 13:02:32, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > > increased. This should all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab reference in the iterator make that a requirement for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > > upgraded in place I don't have the old image so I'm spending my time
> > > currently debugging systemd ... normal service will hopefully resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> > 
> > And hacked around it, so I can confirm a simple hibernate/resume works
> > provided the sd_start_write() patches are applied (and the hooks are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't be
> > killed by the watchdog even with a -9.  It's stack trace says it's
> > still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> As Christian wrote, it seems systemd-journald does a memory store to
> mmapped file and gets blocked on sb_start_write() while doing the page
> fault. What's strange is that R state. Is the task really executing on some
> CPU or it only has 'R' state (i.e., got woken but never scheduled)?

I think the issue is that we need to also make pagefault based writers
such as systemd-journald freezable:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..528e73f192ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
        percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-                                  level == SB_FREEZE_WRITE);
+                                  (level == SB_FREEZE_WRITE ||
+                                   level == SB_FREEZE_PAGEFAULT));
 }