[LSF/MM,TOPIC,NOTES] Phasing out kernel thread freezing

Below are my notes about this session at LSF/MM 2018.

kthread freezer woes

	Commit fc558a7496bf ("[PATCH] swsusp: finally solve mysqld problem")
	by Rafael J. Wysocki <rjw@sisk.pl> on v2.6.17 moved try_to_freeze()
	to kernel/signal.c on get_signal_to_deliver(), now get_signal().

  * pcmcia on the pccardd kthread -- still present today
  * USB core hub_thread()
  * filesystmes
  * mm pdflush() context -- the original writeback daemon, "dirty page" flush
	pdflush added on v2.5.8. Prior to this we had bdflush(),
	and that was kicked off if try_to_free_buffers() was called on a page
	and it was determined not all the buffers of a page could be freed.
	Simple daemon to provide a dynamic response to dirty buffers.
	Its job was to writeback a limited number of buffers to disk and go
	back to sleep again.
  * sunrpc svc_recv()

kthread_freezable_should_stop()
-------------------------------

Commit 8a32c441c1609 ("freezer: implement and use
kthread_freezable_should_stop()" added via v3.3 by Tejun.
This commit claims you should *not* use try_to_freeze(), instead
you *should* use kthread_freezable_should_stop() if your kthread
is freezable....

But yet... only 4 kernel users!

The only filesystem using it is NFS on nfs4_callback_svc(). What gives?

Getting the kthread freezer right
---------------------------------

At the 2015 Kernel summit at South Korea, Jiri Kosina suggested we should
phase it out long term. The semantics are loose and experience shows that
it is difficult to get right. Best to phase it out if possible.

What to do - address filesystems first
--------------------------------------

Phasing out the kthread freezer is hard so lets divide and conquer.
Let's first address this on filesystems properly. Simple:

	freeze_fs() for filesystems that implement the callback

Patches are ongoing, so expect a new series soon for filesystem.
But what's left after this?

Filesystems with freeze_fs() and those using try_to_freeze()
------------------------------------------------------------

The current patches being worked on address filesystems which
implement freeze_fs(). Let's review these. The filesystems which
implement freeze_fs():

  * xfs
  * reiserfs
  * nilfs2
  * jfs
  * f2fs
  * ext4
  * ext2
  * btrfs

Of these, the following have freezer helpers, which can then be removed after
the kernel automaticaly calls freeze_fs() for us on suspend:

  * xfs
  * nilfs2
  * jfs
  * f2fs
  * ext4

What about other filesystems?

Order considerations
--------------------

Current simple solution:

	iterate_supers_reverse_excl() on freeze
	iterate_supers() for thaw

We acknowledged at LSF/MM that this order is not sufficient and definitely not
perfect. We would need a proper infrastructure to have a Directed Acyclic Graph
(DAG) we can rely on.  We do not have that today. We discussed possibly adding
infrastructure for this. We determined that this work is long term -- we're
fine to move on with life with the above simple implementation for now and
acknowledge the imperfect solution should work in most cases. Al Viro gave
an example where the order is not respected today.

You can use a loopback device to mount a file image on a filesystem A, and
at a later point in time dynamically change the backing device to another
file present on filesystem B using the loopback ioctl LOOP_CHANGE_FD --
loop_change_fd().

/*                                                                              
 * loop_change_fd switched the backing store of a loopback device to            
 * a new file. This is useful for operating system installers to free up        
 * the original file and in High Availability environments to switch to         
 * an alternative location for the content in case of server meltdown.          
 * This can only work if the loop device is used read-only, and if the          
 * new backing store is the same size and type as the old backing store.        
 */                                                                             
static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,    
                          unsigned int arg)  
{
	...
}

The ioctl then currently enables the order for the a superblock to change
dynamically and break the above order assumptions to be correct.

It was mentioned it is believed some distributions (Fedora?) installers may use
this to complete installation and give control to the system withot rebooting.
If true, then suspend order *could* break in this case.

Should a flag be added to disable use of freeze_fs() if such ioctl is used?
Al didn't seem too happy with the idea of working around this by moving the
superblock around to the right order after the ioctl work.

One of the issues with the violation of this ordering is that its not possible
to detect violations to odering, given if you could do this, you may be able
to address ordering. Its a catch-22 situation. However, if there *are* known
cases where such violations do happen, one option is can skip the suspend
framework for filesystems later.

Long term we want proper infrastructure to address ordering, which can address
this example corner case. Other possible use cases for a DAG on the superblocks
is emergency remount.

Possible issues
---------------

David Howells noted future possible races with suspend/freeze and automount.
Should automount be skipped during the general suspend/resume cycle?

NFS
---

For NFS Jeff Layton has suggested to have freeze_fs() make the RPC 
engine "park" newly issued RPCs for that fs' client onto a rpc_wait_queue.
Any RPC that has already been sent however, we need to wait for a reply.

unfreeze_fs() can then just have the engine stop parking RPCs and wake up the
waitq. He however points out that if we're interested in making the cgroup
freezer also work, then we may need to do a bit more work to ensure that we
don't end up with frozen tasks squatting on VFS locks.

Is it true that cgroups freezer want tasks to not hold VFS locks prior to
freezing?

Dave Chinner notes that freezing a filesystem pretty much guarantees the
opposite - that tasks *will freeze when holding VFS locks* - the cgroup freezer
is broken by design *if* it requires tasks to be frozen without holding any
VFS/filesystem lock context, and as such we *should* be able to ignore it.

At LSF/MM it was acknowledged cgroup freezer may be broken, folks who would
care are aware, and a long term fix is needed.

Why the device model cannot be used for filesystems
---------------------------------------------------

Darick suggested that since the filesystem ordering is a layering problem we
should consider applying the device model to filesystems / block layer. Turns
out suspend of filesystems cannot be tied to suspend of devices.

# Do not use device model - breaks hibernation

Hibernation is one example of an issue.

Filesystems ordering is different that device ordering on a system.

During hibernation devices are suspended/resumed twice. When you hibernate
you need to freeze the filesystem, create the image, resume all devices to then
be able to store the image, and then suspend devices again.

Device ordering is not representative of the ordering in which you mount
filesystems. That has its own order, and suspending by bus order can be
different than the mount order. Suspending filesystems in the incorrect mount
order may yield in the inability for one filesystem to properly flush all
pending IO.

The ordering is not something which is only needed for suspend and hibernation
though. We'll need proper ordering for snapshotting, and if we later grow
the idea of snapshotting on files for the notion of subvolumes, odering
may play an important role here as well?

The current order strategy proposed is simply to iterate in reverse on all
superblocks, this should *in most cases* reflect the inverse of mount order.
The superblock on top will be the last mounted filesystem.

Filesystems on loopback freeze before the lower filesystem that hosts the
loopback image.

What about kernel filesystem which do not implement freeze_fs()?
----------------------------------------------------------------

Issuing a generic sync for filesystems which implement freeze_super() may
suffice.

cgroup notes
------------

Recall from Documentation/cgroup-v1/freezer-subsystem.txt:

	$ echo $$
	16644
	$ bash
	$ echo $$
	16690

	From a second, unrelated bash shell:
	$ kill -SIGSTOP 16690
	$ kill -SIGCONT 16690

	<at this point 16690 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it
responds to them.

FUSE drivers
------------

The implementation is only making use of filesystems which implement
freeze_fs() callback, it doesn't ouch other filesystems. It should
therefore technically not regress FUSE filesystems.

But there are still problems which need to be addressed on FUSE.

# New mount API with userspace freeze handler

One idea is that the kernel can notify userspace about the freeze and let
it deal with what it has to. There is a new mount call API being evolved,
one idea is to provide a userspace freeze hanlder call with this new mount
API.

It was punted that perhaps some of these things are ideas which can be worked
on under the Project Springfield umbrella [0].

[0] https://springfield-project.github.io/

# FUSE drivers and dirty data

https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html

Paul Mackerras:

	"What happens if there is an encfs filesystem mounted,
	and we freeze all userspace processes and then do a sys_sync()?  Does
	the system wait forever for the encfs filesystem to write out its
	dirty data?"

Yes.

Technically sync() and the freezer should in theory address most considerations?

# An example broken approach

One could iterate over an XFS filesystems or FUSE filesystem and do something
like, but *this* script below is fragile and should be broken. Consider the
cgroup notes above, but this is also horribly slow and fragile. This uses the
filesystem freeze ioctl but also sends SIGSTOP/SIGCONT to processes. For FUSE
it could use whatever freeze op FUSE drivers come up with. But again, horribly
broken and why this needs proper infrastructure.

#!/bin/bash
set -e

XFS_FREEZE="/usr/sbin/xfs_freeze"
PROC_MOUNTS="/proc/mounts"
SUSPEND_SIGNAL="SIGSTOP"

error_quit()
{
        echo "$1" >&2
        exit 1
}

check-system()
{
        [ -r "${PROC_MOUNTS}" ] || error_quit "ERROR: cannot find or read ${PROC_MOUNTS}"
        [ -x "${XFS_FREEZE}" ] || error_quit "ERROR: cannot find or execute ${XFS_FREEZE}"
}

run-fs-freeze()
{
        local i FSTYPE MNT ROOTDEV ARGS

        while read ROOTDEV MNT FSTYPE ARGS; do
            [ "$ROOTDEV" = "rootfs" ] && continue
            [ "$MNT" = "/" ] && continue

            case $FSTYPE in
            xfs)
                echo "  Trying to freeze userspace processes on '$FSTYPE' mounted on ${MNT}... "
                for i in $(lsof +D $MNT -t 2>/dev/null); do kill -$SUSPEND_SIGNAL $i; done
                echo -n "  Trying to freeze fstype '$FSTYPE' mounted on ${MNT}... "
                $XFS_FREEZE -f $MNT
                echo "OK!"
                ;;
            *)
                ;;
            esac

        done < $PROC_MOUNTS
}

run-fs-unfreeze()
{
        local i FSTYPE MNT ROOTDEV ARGS

        while read ROOTDEV MNT FSTYPE ARGS; do
            [ "$ROOTDEV" = "rootfs" ] && continue
            [ "$MNT" = "/" ] && continue

            case $FSTYPE in
            xfs)
                echo "  Trying to unfreeze userspace processes on '$FSTYPE' mounted on ${MNT}... "
                for i in $(lsof +D $MNT -t 2>/dev/null); do kill -SIGCONT $i; done
                echo -n "  Trying to unfreeze fstype '$FSTYPE' mounted on ${MNT}... "
                $XFS_FREEZE -u $MNT
                echo "OK!"
                ;;
            *)
                ;;
            esac

        done < $PROC_MOUNTS
}

check-system

if [ "$2" = suspend ]; then
        echo "INFO: running $0 for $2"
else
        echo "INFO: running $0 for $2"
fi

if [ "$1" = pre ] ; then
        run-fs-freeze
fi
if [ "$1" = post ] ; then
        run-fs-unfreeze
fi

Future ideas
------------

# Flushing

Flushing is *not* fully needed as per Rafael, and he suggests this actually
creates high latencies for suspend. Addressing this as per Jan Kara is complex.
Jan Kara also notes that switching the fs freeze implementation to avoid
sync(2) if asked to is quite independent from implementing system suspend to
use fs freezing.

Can patches to make 'background buffered writeback not suck' help with future
suspend sync latencies?

Probably not?

Jens Axboe's patches:
https://lwn.net/Articles/681763/
https://lkml.org/lkml/2016/3/23/310

What about dynamically throttling the amount of writeback allowed through a
grace period, moments prior to suspend?

[LSF/MM,TOPIC,NOTES] Phasing out kernel thread freezing

Commit Message

Patch