Message ID | 20170519002032.GA21202@birch.djwong.org (mailing list archive) |
---|---|
State | New, archived |
On Fri, May 19, 2017 at 3:20 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> Therefore, add a reboot hook to freeze all filesystems (which in general
> will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> This is an unfortunate and insufficient workaround for multiple layers
> of inadequate external software, but at least it will reduce boot time
> surprises for the "OS updater failed to disengage the filesystem before
> rebooting" case.
>

Darrick,

Did you consider how many support calls this will generate for a stuck
reboot command?

I can think of at least one situation where this is guaranteed to hang.
See this patch for the details:
https://patchwork.kernel.org/patch/6266791/

The referenced patch was applied to Android kernel to prevent
system crash on emergency remount-ro via sysrq trigger.

I don't know if it was even seriously considered by Al, because
I got no comment, but I do realize that the change of behavior
could generate support calls, so it's scary to make that change
in mainline.

I know it's not going to work around broken system software update,
but how about providing sysrq trigger for emergency_freeze_all()?
like emergency_remount(), but stronger.
And this time, iterate supers in reverse order like I suggested to
avoid loop mounted fs freeze dependencies.

There is one little tiny problem though. Eric used up the last sysrq trigger
key for emergency_thaw_all(). Do you see the irony in that? ;)

I am wondering how many people know about or use the emergency
thaw trigger, but one dodgy option is to use the 't' trigger to toggle
thaw_all/freeze_all.

Another perhaps slightly less dodgy option is to trigger freeze_all
on a sequence of sysrq "emergency" triggers where it makes sense
and is least likely to change any existing behavior, for example:

echo u > /proc/sysrq-trigger
# Remember if do_emergency_remount() completed with failures
echo u > /proc/sysrq-trigger
# Escalate to emergency freeze

OR

echo u > /proc/sysrq-trigger
# Remember if do_emergency_remount() completed with failures
echo s > /proc/sysrq-trigger
# Sync *after* remount r/o? That must mean emergency freeze

I bet that system software that is already aware of and is issuing
emergency remount r/o trigger prior to reboot, won't see any harm
in adding an extra u/s trigger for good luck.

Do you know if the gnarly system software in question is issuing
emergency remount r/o prior to reboot?

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:

> Therefore, add a reboot hook to freeze all filesystems (which in general
> will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> This is an unfortunate and insufficient workaround for multiple layers
> of inadequate external software, but at least it will reduce boot time
> surprises for the "OS updater failed to disengage the filesystem before
> rebooting" case.

As a maintainer of one of those userspace tools
(https://github.com/ostreedev/ostree), which I don't think is the one
in question here, but likely has the same issue - I'd like to have
some sort of API to fix this - maybe flush the journal *without*
remounting r/o?

Unlike the case you're talking about with rebooting into a special
update mode, libostree constructs a new root with hardlinks while
the system is running. Hence, system downtime is just reboot, like
dual-partition update systems, except we're more flexible.

Although hm...I guess an API to flush the journal would only narrow
the race.

Is the single partition case really just doomed?
On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> As a maintainer of one of those userspace tools (https://github.com/ostreedev/ostree),
> which I don't think is the one in question here, but likely has the same
> issue - I'd like to have some sort of API to fix this - maybe flush the journal *without*
> remounting r/o?
>
> Unlike the case you're talking about with rebooting into a special
> update mode, libostree constructs a new root with hardlinks while
> the system is running. Hence, system downtime is just reboot, like
> dual-partition update systems, except we're more flexible.
>
> Although hm...I guess an API to flush the journal would only narrow
> the race.
>
> Is the single partition case really just doomed?

One of the things that came up when Darrick and I discussed this on
the weekly ext4 developer's conference call was our mutual wonderment
that none of the userspace tools implemented a reboot by creating a
tmpfs chroot, pivoting into the chroot, and then unmounting all of the
remaining file systems.

This would also allow update schemes who want to enable various new
file system features, or upgrade the root file system somehow, to be
able to do so while the root file system is completely and cleanly
unmounted.

The other thing that would be useful is if grub2 would actually be
able to replay the file system journal --- but given that grub2 is
GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
attempts of teams attempting to do clean room reimplementations of
complex code bases for licensing reasons only (cough, make_ext4fs,
*cough*) have not necessarily turned out well, I'm at least not going
to hold my breath.

- Ted
On Fri, May 19, 2017, at 11:27 AM, Theodore Ts'o wrote:
>
> One of the things that came up when Darrick and I discussed this on
> the weekly ext4 developer's conference call was our mutual wonderment
> that none of the userspace tools implemented a reboot by created a
> tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> remaining file systems.

On general purpose systems we have a tmpfs chroot already: the initramfs.

Although IIRC, systemd will only switch back to it on shutdown I think
only if you have a root storage daemon enabled:
https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/

That said I'd like to focus on the harder case: supporting
powerloss/system lockup on single-partition systems. IMO, the shutdown
case is just a special variant of that where the user asked nicely for
the system to halt =)
(See also https://en.wikipedia.org/wiki/Crash-only_software)

I was thinking about this a bit, and I think if userspace tools (like
ostree) *delayed* their updates to /boot until shutdown, then we could
ensure that on powerloss, the system is unchanged. (In a traditional
dpkg/rpm scenario where you only have one userspace root, you'd end up
with old kernel + new rootfs, but that's exactly the problem ostree
solves)

That narrows the problem down to keeping `/boot` consistent at
shutdown time. AIUI, a problem here is that XFS doesn't flush the
journal on `syncfs`, only on unmount? And from what I can tell,
even the `XFS_IOC_FREEZE` ioctl won't do that either.

So as far as I can see, a userspace API to ensure the journal is
flushed on a mounted filesystem is going to be necessary for
the general case. I don't have a strong opinion on whether or not
that's `syncfs()` - if it's e.g. a `XFS_IOC_FREEZE` `_THAW` pair
that seems OK to me too.
On Fri, May 19, 2017, at 12:34 PM, Colin Walters wrote:
>
> So as far as I can see, a userspace API to ensure the journal is
> flushed on a mounted filesystem is going to be necessary for
> the general case. I don't have a strong opinion on whether or not
> that's `syncfs()` - if it's e.g. a `XFS_IOC_FREEZE` `_THAW` pair
> that seems OK to me too.

Or (thinking about this more) maybe we indeed could implement that
today by pivoting back to the initramfs, and using umount()+mount() as
our "syncfs() + journal flush" implementation.

Basically when we have to update /boot, we unmount, then remount again
and add new kernel+initramfs, unmount, remount and
mv(/boot/grub2.conf.new,/boot/grub2.conf), then finally unmount again.

In current design this would require keeping the initramfs resident in
memory just for this purpose, or to re-synthesize it on shutdown. Not
impossible, but it'd sure be simpler if say syncfs() had a flags
argument and there were a special "flush the journal" argument for this.
On Fri, May 19, 2017 at 12:34:29PM -0400, Colin Walters wrote:
> > One of the things that came up when Darrick and I discussed this on
> > the weekly ext4 developer's conference call was our mutual wonderment
> > that none of the userspace tools implemented a reboot by creating a
> > tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> > remaining file systems.
>
> On general purpose systems we have a tmpfs chroot already: the initramfs.

Aren't we discarding the initramfs after we've pivoted away from it,
to save on memory? Keeping the tmpfs chroot around forever would be a
waste of memory, and in some cases, especially if you are using a
distribution kernel, the initramfs chroot can be rather large.

Creating a tmpfs chroot that was only good enough to manage the
shutdown would be pretty easy, though; the number of files you would
need would be quite small.

> That narrows the problem down to keeping `/boot` consistent at
> shutdown time. AIUI, a problem here is that XFS doesn't flush the
> journal on `syncfs`, only on unmount? And from what I can tell,
> even the `XFS_IOC_FREEZE` ioctl won't do that either.

I believe the log *is* checkpointed on an XFS_IOC_FREEZE.

- Ted
On Fri, May 19, 2017 at 11:29:04AM +0300, Amir Goldstein wrote:
> On Fri, May 19, 2017 at 3:20 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> >
> > Therefore, add a reboot hook to freeze all filesystems (which in general
> > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > This is an unfortunate and insufficient workaround for multiple layers
> > of inadequate external software, but at least it will reduce boot time
> > surprises for the "OS updater failed to disengage the filesystem before
> > rebooting" case.
> >
>
> Darrick,
>
> Did you consider how many support calls this will generate for a stuck
> reboot command?
>
> I can think of at least one situation where this is guaranteed to hang.
> See this patch for the details:
> https://patchwork.kernel.org/patch/6266791/
>
> The referenced patch was applied to Android kernel to prevent
> system crash on emergency remount-ro via sysrq trigger.

Hmmm, I agree that we ought to avoid hanging on loopmounted filesystems,
and that iterating superblocks backwards is one (rough) way to do that.

> I don't know if it was even seriously considered by Al, because
> I got no comment, but I do realize that the change of behavior
> could generate support calls, so it's scary to make that change
> in mainline.
>
> I know it's not going to work around broken system software update,
> but how about providing sysrq trigger for emergency_freeze_all()?
> like emergency_remount(), but stronger.
> And this time, iterate supers in reverse order like I suggested to
> avoid loop mounted fs freeze dependencies.
>
> There is one little tiny problem though. Eric used up the last sysrq trigger
> key for emergency_thaw_all(). Do you see the irony in that? ;)

LOL.

> I am wondering how many people know about or use the emergency
> thaw trigger, but one dodgy option is to use the 't' trigger to toggle
> thaw_all/freeze_all.
>
> Another perhaps slightly less dodgy option is to trigger freeze_all
> on a sequence of sysrq "emergency" triggers where it makes sense
> and is least likely to change any existing behavior, for example:
>
> echo u > /proc/sysrq-trigger
> # Remember if do_emergency_remount() completed with failures
> echo u > /proc/sysrq-trigger
> # Escalate to emergency freeze

Or maybe it's simpler just to have a counter -- three sysrq-u in a row
and we freeze all?

> OR
>
> echo u > /proc/sysrq-trigger
> # Remember if do_emergency_remount() completed with failures
> echo s > /proc/sysrq-trigger
> # Sync *after* remount r/o? That must mean emergency freeze
>
> I bet that system software that is already aware of and is issuing
> emergency remount r/o trigger prior to reboot, won't see any harm
> in adding an extra u/s trigger for good luck.
>
> Do you know if the gnarly system software in question is issuing
> emergency remount r/o prior to reboot?

It does not.

--D

>
> Amir.
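The "three sysrq-u in a row" counter suggested in the reply above can be modelled in a few lines. This is only a userspace sketch of the proposed escalation rule, not kernel code: the names sysrq_state, sysrq_action and sysrq_escalate() are hypothetical, and no such behavior exists in the sysrq handler discussed in this thread.

```c
/* Hypothetical actions the sysrq handler could take for one key press. */
enum sysrq_action { ACT_NONE, ACT_REMOUNT_RO, ACT_FREEZE_ALL };

/* Hypothetical state: number of consecutive 'u' keys seen so far. */
struct sysrq_state {
    unsigned consecutive_u;
};

/*
 * Model of the proposed rule: each 'u' asks for an emergency remount r/o,
 * and the third 'u' in a row escalates to freezing all filesystems.
 * Any other key resets the counter; its own action is out of scope here.
 */
enum sysrq_action sysrq_escalate(struct sysrq_state *st, char key)
{
    if (key != 'u') {
        st->consecutive_u = 0;
        return ACT_NONE;
    }
    st->consecutive_u++;
    if (st->consecutive_u >= 3) {
        st->consecutive_u = 0;  /* start counting again after escalating */
        return ACT_FREEZE_ALL;
    }
    return ACT_REMOUNT_RO;
}
```

One design question the sketch leaves open, as the thread does, is whether the counter should also reset after a timeout rather than only on a different key.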
On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:
>
> > Therefore, add a reboot hook to freeze all filesystems (which in general
> > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > This is an unfortunate and insufficient workaround for multiple layers
> > of inadequate external software, but at least it will reduce boot time
> > surprises for the "OS updater failed to disengage the filesystem before
> > rebooting" case.
>
> As a maintainer of one of those userspace tools
> (https://github.com/ostreedev/ostree), which I don't think is the one
> in question here, but likely has the same issue - I'd like to have
> some sort of API to fix this - maybe flush the journal *without*
> remounting r/o?

The convention (at least among ext4 and xfs) is that fs freeze should be
checkpointing the journal.

> Unlike the case you're talking about with rebooting into a special
> update mode, libostree constructs a new root with hardlinks while
> the system is running. Hence, system downtime is just reboot, like
> dual-partition update systems, except we're more flexible.
>
> Although hm...I guess an API to flush the journal would only narrow
> the race.
>
> Is the single partition case really just doomed?

Probably. TBH given the current behavior of grub, I would always have a
separate /boot to minimize the amount it's allowed to touch. :)

--D
On Fri, May 19, 2017 at 11:27:34AM -0400, Theodore Ts'o wrote:
> On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> > As a maintainer of one of those userspace tools (https://github.com/ostreedev/ostree),
> > which I don't think is the one in question here, but likely has the same
> > issue - I'd like to have some sort of API to fix this - maybe flush the journal *without*
> > remounting r/o?
> >
> > Unlike the case you're talking about with rebooting into a special
> > update mode, libostree constructs a new root with hardlinks while
> > the system is running. Hence, system downtime is just reboot, like
> > dual-partition update systems, except we're more flexible.
> >
> > Although hm...I guess an API to flush the journal would only narrow
> > the race.
> >
> > Is the single partition case really just doomed?
>
> One of the things that came up when Darrick and I discussed this on
> the weekly ext4 developer's conference call was our mutual wonderment
> that none of the userspace tools implemented a reboot by creating a
> tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> remaining file systems.

systemd seems to have the ability to do this -- if something dumps an
executable into /run/initramfs/shutdown (and remounts /run with 'exec')
then systemd will pivot to this script which can then kill everything it
needs and then unmount the filesystems. Or upgrade the fs. Seeing as
the rootfs is still mounted ro at the point that the shutdown script is
run, it could pull in whatever tools it wants. Or inject malware, I
guess. :P

In any case, I don't think it's unreasonable to want a system updater to
be able to detect that the fs containing vmlinuz and initrd hasn't
unmounted at the end of the upgrade, and therefore it needs to resort to
stronger tactics to forcibly unmount it before systemd reboots.

> This would also allow update schemes who want to enable various new
> file system features, or upgrade the root file system somehow, to be
> able to do so while the root file system is completely and cleanly
> unmounted.
>
> The other thing that would be useful is if grub2 would actually be
> able to replay the file system journal --- but given that grub2 is

Gross! :)

I don't think the XFS community will be enthusiastic about supporting
whatever wreckage may come out of that.

> GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
> attempts of teams attempting to do clean room reimplementations of
> complex code bases for licensing reasons only (cough, make_ext4fs,
> *cough*) have not necessarily turned out well, I'm at least not going
> to hold my breath.

Err... yes, but that's a different thread altogether.

--D

>
> - Ted
On Fri 19-05-17 11:27:34, Ted Tso wrote:
> The other thing that would be useful is if grub2 would actually be
> able to replay the file system journal --- but given that grub2 is
> GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
> attempts of teams attempting to do clean room reimplementations of
> complex code bases for licensing reasons only (cough, make_ext4fs,
> *cough*) have not necessarily turned out well, I'm at least not going
> to hold my breath.

Boot loader really should *not* write to the filesystem. Firstly, it
would have to be completely separate codebase running under very
different constraints (real mode, no real memory management, etc) so
there's no easy way to share the code with any other userspace libraries
and thus the code will be inherently buggy.

Secondly, think of stuff like suspend to disk - if someone touches the
filesystem in any way under the hands of suspended kernel, file system
corruption is very likely to follow sooner or later. Just last year,
I've spent couple of interesting days hunting down ext4 corruption on
s390 only to find out that the boot procedure(*) there ended up
replaying the journal under suspended kernel...

(*) Just for the ones interested in mainframe woes: s390 in SLES doesn't
use grub to parse the filesystem (as it is too difficult to access some
storage types from the boot loader AFAIU) so it uses "first-stage" Linux
kernel to mount the root filesystem, finds proper kernel image there and
then kexecs into it...

Honza
Resurrecting this thread:

On Fri, May 19, 2017, at 03:01 PM, Darrick J. Wong wrote:
> On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> > On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:
> >
> > > Therefore, add a reboot hook to freeze all filesystems (which in general
> > > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > > This is an unfortunate and insufficient workaround for multiple layers
> > > of inadequate external software, but at least it will reduce boot time
> > > surprises for the "OS updater failed to disengage the filesystem before
> > > rebooting" case.
> >
> > As a maintainer of one of those userspace tools
> > (https://github.com/ostreedev/ostree), which I don't think is the one
> > in question here, but likely has the same issue - I'd like to have
> > some sort of API to fix this - maybe flush the journal *without*
> > remounting r/o?
>
> The convention (at least among ext4 and xfs) is that fs freeze should be
> checkpointing the journal.

OK, so I finally implemented this:
https://github.com/ostreedev/ostree/pull/1049

I had to go to some awkward lengths to try to make this safe; everything
in libostree is designed to be "crash only" - we're an update system
that doesn't install a SIGINT/SIGTERM handler, we just let the kernel
kill us, and that should always be safe. But if we're interrupted right
after we invoke FIFREEZE we'd leave the fs frozen.

Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?

I was thinking about this more though, and while this obviously helps,
it's still just narrowing a window; if we have a system crash after
writing the config but before we've done a freeze-thaw, we still have
the journaled data problem.

In the end the real fix is probably something like storing multiple
copies of the bootloader config with checksums that grub can verify.
Basically teach grub to try really hard to extract known-good data from
the FS. For file-level consistency that'd be pretty easy, we could have
e.g.

/boot/efi/grub.cfg
/boot/efi/grub.cfg.checksum (sha256 of grub.cfg)
/boot/efi/grub.cfg.orig
/boot/efi/grub.cfg.orig.checksum (sha256 of grub.cfg.orig)

etc. But what I don't know offhand without diving a lot more into
XFS internals is how resilient such a scheme would be against the
outstanding journal writes for the directory. (Maybe it's more resilient
to use separate /boot/efi/grub-new and /boot/efi/grub-old dirs?)
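For reference, the freeze-then-thaw pair discussed in the message above might look roughly like this from userspace. This is a minimal sketch assuming the Linux FIFREEZE and FITHAW ioctls from <linux/fs.h>; the helper name freeze_thaw_fs() is hypothetical and is not taken from libostree. It needs CAP_SYS_ADMIN, and it does nothing about the window described above: a process killed between the two ioctls leaves the filesystem frozen.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: freeze and immediately thaw the filesystem that
 * contains `mountpoint`. By the convention mentioned in this thread, the
 * freeze is expected to checkpoint the journal on ext4 and xfs.
 * Returns 0 on success, -1 on failure with errno set.
 */
int freeze_thaw_fs(const char *mountpoint)
{
    int fd = open(mountpoint, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FIFREEZE, 0) < 0) {
        int saved = errno;      /* e.g. EPERM, EBUSY, EOPNOTSUPP */
        close(fd);
        errno = saved;
        return -1;
    }

    /* The filesystem is frozen here; thaw it right away. */
    int rc = 0;
    int saved = 0;
    if (ioctl(fd, FITHAW, 0) < 0) {
        rc = -1;
        saved = errno;
    }
    close(fd);
    if (rc < 0)
        errno = saved;
    return rc;
}
```

A caller would pass the mount point of the filesystem holding the bootloader config, for example freeze_thaw_fs("/boot"), after writing and syncing the new files.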
> Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?

It's going to be completely trivial, which argues for it. The only
points left would be bikeshedding over the name, and how to describe
its semantics.

> in the end probably the real fix is probably something like storing
> multiple copies of the bootloader config with checksums that grub
> can verify. Basically teach grub to try really hard to extract known-good
> data from the FS. For file-level consistency that'd be pretty easy,
> we could have e.g.

The real answer is to have a filesystem that does the above for you
for the boot partition, e.g. one where the kernel and grub have
a common consistency protocol.
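The "multiple copies with checksums" idea quoted above boils down to a fallback loop on the reader's side. The sketch below is a model under loud assumptions: the copies are in-memory buffers rather than files under /boot/efi, 32-bit FNV-1a stands in for the sha256 named in the thread, and cfg_copy, fnv1a() and pick_valid_copy() are hypothetical names, not grub or ostree interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum (32-bit FNV-1a); the thread proposes sha256 instead. */
uint32_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;       /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Hypothetical model of one config copy and its separately stored checksum. */
struct cfg_copy {
    const char *name;       /* e.g. "grub.cfg" or "grub.cfg.orig" */
    const void *data;       /* contents as read back from disk */
    size_t len;
    uint32_t stored_sum;    /* contents of the matching .checksum file */
};

/*
 * Return the first copy, in preference order, whose contents match its
 * stored checksum, or NULL if every copy fails verification.
 */
const struct cfg_copy *pick_valid_copy(const struct cfg_copy *copies, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fnv1a(copies[i].data, copies[i].len) == copies[i].stored_sum)
            return &copies[i];
    return NULL;
}
```

This only addresses file-level consistency; it says nothing about the open question in the thread of whether the directory entries themselves are readable before log replay.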
On Sat, Aug 05, 2017 at 07:16:21AM -0700, Christoph Hellwig wrote:
> > Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?
>
> It's going to be completely trivial, which argues for it. The only
> points left would be bikeshedding over the name, and how to describe
> its semantics.

FSCHECKPOINT? Since that's your requirement anyway...

"Ensures that all filesystem metadata (which may be in a journal
somewhere) has been checkpointed back to disk." ?

--D

> > in the end probably the real fix is probably something like storing
> > multiple copies of the bootloader config with checksums that grub
> > can verify. Basically teach grub to try really hard to extract known-good
> > data from the FS. For file-level consistency that'd be pretty easy,
> > we could have e.g.
>
> The real answer is to have a filesystem that does the above for you
> for the boot partition, e.g. one where the kernel and grub have
> a common consistency protocol.
On Sat, Aug 05, 2017 at 08:45:28AM -0700, Darrick J. Wong wrote:
> FSCHECKPOINT? Since that's your requirement anyway...
>
> "Ensures that all filesystem metadata (which may be in a journal
> somewhere) has been checkpointed back to disk." ?

What about a file system that is entirely log structured and just
garbage collects sometimes?
On Fri, Aug 11, 2017 at 03:02:30AM -0700, Christoph Hellwig wrote:
> On Sat, Aug 05, 2017 at 08:45:28AM -0700, Darrick J. Wong wrote:
> > FSCHECKPOINT? Since that's your requirement anyway...
> >
> > "Ensures that all filesystem metadata (which may be in a journal
> > somewhere) has been checkpointed back to disk." ?
>
> What about a file system that is entirely log structured and just
> garbage collects some times?

"Ensure that <insert insufficient engineering insult here> external fs
drivers can find files on disk." :P

--D
On Fri, Aug 11, 2017 at 09:26:20AM -0700, Darrick J. Wong wrote:
> "Ensure that <insert insufficient engineering insult here> external fs
> drivers can find files on disk." :P

If they are correctly implemented they can always access it anyway..
diff --git a/fs/super.c b/fs/super.c
index adb0c0d..4a9deaa 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
+#include <linux/reboot.h>
 
 #include "internal.h"
 
@@ -1529,3 +1530,32 @@ int thaw_super(struct super_block *sb)
 	return 0;
 }
 EXPORT_SYMBOL(thaw_super);
+
+static void fsreboot_freeze_sb(struct super_block *sb, void *priv)
+{
+	int error;
+
+	up_read(&sb->s_umount);
+	error = freeze_super(sb);
+	down_read(&sb->s_umount);
+	if (error && error != -EBUSY)
+		printk(KERN_NOTICE "%s (%s): Unable to freeze, error=%d",
+		       sb->s_type->name, sb->s_id, error);
+}
+
+static int fsreboot_freeze(struct notifier_block *nb, ulong event, void *buf)
+{
+	iterate_supers(fsreboot_freeze_sb, NULL);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block fsreboot_notifier = {
+	.notifier_call = fsreboot_freeze,
+	.priority = INT_MAX,
+};
+
+static int __init fsreboot_init(void)
+{
+	return register_reboot_notifier(&fsreboot_notifier);
+}
+__initcall(fsreboot_init);
Apparently, there are users out there with a single gigantic journalled
rootfs and some gnarly system software. If the user reboots into
"offline system update" mode to install a kernel update, the system
control software has no provision to kick the cute splash screen off its
writable file descriptor down in /var/log somewhere before unmounting,
remount-ro'ing, and thus reboots the system... with a live rw rootfs!

Since the journal may not have been checkpointed immediately prior to
the reboot, a subsequent invocation of the hapless user's grubby
bootloader sees obsolete metadata because the newest data is safely in
the log, but the log needs to be replayed. Weirdly, the bootloader is
fine with reading files off a dirty filesystem (though really, can you
imagine log replay in x86 real mode?) but still tries to read files and
the boot fails until someone intervenes to replay the journal.

Therefore, add a reboot hook to freeze all filesystems (which in general
will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
This is an unfortunate and insufficient workaround for multiple layers
of inadequate external software, but at least it will reduce boot time
surprises for the "OS updater failed to disengage the filesystem before
rebooting" case.

Seeing as the world has been drifting towards grubbiness (except for
those booting straight off a flabby unjournalled fs via firmware), this
seems like the least crappy solution to this problem. Yes, you're still
screwed in grub if the system crashes. :)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/super.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)