[v2,1/1] mnt: add support for non-rootfs initramfs

Message ID 20200331124017.2252-2-ignat@cloudflare.com (mailing list archive)
State New, archived
Series an option to place initramfs in a leaf tmpfs instead of rootfs

Commit Message

Ignat Korchagin March 31, 2020, 12:40 p.m. UTC
The main need for this is to support container runtimes on stateless Linux
systems (pivot_root system call from initramfs).

Normally, the task of initramfs is to mount and switch to a "real" root
filesystem. However, on stateless systems (booting over the network) it is just
convenient to have your "real" filesystem as initramfs from the start.

This, however, breaks different container runtimes, because they usually use
pivot_root system call after creating their mount namespace. But pivot_root does
not work from initramfs, because initramfs runs from rootfs, which is the root
of the mount tree and can't be unmounted.
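
For illustration, the sequence a container runtime typically performs can be
sketched in userspace C roughly as follows. This is a minimal sketch and not
the code of any particular runtime: enter_new_root() and sys_pivot_root() are
hypothetical names, and the pivot_root(".", ".") idiom is the one described in
the pivot_root(2) manual page. When the caller's root is rootfs, the pivot_root
step is the one expected to fail with EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() and CLONE_NEWNS */
#endif
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no pivot_root() wrapper, so use the raw system call. */
static int sys_pivot_root(const char *new_root, const char *put_old)
{
	return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/*
 * Hypothetical helper: make new_root the root of a fresh mount namespace.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int enter_new_root(const char *new_root)
{
	struct stat st;

	assert(new_root != NULL);

	/* Fail early, before touching any namespace, if the target is unusable. */
	if (stat(new_root, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;

	/* Private mount namespace, so nothing propagates back to the host. */
	if (unshare(CLONE_NEWNS) != 0)
		return -1;
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
		return -1;

	/* pivot_root needs new_root to be a mount point: bind it onto itself. */
	if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
		return -1;
	if (chdir(new_root) != 0)
		return -1;

	/* Stack the old root on top of the new one, then detach the old root. */
	if (sys_pivot_root(".", ".") != 0)
		return -1;	/* EINVAL is expected when the current root is rootfs */
	if (umount2(".", MNT_DETACH) != 0)
		return -1;

	return chdir("/");
}
```

The caller needs CAP_SYS_ADMIN; on any failure the function returns -1 and
leaves errno from the failing call for the caller to inspect.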

One workaround is to do:

  mount --bind / /

However, that defeats one of the purposes of using pivot_root in the cloned
containers: getting rid of the host root filesystem, should the code somehow
escape the chroot.

There is a way to solve this problem from userspace, but it is much more
cumbersome:
  * either we have to create a multilayered archive for initramfs, where the
    outer layer creates a tmpfs filesystem, unpacks the inner layer, switches
    root and does not forget to properly clean up the old rootfs
  * or we need to use the keepinitrd kernel cmdline option, unpack initramfs to
    rootfs, run a script to create our target tmpfs root, unpack the same
    initramfs there, switch root to it and again properly clean up the old root,
    thus unpacking the same archive twice and also wasting memory, because
    the kernel keeps the compressed initramfs image indefinitely.

With this change we can ask the kernel (by specifying nonroot_initramfs kernel
cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
before the initramfs handling code, so initramfs gets unpacked directly into
the "leaf" tmpfs with rootfs being empty and no need to clean up anything.
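
As a rough illustration, userspace could check whether the option was given by
looking for it as a whole word in the kernel command line text (for example the
contents of /proc/cmdline). This is only a sketch: cmdline_has_flag() is a
hypothetical helper, and nonroot_initramfs is the option proposed by this patch,
not an existing mainline parameter. The whole-word match mirrors the handler in
the patch, which declines the parameter when anything follows its name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if `flag` appears in `cmdline` as a whole
 * whitespace-delimited word, 0 otherwise. "flag=value" does not match.
 */
int cmdline_has_flag(const char *cmdline, const char *flag)
{
	const char *p;
	size_t flen;

	assert(cmdline != NULL && flag != NULL);

	p = cmdline;
	flen = strlen(flag);
	if (flen == 0)
		return 0;

	while ((p = strstr(p, flag)) != NULL) {
		/* The match must start at the beginning or right after whitespace. */
		int starts_word = (p == cmdline) ||
				  isspace((unsigned char)p[-1]);
		/* The match must be followed by whitespace or the end of the string. */
		int ends_word = p[flen] == '\0' ||
				isspace((unsigned char)p[flen]);

		if (starts_word && ends_word)
			return 1;
		p++;	/* keep scanning after a partial match */
	}
	return 0;
}
```

For example, cmdline_has_flag("console=ttyS0 nonroot_initramfs quiet",
"nonroot_initramfs") returns 1, while "nonroot_initramfs=1" does not match.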

This also brings the behaviour in line with the older style initrd, where the
initrd is located on some leaf filesystem in the mount tree and rootfs remains
empty.
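
From userspace, one rough way to tell the two layouts apart is to look at the
filesystem type of the mount at "/" in /proc/self/mountinfo. The sketch below
assumes the format documented in proc(5) and that rootfs is reported with the
type name "rootfs"; root_is_rootfs() is a hypothetical helper that works on the
file contents passed in as a string, so it is a heuristic rather than a
definitive check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the given mountinfo text contains a mount
 * at "/" whose filesystem type is "rootfs", 0 otherwise.
 */
int root_is_rootfs(const char *mountinfo)
{
	const char *line;

	assert(mountinfo != NULL);

	line = mountinfo;
	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);
		char buf[1024];
		char mnt[256], fstype[64];

		if (len < sizeof(buf)) {
			const char *sep;

			memcpy(buf, line, len);
			buf[len] = '\0';

			/* Field 5 is the mount point; the type follows " - ". */
			sep = strstr(buf, " - ");
			if (sep &&
			    sscanf(buf, "%*s %*s %*s %*s %255s", mnt) == 1 &&
			    sscanf(sep + 3, "%63s", fstype) == 1 &&
			    strcmp(mnt, "/") == 0 &&
			    strcmp(fstype, "rootfs") == 0)
				return 1;
		}
		line = end ? end + 1 : line + len;
	}
	return 0;
}
```

With the behaviour described above, the root entry would be expected to show
tmpfs instead of rootfs once the kernel has switched to the leaf mount.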

Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
---
 .../admin-guide/kernel-parameters.txt         |  9 +++-
 fs/namespace.c                                | 47 +++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

Comments

Aleksa Sarai April 1, 2020, 6:36 a.m. UTC | #1
On 2020-03-31, Ignat Korchagin <ignat@cloudflare.com> wrote:
> [...]

I know this is a bit of a stretch, but I thought I'd ask -- is it
possible to solve the problem with pivot_root(2) without requiring this
workaround (and an additional cmdline option)?

From the container runtime side of things, most runtimes do support
working on initramfs but it requires disabling pivot_root(2) support (in
the runc world this is --no-pivot-root). We would love to be able to
remove support for disabling pivot_root(2) because lots of projects have
been shipping with pivot_root(2) disabled (such as minikube until
recently[1]) -- which opens such systems to quite a few breakout and
other troubling exploits (obviously they also ship without using user
namespaces *sigh*).

But requiring a new cmdline option might dissuade people from switching.
If there was a way to fix the underlying restriction on pivot_root(2),
I'd be much happier with that as a solution.

Thanks.

[1]: https://github.com/kubernetes/minikube/issues/3512
Aleksa Sarai April 1, 2020, 6:38 a.m. UTC | #2
On 2020-04-01, Aleksa Sarai <cyphar@cyphar.com> wrote:
> [...]

(I forgot to add the kernel containers ML to Cc.)
Ignat Korchagin April 1, 2020, 9:30 a.m. UTC | #3
On Wed, Apr 1, 2020 at 7:38 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> [...]

In my opinion, we just did not expect pivot_root to become so popular with
containers, nor did we expect people to run full stateless systems from
initramfs rather than immediately switching to another root filesystem
on boot. These all feel to me like use cases which were not considered
before for the pivot_root+initramfs combo.

However, we now see more and more cases needing this, and the
boilerplate code and the additional memory copying needed to handle it
from userspace (and sometimes security issues like the ones you
mentioned) are becoming too much. I understand the simplicity reasons
described in [1] ("You can't unmount rootfs for approximately the same
reason you can't kill the init process..."), but to keep this
simplicity while also supporting the new containerised Linux world, the
kernel should give us a hand.

I currently see no reason why we can't apply this patch without the
cmdline conditional, because we would just end up in the same place as
if we had used initrd instead of initramfs. But I leave the decision
to the subsystem maintainers. After all, if you are running from
initramfs, this is a stateless system, so I would expect the
maintainers of such a system to have an easy way to add the cmdline
parameter on reboot.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt

Ignat
Marek Majkowski April 1, 2020, 10:09 a.m. UTC | #4
> However, we now see more and more cases needing this, and the
> boilerplate code and the additional memory copying needed to handle it
> from userspace (and sometimes security issues like the ones you
> mentioned) are becoming too much. I understand the simplicity reasons
> described in [1] ("You can't unmount rootfs for approximately the same
> reason you can't kill the init process..."), but to keep this
> simplicity while also supporting the new containerised Linux world, the
> kernel should give us a hand.

"You can't unmount rootfs for approximately the same reason you can't
kill the init process"

Pardon my ignorance, but this explanation in the docs never made any
sense to me. Rootfs is pretty much the same as tmpfs. I don't understand
why we can't do pivot_root on it and why we can't unmount it later. I
must be missing some context. Can someone explain what the reason is
for rootfs to be restricted like that? Perhaps we could just relax
the rootfs limits...

Marek
Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..720fd3ee9f8a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3192,11 +3192,18 @@ 
 	nomfgpt		[X86-32] Disable Multi-Function General Purpose
 			Timer usage (for AMD Geode machines).
 
+	nomodule	Disable module load
+
 	nonmi_ipi	[X86] Disable using NMI IPIs during panic/reboot to
 			shutdown the other cpus.  Instead use the REBOOT_VECTOR
 			irq.
 
-	nomodule	Disable module load
+	nonroot_initramfs
+			[KNL] Create an additional tmpfs filesystem under rootfs
+			and unpack initramfs there instead of the rootfs itself.
+			This is useful for stateless systems, which run directly
+			from initramfs, create mount namespaces and use
+			"pivot_root" system call.
 
 	nopat		[X86] Disable PAT (page attribute table extension of
 			pagetables) support.
diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..a1ec862e8146 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3701,6 +3701,49 @@  static void __init init_mount_tree(void)
 	set_fs_root(current->fs, &root);
 }
 
+#if IS_ENABLED(CONFIG_TMPFS)
+static int __initdata nonroot_initramfs;
+
+static int __init nonroot_initramfs_param(char *str)
+{
+	if (*str)
+		return 0;
+	nonroot_initramfs = 1;
+	return 1;
+}
+__setup("nonroot_initramfs", nonroot_initramfs_param);
+
+static void __init init_nonroot_initramfs(void)
+{
+	int err;
+
+	if (!nonroot_initramfs)
+		return;
+
+	err = ksys_mkdir("/root", 0700);
+	if (err < 0)
+		goto out;
+
+	err = do_mount("tmpfs", "/root", "tmpfs", 0, NULL);
+	if (err)
+		goto out;
+
+	err = ksys_chdir("/root");
+	if (err)
+		goto out;
+
+	err = do_mount(".", "/", NULL, MS_MOVE, NULL);
+	if (err)
+		goto out;
+
+	err = ksys_chroot(".");
+	if (!err)
+		return;
+out:
+	pr_warn("Failed to create a non-root filesystem for initramfs\n");
+}
+#endif /* IS_ENABLED(CONFIG_TMPFS) */
+
 void __init mnt_init(void)
 {
 	int err;
@@ -3734,6 +3777,10 @@  void __init mnt_init(void)
 	shmem_init();
 	init_rootfs();
 	init_mount_tree();
+
+#if IS_ENABLED(CONFIG_TMPFS)
+	init_nonroot_initramfs();
+#endif
 }
 
 void put_mnt_ns(struct mnt_namespace *ns)