diff mbox series

[2/3] init/do_cmounts.c: introduce 'user_root' for initramfs

Message ID 20210522113155.244796-3-dong.menglong@zte.com.cn (mailing list archive)
State New, archived
Headers show
Series init/initramfs.c: make initramfs support pivot_root | expand

Commit Message

Menglong Dong May 22, 2021, 11:31 a.m. UTC
From: Menglong Dong <dong.menglong@zte.com.cn>

During the kernel initialization, the root of the mount tree is
created with file system type of ramfs or tmpfs.

While using initramfs as the root file system, cpio file is unpacked
into the rootfs. Thus, this rootfs is exactly what users see in user
space, and some problems arose: this rootfs has no parent mount,
which make it can't be umounted or pivot_root.

'pivot_root' is used to change the rootfs and clean the old mounts,
and it is essential for some users, such as docker. Docker use
'pivot_root' to change the root fs of a process if the current root
fs is a block device of initrd. However, when it comes to initramfs,
things is different: docker has to use 'chroot()' to change the root
fs, as 'pivot_root()' is not supported in initramfs.

The usage of 'chroot()' to create root fs for a container introduced
a lot problems.

First, 'chroot()' can't clean the old mountpoints which inherited
from the host. It means that the mountpoints in host will have a
duplicate in every container. Users may remove a USB after it
is umounted successfully in the host. However, the USB may still
be mounted in containers, although it can't be seen by the 'mount'
commond. This means the USB is not released yet, and data may not
write back. Therefore, data lose arise.

Second, net-namespace leak is another problem. The net-namespace
of containers will be mounted in /var/run/docker/netns/ in host
by dockerd. It means that the net-namespace of a container will
be mounted in containers which are created after it. Things
become worse now, as the net-namespace can't be remove after
the destroy of that container, as it is still mounted in other
containers. If users want to recreate that container, he will
fail if a certain mac address is to be binded with the container,
as it is not release yet.

In this patch, a second mount, which is called 'user root', is
created before 'cpio' unpacking. The file system of 'user root'
is exactly the same as rootfs, and both ramfs and tmpfs are
supported. Then, the 'cpio' is unpacked into the 'user root'.
Now, the rootfs has a parent mount, and pivot_root() will be
supported for initramfs.

What's more, after this patch, 'rootflags' in boot cmd is supported
by initramfs. Therefore, users can set the size of tmpfs with
'rootflags=size=1024M'.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
---
 init/do_mounts.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
 init/do_mounts.h |  7 ++++-
 init/initramfs.c | 10 +++++++
 3 files changed, 88 insertions(+), 1 deletion(-)

Comments

Luis Chamberlain May 25, 2021, 12:44 a.m. UTC | #1
Cc'ing Josh as I think he might be interested in this.

On Sat, May 22, 2021 at 07:31:54PM +0800, menglong8.dong@gmail.com wrote:
> From: Menglong Dong <dong.menglong@zte.com.cn>
> 
> During the kernel initialization, the root of the mount tree is
> created with file system type of ramfs or tmpfs.

ramfs (initrd) 

> While using initramfs as the root file system, cpio file is unpacked
> into the rootfs. Thus, this rootfs is exactly what users see in user
> space, and some problems arose: this rootfs has no parent mount,
> which make it can't be umounted or pivot_root.
> 'pivot_root' is used to change the rootfs and clean the old mounts,
> and it is essential for some users, such as docker. Docker use
> 'pivot_root' to change the root fs of a process if the current root
> fs is a block device of initrd. However, when it comes to initramfs,
> things is different: docker has to use 'chroot()' to change the root
> fs, as 'pivot_root()' is not supported in initramfs.
> 
> The usage of 'chroot()' to create root fs for a container introduced
> a lot problems.
> 
> First, 'chroot()' can't clean the old mountpoints which inherited
> from the host. It means that the mountpoints in host will have a
> duplicate in every container. Users may remove a USB after it
> is umounted successfully in the host. However, the USB may still
> be mounted in containers, although it can't be seen by the 'mount'
> commond. This means the USB is not released yet, and data may not
> write back. Therefore, data lose arise.
>
> Second, net-namespace leak is another problem. The net-namespace
> of containers will be mounted in /var/run/docker/netns/ in host
> by dockerd. It means that the net-namespace of a container will
> be mounted in containers which are created after it. Things
> become worse now, as the net-namespace can't be remove after
> the destroy of that container, as it is still mounted in other
> containers. If users want to recreate that container, he will
> fail if a certain mac address is to be binded with the container,
> as it is not release yet.

I think you can clarify this a bit more with:

  If using container platforms such as Docker, upon initialization it
  wants to use pivot_root() so that currently mounted devices do not
  propagate to containers. An example of value in this is that
  a USB device connected prior to the creation of a containers on the
  host gets disconnected after a container is created; if the
  USB device was mounted on containers, but already removed and
  umounted on the host, the mount point will not go away untill all
  containers unmount the USB device.

  Another reason for container platforms such as Docker to use pivot_root
  is that upon initialization the net-namspace is mounted under
  /var/run/docker/netns/ on the host by dockerd. Without pivot_root
  Docker must either wait to create the network namespace prior to
  the creation of containers or simply deal with leaking this to each
  container.

  pivot_root is supported if the rootfs is a ramfs (initrd), but its
  not supported if the rootfs uses an initramfs (tmpfs). This means
  container platforms today must resort to using ramfs (initrd) if
  they want to pivot_root from the rootfs. A workaround to use chroot()
  is not a clean viable option given every container will have a
  duplicate of every mount point on the host.

  In order to support using container platforms such as Docker on
  all the supported rootfs types we must extend Linux to support
  pivot_root on initramfs as well. This patch does the work to do
  just that.

So remind me.. so it would seem that if the rootfs uses a ramfs (initrd)
that pivot_root works just fine. Why is that? Did someone add support
for that? Has that always been the case that it works? If not, was it a
consequence of how ramfs (initrd) works?

And finally, why can't we share the same mechanism used for ramfs
(initrd) for initramfs (tmpfs)?

> In this patch, a second mount, which is called 'user root', is
> created before 'cpio' unpacking. The file system of 'user root'
> is exactly the same as rootfs, and both ramfs and tmpfs are
> supported. Then, the 'cpio' is unpacked into the 'user root'.
> Now, the rootfs has a parent mount, and pivot_root() will be
> supported for initramfs.

How about something like:

  In order to support pivot_root on initramfs we introduce a second
  "user_root" mount  which is created before we do the cpio unpacking.
  The filesystem of the "user_root" mount is the same the rootfs.

It begs the question, why add this infrastructure to suppor this for
ramfs (initrd) if we only need this hack for initramfs (tmpfs)?

> What's more, after this patch, 'rootflags' in boot cmd is supported
> by initramfs. Therefore, users can set the size of tmpfs with
> 'rootflags=size=1024M'.

Why is that exactly?

> Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
> ---
>  init/do_mounts.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
>  init/do_mounts.h |  7 ++++-
>  init/initramfs.c | 10 +++++++
>  3 files changed, 88 insertions(+), 1 deletion(-)
> 
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index a78e44ee6adb..943cb58846fb 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -617,6 +617,78 @@ void __init prepare_namespace(void)
>  	init_chroot(".");
>  }
>  
> +#ifdef CONFIG_TMPFS
> +static __init bool is_tmpfs_enabled(void)
> +{
> +	return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
> +	       !saved_root_name[0];
> +}
> +#endif
> +
> +static __init bool is_ramfs_enabled(void)
> +{
> +	return true;
> +}
> +
> +struct fs_user_root {
> +	bool (*enabled)(void);
> +	char *dev_name;

What's the point of dev_name if its never set?

> +	char *fs_name;
> +};
> +
> +static struct fs_user_root user_roots[] __initdata = {
> +#ifdef CONFIG_TMPFS
> +	{.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
> +#endif
> +	{.fs_name = "ramfs", .enabled = is_ramfs_enabled }
> +};
> +static struct fs_user_root * __initdata user_root;
> +
> +/* Mount the user_root on '/'. */
> +int __init mount_user_root(void)
> +{
> +	return do_mount_root(user_root->dev_name,

See, isn't dev_name here always NULL?

> +			     user_root->fs_name,
> +			     root_mountflags & ~MS_RDONLY,
> +			     root_mount_data);
> +}
> +
> +/*
> + * This function is used to chroot to new initramfs root that
> + * we unpacked on success.

Might be a good place to document that we do this so folks can
pivot_root on rootfs, and why that is desirable (mentioned above on the
commit log edits I suggested). Otherwise I don't think its easy for a
reader of the code to understand why we are doing all this work.

> It will chdir to '/' and umount
> + * the secound mount on failure.
> + */
> +void __init end_mount_user_root(bool succeed)
> +{
> +	if (!succeed)
> +		goto on_failed;
> +
> +	if (!ramdisk_exec_exist(false))
> +		goto on_failed;
> +
> +	init_mount(".", "/", NULL, MS_MOVE, NULL);
> +	init_chroot(".");
> +	return;
> +
> +on_failed:
> +	init_chdir("/");
> +	init_umount("/..", 0);
> +}

Is anything extra needed on shutdown / reboot?

  Luis
Menglong Dong May 25, 2021, 3:28 a.m. UTC | #2
On Tue, May 25, 2021 at 8:44 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Cc'ing Josh as I think he might be interested in this.
>
......
>
> I think you can clarify this a bit more with:
>
>   If using container platforms such as Docker, upon initialization it
>   wants to use pivot_root() so that currently mounted devices do not
>   propagate to containers. An example of value in this is that
>   a USB device connected prior to the creation of a containers on the
>   host gets disconnected after a container is created; if the
>   USB device was mounted on containers, but already removed and
>   umounted on the host, the mount point will not go away untill all
>   containers unmount the USB device.

Thanks! It's really difficult for me to organize these words.

>
> So remind me.. so it would seem that if the rootfs uses a ramfs (initrd)
> that pivot_root works just fine. Why is that? Did someone add support
> for that? Has that always been the case that it works? If not, was it a
> consequence of how ramfs (initrd) works?
>
> And finally, why can't we share the same mechanism used for ramfs
> (initrd) for initramfs (tmpfs)?

In fact, initrd is totally different from initramfs. Initrd is not using
ramfs, it actually is a block fs, which is mounted on the first mount.
And initramfs can use ramfs or tmpfs.

During pivot_root, the mount of the root will be unmounted from its parent
mount. Initrd or block device fs has a parent mount, which is the first mount.
However, initramfs doesn't has a parent mount, because the first mount is
actually the root, which cpio is unpacked to.

The first mount is used by init_task, and I think it can't be unmounted,
because it is used by the kernel.

So the primary cause that pivot_root doesn't support is that it use
the first mount as its root.

>
> > What's more, after this patch, 'rootflags' in boot cmd is supported
> > by initramfs. Therefore, users can set the size of tmpfs with
> > 'rootflags=size=1024M'.
>
> Why is that exactly?

During the mount of user_mount, I passed root_mountflags and root_mount_data
to do_mount_root(), which make 'rootflags' works for 'user root'.

> > +
> > +struct fs_user_root {
> > +    bool (*enabled)(void);
> > +    char *dev_name;
>
> What's the point of dev_name if its never set?

Seems it's better to make it be set, I'll do it.


>
> Might be a good place to document that we do this so folks can
> pivot_root on rootfs, and why that is desirable (mentioned above on the
> commit log edits I suggested). Otherwise I don't think its easy for a
> reader of the code to understand why we are doing all this work.
>

Ok, sounds nice!

>
> Is anything extra needed on shutdown / reboot?
>

I'm not sure, seems no. The way I create 'user root' is exactly the same
as a block root fs does.

Thanks!
Menglong Dong
diff mbox series

Patch

diff --git a/init/do_mounts.c b/init/do_mounts.c
index a78e44ee6adb..943cb58846fb 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -617,6 +617,78 @@  void __init prepare_namespace(void)
 	init_chroot(".");
 }
 
+#ifdef CONFIG_TMPFS
+static __init bool is_tmpfs_enabled(void)
+{
+	return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
+	       !saved_root_name[0];
+}
+#endif
+
+static __init bool is_ramfs_enabled(void)
+{
+	return true;
+}
+
+struct fs_user_root {
+	bool (*enabled)(void);
+	char *dev_name;
+	char *fs_name;
+};
+
+static struct fs_user_root user_roots[] __initdata = {
+#ifdef CONFIG_TMPFS
+	{.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
+#endif
+	{.fs_name = "ramfs", .enabled = is_ramfs_enabled }
+};
+static struct fs_user_root * __initdata user_root;
+
+/* Mount the user_root on '/'. */
+int __init mount_user_root(void)
+{
+	return do_mount_root(user_root->dev_name,
+			     user_root->fs_name,
+			     root_mountflags & ~MS_RDONLY,
+			     root_mount_data);
+}
+
+/*
+ * This function is used to chroot to new initramfs root that
+ * we unpacked on success. It will chdir to '/' and umount
+ * the secound mount on failure.
+ */
+void __init end_mount_user_root(bool succeed)
+{
+	if (!succeed)
+		goto on_failed;
+
+	if (!ramdisk_exec_exist(false))
+		goto on_failed;
+
+	init_mount(".", "/", NULL, MS_MOVE, NULL);
+	init_chroot(".");
+	return;
+
+on_failed:
+	init_chdir("/");
+	init_umount("/..", 0);
+}
+
+void __init init_user_rootfs(void)
+{
+	struct fs_user_root *root;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(user_roots); i++) {
+		root = &user_roots[i];
+		if (root->enabled()) {
+			user_root = root;
+			break;
+		}
+	}
+}
+
 static bool is_tmpfs;
 static int rootfs_init_fs_context(struct fs_context *fc)
 {
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 7a29ac3e427b..92c004bdd320 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -10,9 +10,14 @@ 
 #include <linux/root_dev.h>
 #include <linux/init_syscalls.h>
 
+extern int root_mountflags;
+
 void  mount_block_root(char *name, int flags);
 void  mount_root(void);
-extern int root_mountflags;
+int   mount_user_root(void);
+void  end_mount_user_root(bool succeed);
+void  init_user_rootfs(void);
+bool  ramdisk_exec_exist(bool abs);
 
 static inline __init int create_dev(char *name, dev_t dev)
 {
diff --git a/init/initramfs.c b/init/initramfs.c
index af27abc59643..ffa78932ae65 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -16,6 +16,8 @@ 
 #include <linux/namei.h>
 #include <linux/init_syscalls.h>
 
+#include "do_mounts.h"
+
 static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
 		loff_t *pos)
 {
@@ -682,15 +684,23 @@  static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 	else
 		printk(KERN_INFO "Unpacking initramfs...\n");
 
+	init_user_rootfs();
+
+	if (mount_user_root())
+		panic("Failed to create user root");
+
 	err = unpack_to_rootfs((char *)initrd_start, initrd_end - initrd_start);
 	if (err) {
+		end_mount_user_root(false);
 #ifdef CONFIG_BLK_DEV_RAM
 		populate_initrd_image(err);
 #else
 		printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
 #endif
+		goto done;
 	}
 
+	end_mount_user_root(true);
 done:
 	/*
 	 * If the initrd region is overlapped with crashkernel reserved region,