
[RESEND] init/initramfs.c: make initramfs support pivot_root

Message ID 20210520154244.20209-1-dong.menglong@zte.com.cn (mailing list archive)
State New, archived
Series [RESEND] init/initramfs.c: make initramfs support pivot_root

Commit Message

Menglong Dong May 20, 2021, 3:42 p.m. UTC
From: Menglong Dong <dong.menglong@zte.com.cn>

During kernel initialization, the kernel's mount tree is created and
'rootfs' is mounted as the root file system.

When initramfs is used as the root file system, the cpio archive is
unpacked directly into this rootfs. Thus, this rootfs is exactly what
users see in user space, which causes a problem: it has no parent
mount, so it can neither be unmounted nor switched away from with
'pivot_root()'.

'pivot_root' is used to change the root file system and clean up the
old mount points, and it is essential for some users, such as Docker.
Docker uses 'pivot_root' to change the root fs of a process if the
current root fs is a block device or an initrd. However, with
initramfs things are different: Docker has to use 'chroot()' to change
the root fs, as 'pivot_root()' is not supported on initramfs.
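
For reference, the 'pivot_root()' sequence a container runtime
typically performs looks roughly like this (a sketch based on the
pivot_root(2) man page, not Docker's actual code):

	#include <sys/mount.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int switch_root(const char *new_root)
	{
		/* new_root must be a mount point, so bind-mount it onto itself */
		if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
			return -1;
		if (chdir(new_root) < 0)
			return -1;
		/* new_root == put_old: the old root is stacked at "/" */
		if (syscall(SYS_pivot_root, ".", ".") < 0)
			return -1;
		/* lazily detach the old root and the mounts inherited from the host */
		if (umount2(".", MNT_DETACH) < 0)
			return -1;
		return chdir("/");
	}

With the initial rootfs as the current root, the pivot_root() call
fails, which is why Docker has to fall back to 'chroot()'.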

Using 'chroot()' to create the root fs for a container introduces a
lot of problems.

First, 'chroot()' can't clean up the old mount points inherited from
the host, so every mount point on the host gets a duplicate in every
container. Imagine 100 containers on a host: there will be 100
duplicates of every mount point, which makes resource release an
issue. A user may remove a USB drive after unmounting it successfully
on the host, but the drive may still be mounted in the containers,
even though it is not shown by the 'mount' command there. This means
the drive has not been released and its data may not have been written
back, so data loss can occur.

Second, network-namespace leaks are another problem. dockerd mounts
the network namespace of each container under /var/run/docker/netns/
on the host, so the namespace of one container ends up mounted in
every container created after it. Things get worse: that namespace
cannot be removed after its container is destroyed, because it is
still mounted in the other containers. If the user then tries to
recreate the container with a specific MAC address bound to it, the
operation fails because the address has not been released yet.

Maybe dockerd could unmount the unnecessary mount points inherited
from the host before calling 'chroot()', but that is not a graceful
solution. I think the best way is to make 'pivot_root()' work on
initramfs.

After this patch, 'pivot_root()' works on initramfs. A new rootfs is
created and mounted into the mount tree before the cpio archive is
unpacked, so the rootfs used by user space has a parent mount and can
be used with 'pivot_root()'.
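
In short, with this patch do_populate_rootfs() roughly does the
following (see the diff below for the details):

	mount_user_root();                         /* mount a ramfs/tmpfs on /root and chdir there */
	unpack_to_rootfs(...);                     /* unpack the cpio archive into it */
	init_mount(".", "/", NULL, MS_MOVE, NULL); /* move the new root over "/" */
	init_chroot(".");                          /* and make it the root */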

What's more, after this patch initramfs honors 'rootflags' on the boot
command line, so users can set the size of the tmpfs with
'rootflags=size=1024M'.
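
For example, with CONFIG_TMPFS enabled and no 'root=' parameter,
booting with:

	rootflags=size=1024M

caps the tmpfs that the initramfs is unpacked into at 1024M.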

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
---
 init/do_mounts.c | 53 +++++++++++++++++++++++++++++++++++++++---------
 init/do_mounts.h |  1 +
 init/initramfs.c | 32 +++++++++++++++++++++++++++++
 init/main.c      | 17 +++++++++++++++-
 4 files changed, 92 insertions(+), 11 deletions(-)

Comments

Luis Chamberlain May 20, 2021, 9:41 p.m. UTC | #1
On Thu, May 20, 2021 at 11:42:44PM +0800, menglong8.dong@gmail.com wrote:
> From: Menglong Dong <dong.menglong@zte.com.cn>
> 
> During kernel initialization, the kernel's mount tree is created and
> 'rootfs' is mounted as the root file system.
>
> When initramfs is used as the root file system, the cpio archive is
> unpacked directly into this rootfs. Thus, this rootfs is exactly what
> users see in user space, which causes a problem: it has no parent
> mount, so it can neither be unmounted nor switched away from with
> 'pivot_root()'.
>
> 'pivot_root' is used to change the root file system and clean up the
> old mount points, and it is essential for some users, such as Docker.
> Docker uses 'pivot_root' to change the root fs of a process if the
> current root fs is a block device or an initrd. However, with
> initramfs things are different: Docker has to use 'chroot()' to change
> the root fs, as 'pivot_root()' is not supported on initramfs.

OK so this seems to be a long winded way of saying you can't efficiently
use containers today when using initramfs. And then you explain why.

> Using 'chroot()' to create the root fs for a container introduces a
> lot of problems.
>
> First, 'chroot()' can't clean up the old mount points inherited from
> the host, so every mount point on the host gets a duplicate in every
> container. Imagine 100 containers on a host: there will be 100
> duplicates of every mount point, which makes resource release an
> issue. A user may remove a USB drive after unmounting it successfully
> on the host, but the drive may still be mounted in the containers,
> even though it is not shown by the 'mount' command there. This means
> the drive has not been released and its data may not have been written
> back, so data loss can occur.
>
> Second, network-namespace leaks are another problem. dockerd mounts
> the network namespace of each container under /var/run/docker/netns/
> on the host, so the namespace of one container ends up mounted in
> every container created after it. Things get worse: that namespace
> cannot be removed after its container is destroyed, because it is
> still mounted in the other containers. If the user then tries to
> recreate the container with a specific MAC address bound to it, the
> operation fails because the address has not been released yet.

That seems like a chicken and egg problem on Docker, in that...

> Maybe dockerd could unmount the unnecessary mount points inherited
> from the host before calling 'chroot()',

Can't docker instead allow containers to be created prior to creating
your local docker network namespace? Not that it's a great solution,
but it is worth noting.

> but that is not a graceful solution. I think the best way is to make
> 'pivot_root()' work on initramfs.
>
> After this patch, 'pivot_root()' works on initramfs. A new rootfs is
> created and mounted into the mount tree before the cpio archive is
> unpacked, so the rootfs used by user space has a parent mount and can
> be used with 'pivot_root()'.
>
> What's more, after this patch initramfs honors 'rootflags' on the boot
> command line, so users can set the size of the tmpfs with
> 'rootflags=size=1024M'.
> 
> Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
> ---
>  init/do_mounts.c | 53 +++++++++++++++++++++++++++++++++++++++---------
>  init/do_mounts.h |  1 +
>  init/initramfs.c | 32 +++++++++++++++++++++++++++++
>  init/main.c      | 17 +++++++++++++++-
>  4 files changed, 92 insertions(+), 11 deletions(-)
> 
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index a78e44ee6adb..a156b0d28b43 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -459,7 +459,7 @@ void __init mount_block_root(char *name, int flags)
>  out:
>  	put_page(page);
>  }
> - 
> +
>  #ifdef CONFIG_ROOT_NFS
>  
>  #define NFSROOT_TIMEOUT_MIN	5
> @@ -617,24 +617,57 @@ void __init prepare_namespace(void)
>  	init_chroot(".");
>  }
>  
> -static bool is_tmpfs;
> -static int rootfs_init_fs_context(struct fs_context *fc)
> +#ifdef CONFIG_TMPFS
> +static __init bool is_tmpfs_enabled(void)
> +{
> +	return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
> +	       !saved_root_name[0];
> +}
> +#endif
> +
> +static __init bool is_ramfs_enabled(void)
>  {
> -	if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
> -		return shmem_init_fs_context(fc);
> +	return true;
> +}
> +
> +struct fs_user_root {
> +	bool (*enabled)(void);
> +	char *dev_name;
> +	char *fs_name;
> +};
>  
> -	return ramfs_init_fs_context(fc);
> +static struct fs_user_root user_roots[] __initdata = {
> +#ifdef CONFIG_TMPFS
> +	{.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
> +#endif
> +	{.fs_name = "ramfs", .enabled = is_ramfs_enabled }
> +};
> +static struct fs_user_root * __initdata user_root;
> +
> +int __init mount_user_root(void)
> +{
> +	return do_mount_root(user_root->dev_name,
> +			     user_root->fs_name,
> +			     root_mountflags,
> +			     root_mount_data);
>  }
>  
>  struct file_system_type rootfs_fs_type = {
>  	.name		= "rootfs",
> -	.init_fs_context = rootfs_init_fs_context,
> +	.init_fs_context = ramfs_init_fs_context,

Why is this always static now? Why is it now correct
for init_mount_tree() to always use the ramfs context?

  Luis
Menglong Dong May 21, 2021, 12:41 a.m. UTC | #2
Hello!

Thanks for your reply!

On Fri, May 21, 2021 at 5:41 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Can't docker instead allow to create containers prior to creating
> your local docker network namespace? Not that its a great solution,
> but just worth noting.
>

That's a solution, but I don't think it is feasible. Users may create many
containers, and you can't make docker create all the containers first
and create network namespace later, as you don't know if there are any
containers to create later.

> >
> >  struct file_system_type rootfs_fs_type = {
> >       .name           = "rootfs",
> > -     .init_fs_context = rootfs_init_fs_context,
> > +     .init_fs_context = ramfs_init_fs_context,
>
> Why is this always static now? Why is that its correct
> now for init_mount_tree() always to use the ramfs context?

Because the root mount in init_mount_tree() is not used as the rootfs any
more. In do_populate_rootfs(), I mount a second rootfs, which can be ramfs
or tmpfs, and that's the real rootfs for initramfs. I call this root the
'user_root' because it is created for user space.

int __init mount_user_root(void)
{
       return do_mount_root(user_root->dev_name,
                            user_root->fs_name,
                            root_mountflags,
                            root_mount_data);
 }

In other words, I moved the creation of the real rootfs into
do_populate_rootfs(), and fixed 'rootfs_fs_type' to
ramfs_init_fs_context, as it is only a fake root now.

Now the rootfs that user space uses is separate from the rootfs mount
that the init_task starts on, which is exactly how a block root file
system behaves.
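
Roughly, the mount hierarchy with this patch looks like:

    rootfs  (rootfs_fs_type, now always ramfs)   <- init_mount_tree()
    `-- /root  ('user_root', ramfs or tmpfs)     <- mount_user_root()

The cpio is unpacked into '/root', which is then moved to '/' and
chroot'ed into by end_mount_user_root().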

Thanks!
Menglong Dong
Luis Chamberlain May 21, 2021, 3:50 p.m. UTC | #3
On Fri, May 21, 2021 at 08:41:55AM +0800, Menglong Dong wrote:
> Hello!
> 
> Thanks for your reply!
> 
> On Fri, May 21, 2021 at 5:41 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Can't docker instead allow to create containers prior to creating
> > your local docker network namespace? Not that its a great solution,
> > but just worth noting.
> >
> 
> That's a solution, but I don't think it is feasible. Users may create many
> containers, and you can't make docker create all the containers first
> and create network namespace later, as you don't know if there are any
> containers to create later.

It doesn't seem impossible, but it is worth noting inside the commit log
why this was not the preferred option.

> > >  struct file_system_type rootfs_fs_type = {
> > >       .name           = "rootfs",
> > > -     .init_fs_context = rootfs_init_fs_context,
> > > +     .init_fs_context = ramfs_init_fs_context,
> >
> > Why is this always static now? Why is that its correct
> > now for init_mount_tree() always to use the ramfs context?
> 
> Because the root mount in init_mount_tree() is not used as rootfs any more.

We still have:

start_kernel() --> vfs_caches_init() --> mnt_init() --> 

mnt_init()
{
	...
	shmem_init();                                                           
	init_rootfs();                                                          
	init_mount_tree(); 
}

You've now modified init_rootfs() to essentially just set the new user_root,
and that's it. But we still call init_mount_tree() even if we did set the
rootfs to use tmpfs.

> In do_populate_rootfs(), I mount a second rootfs, which can be ramfs
> or tmpfs, and that's the real rootfs for initramfs. I call this root
> the 'user_root' because it is created for user space.
> 
> int __init mount_user_root(void)
> {
>        return do_mount_root(user_root->dev_name,
>                             user_root->fs_name,
>                             root_mountflags,
>                             root_mount_data);
>  }
> 
> In other words, I moved the realization of 'rootfs_fs_type' here to
> do_populate_rootfs(), and fixed this 'rootfs_fs_type' with
> ramfs_init_fs_context, as it is a fake root now.

do_populate_rootfs() is called from populate_rootfs() and that in turn
is a:

rootfs_initcall(populate_rootfs);

In fact the latest changes have made this run asynchronously as
well. And so indeed, init_mount_tree() always kicks off first. So it's
still unclear to me why the first mount now always has a fs context of
ramfs_init_fs_context, even if we did not care for a ramdisk.

Are you suggesting it can be arbitrary now?

> Now, the rootfs that user space used is separated with the init_task,
> and that's exactly what a block root file system does.

The secondary effort is a bit clearer; it's the earlier part that is not
so clear to me yet, at least.

Regardless, to help make the changes easier to review, I wonder if it
makes sense to split up your work into a few patches. First do what you
have done for init_rootfs() and the structure you added to replace the
is_tmpfs bool, and let initialization use it and the context. And then
as a second patch introduce the second mount effort.

  Luis
Menglong Dong May 22, 2021, 4:09 a.m. UTC | #4
On Fri, May 21, 2021 at 11:50 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> > That's a solution, but I don't think it is feasible. Users may create many
> > containers, and you can't make docker create all the containers first
> > and create network namespace later, as you don't know if there are any
> > containers to create later.
>
> It doesn't seem impossible, but worth noting why inside the commit log
> this was not a preferred option.
>

In fact, the network namespace is just one example of the problems that
the 'mount leak' causes. And this kind of modification is not friendly to
current docker users; it would change the way docker is used quite a lot.

>
> We still have:
>
> start_kernel() --> vfs_caches_init() --> mnt_init() -->
>
> mnt_init()
> {
>         ...
>         shmem_init();
>         init_rootfs();
>         init_mount_tree();
> }
>
> You've now modified init_rootfs() to essentially just set the new user_root,
> and that's it. But we stil call init_mount_tree() even if we did set the
> rootfs to use tmpfs.

The variable 'is_tmpfs' is only used in 'rootfs_init_fs_context'. I use
ramfs_init_fs_context directly for rootfs, so it is not needed any more,
and I just removed it in init_rootfs().

The 'user_root' initialized in init_rootfs() is used in
do_populate_rootfs -> mount_user_root, which sets the file system
(ramfs or tmpfs) of the second mount.

It seems it's not really suitable to place that in init_rootfs(), though...


> > In do_populate_rootfs(), I mount a second rootfs, which can be ramfs
> > or tmpfs, and that's the real rootfs for initramfs. I call this root
> > the 'user_root' because it is created for user space.
> >
> > int __init mount_user_root(void)
> > {
> >        return do_mount_root(user_root->dev_name,
> >                             user_root->fs_name,
> >                             root_mountflags,
> >                             root_mount_data);
> >  }
> >
> > In other words, I moved the realization of 'rootfs_fs_type' here to
> > do_populate_rootfs(), and fixed this 'rootfs_fs_type' with
> > ramfs_init_fs_context, as it is a fake root now.
>
> do_populate_rootfs() is called from populate_rootfs() and that in turn
> is a:
>
> rootfs_initcall(populate_rootfs);
>
> In fact the latest changes have made this to schedule asynchronously as
> well. And so indeed, init_mount_tree() always kicks off first. So its
> still unclear to me why the first mount now always has a fs context of
> ramfs_init_fs_context, even if we did not care for a ramdisk.
>
> Are you suggesting it can be arbitrary now?

With the existence of the new user_root, the first mount is not directly used
any more, so its filesystem type doesn't matter.

So it makes no sense to make the file system of the first mount selectable.
To simplify the code here, I use ramfs_init_fs_context directly. In fact,
it would be fine to use shmem_init_fs_context here too.

> > Now, the rootfs that user space used is separated with the init_task,
> > and that's exactly what a block root file system does.
>
> The secondary effort is a bit clearer, its the earlier part that is not
> so clear yet to me at least.
>
> Regardless, to help make the changes easier to review, I wonder if it
> makes sense to split up your work into a few patches. First do what you
> have done for init_rootfs() and the structure you added to replace the
> is_tmpfs bool, and let initialization use it and the context. And then
> as a second patch introduce the second mount effort.

I think it is a good idea. I will reorganize the code and split it into
several patches.

Thanks!
Menglong Dong
Luis Chamberlain May 24, 2021, 10:58 p.m. UTC | #5
On Sat, May 22, 2021 at 12:09:30PM +0800, Menglong Dong wrote:
> On Fri, May 21, 2021 at 11:50 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > > That's a solution, but I don't think it is feasible. Users may create many
> > > containers, and you can't make docker create all the containers first
> > > and create network namespace later, as you don't know if there are any
> > > containers to create later.
> >
> > It doesn't seem impossible, but worth noting why inside the commit log
> > this was not a preferred option.
> >
> 
> In fact, the network namespace is just a case for the problem that the
> 'mount leak' caused. And this kind modification is not friendly to
> current docker users, it makes great changes to the usage of docker.

You mean an upgrade of docker? If so... that does not seem like a
definitive reason to do something new in the kernel *always*.

However, if you introduce it as a kconfig option so that users
who want this new feature can enable it and then use it,
then it's sold as a new feature.

Should this always be enabled, or done this way? Should we never have
the option to revert back to the old behaviour? If not, why not?

> > We still have:
> >
> > start_kernel() --> vfs_caches_init() --> mnt_init() -->
> >
> > mnt_init()
> > {
> >         ...
> >         shmem_init();
> >         init_rootfs();
> >         init_mount_tree();
> > }
> >
> > You've now modified init_rootfs() to essentially just set the new user_root,
> > and that's it. But we stil call init_mount_tree() even if we did set the
> > rootfs to use tmpfs.
> 
> The variate of 'is_tmpfs' is only used in 'rootfs_init_fs_context'. I used
> ramfs_init_fs_context directly for rootfs,

I don't see you using any context directly; where are you specifying the
context directly?

> so it is not needed any more
> and I just removed it in init_rootfs().
>
> The initialization of 'user_root' in init_rootfs() is used in:
> do_populate_rootfs -> mount_user_root, which set the file system(
> ramfs or tmpfs) of the second mount.
> 
> Seems it's not suitable to place it in init_rootfs()......

OK I think I just need to understand how you added the context of the
first mount explicitly now and where, as I don't see it.

> > > In do_populate_rootfs(), I mount a second rootfs, which can be ramfs
> > > or tmpfs, and that's the real rootfs for initramfs. I call this root
> > > the 'user_root' because it is created for user space.
> > >
> > > int __init mount_user_root(void)
> > > {
> > >        return do_mount_root(user_root->dev_name,
> > >                             user_root->fs_name,
> > >                             root_mountflags,
> > >                             root_mount_data);
> > >  }
> > >
> > > In other words, I moved the realization of 'rootfs_fs_type' here to
> > > do_populate_rootfs(), and fixed this 'rootfs_fs_type' with
> > > ramfs_init_fs_context, as it is a fake root now.
> >
> > do_populate_rootfs() is called from populate_rootfs() and that in turn
> > is a:
> >
> > rootfs_initcall(populate_rootfs);
> >
> > In fact the latest changes have made this to schedule asynchronously as
> > well. And so indeed, init_mount_tree() always kicks off first. So its
> > still unclear to me why the first mount now always has a fs context of
> > ramfs_init_fs_context, even if we did not care for a ramdisk.
> >
> > Are you suggesting it can be arbitrary now?
> 
> With the existence of the new user_root, the first mount is not directly used
> any more, so the filesystem type of it doesn't  matter.

What do you mean? init_mount_tree() is always called, and it has
statically:

static void __init init_mount_tree(void)                                        
{                                                                               
	struct vfsmount *mnt;
	...
	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
	...
}

And as I noted, this is *always* called earlier than
do_populate_rootfs(). Your changes did not remove init_mount_tree()
or modify it, so why is it always OK now for the above call to use
a ramfs context?

> So it makes no sense to make the file system of the first mount selectable.

Why? I don't see why, nor is it explained; we're always calling
vfs_kern_mount(&rootfs_fs_type, ...) and you have not changed that
either.

> To simplify the code here, I make it ramfs_init_fs_context directly. In fact,
> it's fine to make it shmen_init_fs_context here too.

So indeed you're suggesting it's arbitrary now... I don't see why.

  Luis
Menglong Dong May 25, 2021, 12:55 a.m. UTC | #6
Hello!

On Tue, May 25, 2021 at 6:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> > In fact, the network namespace is just a case for the problem that the
> > 'mount leak' caused. And this kind modification is not friendly to
> > current docker users, it makes great changes to the usage of docker.
>
> You mean an upgrade of docker? If so... that does not seem like a
> definitive reason to do something new in the kernel *always*.

No, I mean that in this solution users would have to create all the
containers first, and create the network namespaces for them later.

>
> However, if you introduce it as a kconfig option so that users
> who want to use this new feature can enable it, and then use it,
> the its sold as a new feature.
>
> Should this always be enabled, or done this way? Should we never have
> the option to revert back to the old behaviour? If not, why not?
>

This change seems transparent to users and doesn't change the behavior
of initramfs. However, it seems more reasonable to make it a kconfig option.
I'll do that in v2 of the three patches I sent.

>
> I don't see you using any context directly, where are you specifying the
> context directly?
>
> > so it is not needed any more
> > and I just removed it in init_rootfs().
> >
> > The initialization of 'user_root' in init_rootfs() is used in:
> > do_populate_rootfs -> mount_user_root, which set the file system(
> > ramfs or tmpfs) of the second mount.
> >
> > Seems it's not suitable to place it in init_rootfs()......
>
> OK I think I just need to understand how you added the context of the
> first mount explicitly now and where, as I don't see it.
>

......

>
> What do you mean? init_mount_tree() is always called, and it has
> statically:
>
> static void __init init_mount_tree(void)
> {
>         struct vfsmount *mnt;
>         ...
>         mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
>         ...
> }
>
> And as I noted, this is *always* called earlier than
> do_populate_rootfs(). Your changes did not remove the init_mount_tree()
> or modify it, and so why would the context of the above call always
> be OK to be used now with a ramfs context now?
>
> > So it makes no sense to make the file system of the first mount selectable.
>
> Why? I don't see why, nor is it explained, we're always caling
> vfs_kern_mount(&rootfs_fs_type, ...) and you have not changed that
> either.
>
> > To simplify the code here, I make it ramfs_init_fs_context directly. In fact,
> > it's fine to make it shmen_init_fs_context here too.
>
> So indeed you're suggesting its arbitrary now.... I don't see why.
>

So the biggest concern now seems to be the first mount I changed; maybe
I didn't make it clear enough before.

Let's call the first mount, which is created in init_mount_tree(), the
'init_mount'. If the root is a block fs, initrd or nfs, the 'init_mount'
will be a ramfs. That seems clear; it can be seen from how tmpfs is
enabled:

void __init init_rootfs(void)
{
    if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
       (!root_fs_names || strstr(root_fs_names, "tmpfs")))
       is_tmpfs = true;
}

That makes sense, because the 'init_mount' will not be the root file
system: the block fs, initrd or nfs mounted later on '/root' becomes
the root through a call to init_chroot().

If the root is an initramfs, the cpio is unpacked into 'init_mount'.
In this case, the 'init_mount' can be ramfs or tmpfs, and it is the
root file system.

In my patch, I create a second mount on top of the 'init_mount'; let's
call it the 'user_mount'.

The cpio is unpacked into the 'user_mount', and the 'user_mount' is made
the root by init_chroot() later. The 'user_mount' can be ramfs or tmpfs.
So the 'init_mount' is not the root any more, and it makes no sense to
make it tmpfs, for the same reason it is ramfs when the root is a block
fs or initrd.

'init_mount' is created from 'rootfs_fs_type' in init_mount_tree():

static void __init init_mount_tree(void)
{
        struct vfsmount *mnt;
        ...
        mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
        ...
}

I make the 'init_mount' a ramfs by setting
'rootfs_fs_type->init_fs_context' to 'ramfs_init_fs_context':

struct file_system_type rootfs_fs_type = {
	.name		= "rootfs",
	.init_fs_context = ramfs_init_fs_context,
	.kill_sb	= kill_litter_super,
};

I think the third patch that I sent makes it clear what I do to
the 'init_mount'.

Thanks!
Menglong Dong
Luis Chamberlain May 25, 2021, 1:43 a.m. UTC | #7
On Tue, May 25, 2021 at 08:55:48AM +0800, Menglong Dong wrote:
> On Tue, May 25, 2021 at 6:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > However, if you introduce it as a kconfig option so that users
> > who want to use this new feature can enable it, and then use it,
> > the its sold as a new feature.
> >
> > Should this always be enabled, or done this way? Should we never have
> > the option to revert back to the old behaviour? If not, why not?
> >
> 
> This change seems transparent to users, which don't change the behavior
> of initramfs. 

Are we sure there is nothing in the kernel that can regress with this
change? Are you sure? How sure?

> However, it seems more reasonable to make it a kconfig option.
> I'll do it in the v2 of the three patches I sended.

I'm actually quite convinced now that this is a desirable default, *other*
than the concern that this could regress. I recently saw some piece of
code fetching the top-most mount, I think it was on the
copy_user_ns() path or something like that, which made me
consider possible regressions from heuristics we might have forgotten
about.

However, I haven't yet had time to review the path I was concerned
about.

> > What do you mean? init_mount_tree() is always called, and it has
> > statically:
> >
> > static void __init init_mount_tree(void)
> > {
> >         struct vfsmount *mnt;
> >         ...
> >         mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> >         ...
> > }
> >
> > And as I noted, this is *always* called earlier than
> > do_populate_rootfs(). Your changes did not remove the init_mount_tree()
> > or modify it, and so why would the context of the above call always
> > be OK to be used now with a ramfs context now?
> >
> > > So it makes no sense to make the file system of the first mount selectable.
> >
> > Why? I don't see why, nor is it explained, we're always caling
> > vfs_kern_mount(&rootfs_fs_type, ...) and you have not changed that
> > either.
> >
> > > To simplify the code here, I make it ramfs_init_fs_context directly. In fact,
> > > it's fine to make it shmen_init_fs_context here too.
> >
> > So indeed you're suggesting its arbitrary now.... I don't see why.
> >
> 
> So the biggest problem now seems to be the first mount I changed, maybe I didn't
> make it clear before.
> 
> Let's call the first mount which is created in init_mount_tree() the
> 'init_mount'.
> If the 'root' is a block fs, initrd or nfs, the 'init_mount' will be a
> ramfs, that seems
> clear, it can be seen from the enable of tmpfs:
> 
> void __init init_rootfs(void)
> {
>     if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
>        (!root_fs_names || strstr(root_fs_names, "tmpfs")))
>        is_tmpfs = true;
> }

Ah yes, I see now... Thanks!
 
  Luis
Menglong Dong May 25, 2021, 6:09 a.m. UTC | #8
On Tue, May 25, 2021 at 9:43 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> >
> > This change seems transparent to users, which don't change the behavior
> > of initramfs.
>
> Are we sure there nothing in the kernel that can regress with this
> change? Are you sure? How sure?
>
> > However, it seems more reasonable to make it a kconfig option.
> > I'll do it in the v2 of the three patches I sended.
>
> I'm actually quite convinced now this is a desirable default *other*
> than the concern if this could regress. I recently saw some piece of
> code fetching for the top most mount, I think it was on the
> copy_user_ns() path or something like that, which made me just
> consider possible regressions for heuristics we might have forgotten
> about.
>
> I however have't yet had time to review the path I was concerned for
> yet.

Yeah, I'm sure... probably. The way I create and mount the 'user root' is
almost the same as for a block root device. When it comes to a block
device, such as hda, what the kernel does is:

/* This will mount block device on '/root' and chdir to '/root' */
prepare_namespace->mount_root->mount_block_root->do_mount_root;

/* This will move the block device mounted on '/root' to '/' */
init_mount(".", "/", NULL, MS_MOVE, NULL);

/* This will change the root to current dir, which is the root of block
 * device.
 */
init_chroot(".")

And these steps are exactly what I do with 'user root'. However, I'm
not totally sure. For safety, I'll make it into a kconfig option. Is
it acceptable to make it enabled by default?
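
Something like the following is what I have in mind (the option name
is just a placeholder for now):

	config INITRAMFS_MOUNT
		bool "Unpack the initramfs on a separate mount"
		depends on BLK_DEV_INITRD
		default y
		help
		  Mount a second ramfs/tmpfs and unpack the initramfs into
		  it, instead of unpacking it directly into rootfs. The
		  resulting root then has a parent mount, so user space
		  (for example a container runtime) can use pivot_root()
		  and umount the old root.

		  If unsure, say Y.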

Thanks!
Menglong Dong

Patch

diff --git a/init/do_mounts.c b/init/do_mounts.c
index a78e44ee6adb..a156b0d28b43 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -459,7 +459,7 @@  void __init mount_block_root(char *name, int flags)
 out:
 	put_page(page);
 }
- 
+
 #ifdef CONFIG_ROOT_NFS
 
 #define NFSROOT_TIMEOUT_MIN	5
@@ -617,24 +617,57 @@  void __init prepare_namespace(void)
 	init_chroot(".");
 }
 
-static bool is_tmpfs;
-static int rootfs_init_fs_context(struct fs_context *fc)
+#ifdef CONFIG_TMPFS
+static __init bool is_tmpfs_enabled(void)
+{
+	return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
+	       !saved_root_name[0];
+}
+#endif
+
+static __init bool is_ramfs_enabled(void)
 {
-	if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
-		return shmem_init_fs_context(fc);
+	return true;
+}
+
+struct fs_user_root {
+	bool (*enabled)(void);
+	char *dev_name;
+	char *fs_name;
+};
 
-	return ramfs_init_fs_context(fc);
+static struct fs_user_root user_roots[] __initdata = {
+#ifdef CONFIG_TMPFS
+	{.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
+#endif
+	{.fs_name = "ramfs", .enabled = is_ramfs_enabled }
+};
+static struct fs_user_root * __initdata user_root;
+
+int __init mount_user_root(void)
+{
+	return do_mount_root(user_root->dev_name,
+			     user_root->fs_name,
+			     root_mountflags,
+			     root_mount_data);
 }
 
 struct file_system_type rootfs_fs_type = {
 	.name		= "rootfs",
-	.init_fs_context = rootfs_init_fs_context,
+	.init_fs_context = ramfs_init_fs_context,
 	.kill_sb	= kill_litter_super,
 };
 
 void __init init_rootfs(void)
 {
-	if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
-		(!root_fs_names || strstr(root_fs_names, "tmpfs")))
-		is_tmpfs = true;
+	struct fs_user_root *root;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(user_roots); i++) {
+		root = &user_roots[i];
+		if (root->enabled()) {
+			user_root = root;
+			break;
+		}
+	}
 }
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 7a29ac3e427b..34978b17454a 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -13,6 +13,7 @@ 
 void  mount_block_root(char *name, int flags);
 void  mount_root(void);
 extern int root_mountflags;
+int   mount_user_root(void);
 
 static inline __init int create_dev(char *name, dev_t dev)
 {
diff --git a/init/initramfs.c b/init/initramfs.c
index af27abc59643..c883379673c0 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -15,6 +15,11 @@ 
 #include <linux/mm.h>
 #include <linux/namei.h>
 #include <linux/init_syscalls.h>
+#include <uapi/linux/mount.h>
+
+#include "do_mounts.h"
+
+extern bool ramdisk_exec_exist(bool abs);
 
 static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
 		loff_t *pos)
@@ -667,6 +672,27 @@  static void __init populate_initrd_image(char *err)
 }
 #endif /* CONFIG_BLK_DEV_RAM */
 
+/*
+ * This function is used to chroot to new initramfs root that
+ * we unpacked.
+ */
+static void __init end_mount_user_root(bool succeed)
+{
+	if (!succeed)
+		goto on_failed;
+
+	if (!ramdisk_exec_exist(false))
+		goto on_failed;
+
+	init_mount(".", "/", NULL, MS_MOVE, NULL);
+	init_chroot(".");
+	return;
+
+on_failed:
+	init_chdir("/");
+	init_umount("/root", 0);
+}
+
 static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 {
 	/* Load the built in initramfs */
@@ -682,15 +708,21 @@  static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 	else
 		printk(KERN_INFO "Unpacking initramfs...\n");
 
+	if (mount_user_root())
+		panic("Failed to create user root");
+
 	err = unpack_to_rootfs((char *)initrd_start, initrd_end - initrd_start);
 	if (err) {
+		end_mount_user_root(false);
 #ifdef CONFIG_BLK_DEV_RAM
 		populate_initrd_image(err);
 #else
 		printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
 #endif
+		goto done;
 	}
 
+	end_mount_user_root(true);
 done:
 	/*
 	 * If the initrd region is overlapped with crashkernel reserved region,
diff --git a/init/main.c b/init/main.c
index eb01e121d2f1..431da5f01f11 100644
--- a/init/main.c
+++ b/init/main.c
@@ -607,6 +607,21 @@  static inline void setup_nr_cpu_ids(void) { }
 static inline void smp_prepare_cpus(unsigned int maxcpus) { }
 #endif
 
+bool __init ramdisk_exec_exist(bool abs)
+{
+	char *tmp_command = ramdisk_execute_command;
+
+	if (!tmp_command)
+		return false;
+
+	if (!abs) {
+		while (*tmp_command == '/' || *tmp_command == '.')
+			tmp_command++;
+	}
+
+	return init_eaccess(tmp_command) == 0;
+}
+
 /*
  * We need to store the untouched command line for future reference.
  * We also need to store the touched command line since the parameter
@@ -1568,7 +1583,7 @@  static noinline void __init kernel_init_freeable(void)
 	 * check if there is an early userspace init.  If yes, let it do all
 	 * the work
 	 */
-	if (init_eaccess(ramdisk_execute_command) != 0) {
+	if (!ramdisk_exec_exist(true)) {
 		ramdisk_execute_command = NULL;
 		prepare_namespace();
 	}