diff mbox series

[3/3] nsproxy: support CLONE_NEWTIME with setns()

Message ID 20200619153559.724863-4-christian.brauner@ubuntu.com (mailing list archive)
State New, archived
Series nsproxy: support CLONE_NEWTIME with setns()

Commit Message

Christian Brauner June 19, 2020, 3:35 p.m. UTC
So far setns() was missing time namespace support. This was partially due
to it simply not being implemented, but also because vdso_join_timens()
could still fail, which made switching to multiple namespaces atomically
problematic. This is now fixed, so support CLONE_NEWTIME with setns().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/time_namespace.h |  6 ++++++
 kernel/nsproxy.c               | 21 +++++++++++++++++++--
 kernel/time/namespace.c        |  5 +----
 3 files changed, 26 insertions(+), 6 deletions(-)

Comments

Christian Brauner June 23, 2020, 11:55 a.m. UTC | #1
On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Dmitry Safonov <dima@arista.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---

Andrei,
Dmitry,

A little off-topic since it's not related to the patch here, but I've been
going through the current time namespace semantics and I just want to
confirm something with you:

Afaict, unshare(CLONE_NEWTIME) currently works similarly to
unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
but does _not_ change the {pid,time} namespace of the caller itself.
For pid namespaces that makes a lot of sense but I'm not completely
clear why you're doing this for time namespaces, especially since the
setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
pid namespace of the caller itself, it only changes it for its
children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
both the caller's and the children's time namespaces are changed, i.e.
unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This also has the consequence that the unshare(CLONE_NEWTIME) +
setns(CLONE_NEWTIME) sequence can be used to change the caller's time
namespace. Is this intended?
Here's some code where you can verify this (please excuse the awful
code I'm using to illustrate this):

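#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/* readlink() does not NUL-terminate the result, so do it here. */
static void read_ns(const char *path, char *buf, size_t len)
{
	ssize_t n = readlink(path, buf, len - 1);

	if (n < 0)
		exit(4);
	buf[n] = '\0';
}

int main(int argc, char *argv[])
{
	char buf1[4096], buf2[4096];

	if (unshare(CLONE_NEWTIME))
		exit(1);

	/* The caller's own time namespace, which unshare() left unchanged. */
	int fd = open("/proc/self/ns/time", O_RDONLY);
	if (fd < 0)
		exit(2);

	read_ns("/proc/self/ns/time", buf1, sizeof(buf1));
	read_ns("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
	printf("unshare(CLONE_NEWTIME):		time(%s) ~= time_for_children(%s)\n", buf1, buf2);

	if (setns(fd, CLONE_NEWTIME))
		exit(3);

	read_ns("/proc/self/ns/time", buf1, sizeof(buf1));
	read_ns("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
	printf("setns(self, CLONE_NEWTIME):	time(%s) == time_for_children(%s)\n", buf1, buf2);

	exit(EXIT_SUCCESS);
}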

which gives:

root@f2-vm:/# ./test
unshare(CLONE_NEWTIME):		time(time:[4026531834]) ~= time_for_children(time:[4026532366])
setns(self, CLONE_NEWTIME):	time(time:[4026531834]) == time_for_children(time:[4026531834])

Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
namespace when setns(CLONE_NEWTIME) is allowed to do this?

Christian
Andrei Vagin June 25, 2020, 8:42 a.m. UTC | #2
On Tue, Jun 23, 2020 at 01:55:21PM +0200, Christian Brauner wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> > 
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Serge Hallyn <serge@hallyn.com>
> > Cc: Dmitry Safonov <dima@arista.com>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> 
> Andrei,
> Dmitry,
> 
> A little off-topic since it's not related to the patch here, but I've been
> going through the current time namespace semantics and I just want to
> confirm something with you:
> 
> Afaict, unshare(CLONE_NEWTIME) currently works similarly to
> unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
> but does _not_ change the {pid,time} namespace of the caller itself.
> For pid namespaces that makes a lot of sense but I'm not completely
> clear why you're doing this for time namespaces, especially since the
> setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
> Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
> pid namespace of the caller itself, it only changes it for its
> children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
> both the caller's and the children's time namespaces are changed, i.e.
> unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This scheme allows setting clock offsets for a namespace before any
processes appear in it. Changing the offsets is not allowed once any task
has joined the time namespace. We need this to avoid corner cases with
timers, and so that tasks don't need to be aware of offset changes.
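
For illustration, here is a minimal sketch in C of setting offsets while
the new namespace is still empty and only then joining it. The helper
names format_offset() and enter_new_timens() are made up for this example.
It assumes the "<clock-id> <offset-secs> <offset-nanosecs>" line format of
/proc/PID/timens_offsets as documented in time_namespaces(7), a kernel
with this series applied, and a single-threaded caller with CAP_SYS_ADMIN
(plus CAP_SYS_TIME for the offsets write).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/*
 * Hypothetical helper: format one timens_offsets line,
 * "<clock-id> <offset-secs> <offset-nanosecs>\n".
 * Returns the line length, or -1 if it does not fit into buf.
 */
int format_offset(char *buf, size_t len, const char *clock,
		  long long secs, long nsecs)
{
	int n = snprintf(buf, len, "%s %lld %ld\n", clock, secs, nsecs);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: unshare, set the CLOCK_MONOTONIC offset, setns.
 * Returns 0 on success and -1 on failure.
 */
int enter_new_timens(long long monotonic_offset_secs)
{
	char line[64];
	int n, fd, ret;

	/* Only time_for_children changes; the caller stays where it is. */
	if (unshare(CLONE_NEWTIME))
		return -1;

	n = format_offset(line, sizeof(line), "monotonic",
			  monotonic_offset_secs, 0);
	if (n < 0)
		return -1;

	/* No task has joined the new namespace yet, so offsets are writable. */
	fd = open("/proc/self/timens_offsets", O_WRONLY);
	if (fd < 0)
		return -1;
	ret = (write(fd, line, n) == n) ? 0 : -1;
	close(fd);
	if (ret)
		return -1;

	/* Now move the caller itself into the namespace it just created. */
	fd = open("/proc/self/ns/time_for_children", O_RDONLY);
	if (fd < 0)
		return -1;
	ret = setns(fd, CLONE_NEWTIME);
	close(fd);
	return ret;
}
```

A caller would invoke something like enter_new_timens(36000) early,
before creating threads. After it returns 0 the offsets can no longer be
changed, since a task has joined the namespace.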

> 
> This also has the consequence that the unshare(CLONE_NEWTIME) +
> setns(CLONE_NEWTIME) sequence can be used to change the caller's time
> namespace. Is this intended?
> Here's some code where you can verify this (please excuse the awful
> code I'm using to illustrate this):
> 
> int main(int argc, char *argv[])
> {
> 	char buf1[4096], buf2[4096];
> 
> 	if (unshare(0x00000080))
> 		exit(1);
> 
> 	int fd = open("/proc/self/ns/time", O_RDONLY);
> 	if (fd < 0)
> 		exit(2);
> 
> 	readlink("/proc/self/ns/time", buf1, sizeof(buf1));
> 	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
> 	printf("unshare(CLONE_NEWTIME):		time(%s) ~= time_for_children(%s)\n", buf1, buf2);
> 
> 	if (setns(fd, 0x00000080))
> 		exit(3);

And in this example, you use the right sequence of steps: unshare, set
offsets, setns. With clone3, we will be able to do this in one call.

> 
> 	readlink("/proc/self/ns/time", buf1, sizeof(buf1));
> 	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
> 	printf("setns(self, CLONE_NEWTIME):	time(%s) == time_for_children(%s)\n", buf1, buf2);
> 
> 	exit(EXIT_SUCCESS);
> }
> 
> which gives:
> 
> root@f2-vm:/# ./test
> unshare(CLONE_NEWTIME):		time(time:[4026531834]) ~= time_for_children(time:[4026532366])
> setns(self, CLONE_NEWTIME):	time(time:[4026531834]) == time_for_children(time:[4026531834])
> 
> Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
> namespace when setns(CLONE_NEWTIME) is allowed to do this?
> 
> Christian
Andrei Vagin June 25, 2020, 9:06 a.m. UTC | #3
On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Dmitry Safonov <dima@arista.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Hi Christian,

I have reviewed this series and it looks good to me.

We decided not to change the return type of vdso_join_timens to avoid
conflicts with the arm64 timens patchset. With this change, you can add
my Reviewed-by to all patches in this series.

Reviewed-by: Andrei Vagin <avagin@gmail.com>

Thanks,
Andrei
Christian Brauner June 25, 2020, 12:48 p.m. UTC | #4
On Thu, Jun 25, 2020 at 02:06:18AM -0700, Andrei Vagin wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> > 
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Serge Hallyn <serge@hallyn.com>
> > Cc: Dmitry Safonov <dima@arista.com>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Hi Christian,
> 
> I have reviewed this series and it looks good to me.
> 
> We decided not to change the return type of vdso_join_timens to avoid
> conflicts with the arm64 timens patchset. With this change, you can add
> my Reviewed-by to all patches in this series.
> 
> Reviewed-by: Andrei Vagin <avagin@gmail.com>

Thanks! As discussed in the thread for the arm changes, we'll defer the
return type changes.

Christian

Patch

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 4d1768c6f836..d308a3812f79 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -33,6 +33,7 @@  extern struct time_namespace init_time_ns;
 #ifdef CONFIG_TIME_NS
 extern void vdso_join_timens(struct task_struct *task,
 			     struct time_namespace *ns);
+extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);
 
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
@@ -95,6 +96,11 @@  static inline void vdso_join_timens(struct task_struct *task,
 {
 }
 
+static inline void timens_commit(struct task_struct *tsk,
+				 struct time_namespace *ns)
+{
+}
+
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
 	return NULL;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b03df67621d0..f12231c41b69 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -262,8 +262,8 @@  void exit_task_namespaces(struct task_struct *p)
 static int check_setns_flags(unsigned long flags)
 {
 	if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-				 CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID |
-				 CLONE_NEWCGROUP)))
+				 CLONE_NEWNET | CLONE_NEWTIME | CLONE_NEWUSER |
+				 CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return -EINVAL;
 
 #ifndef CONFIG_USER_NS
@@ -290,6 +290,10 @@  static int check_setns_flags(unsigned long flags)
 	if (flags & CLONE_NEWNET)
 		return -EINVAL;
 #endif
+#ifndef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME)
+		return -EINVAL;
+#endif
 
 	return 0;
 }
@@ -464,6 +468,14 @@  static int validate_nsset(struct nsset *nsset, struct pid *pid)
 	}
 #endif
 
+#ifdef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME) {
+		ret = validate_ns(nsset, &nsp->time_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
 out:
 	if (pid_ns)
 		put_pid_ns(pid_ns);
@@ -507,6 +519,11 @@  static void commit_nsset(struct nsset *nsset)
 		exit_sem(me);
 #endif
 
+#ifdef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME)
+		timens_commit(me, nsset->nsproxy->time_ns);
+#endif
+
 	/* transfer ownership */
 	switch_task_namespaces(me, nsset->nsproxy);
 	nsset->nsproxy = NULL;
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index aa7b90aac2a7..afc65e6be33e 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -280,7 +280,7 @@  static void timens_put(struct ns_common *ns)
 	put_time_ns(to_time_ns(ns));
 }
 
-static void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
 {
 	timens_set_vvar_page(tsk, ns);
 	vdso_join_timens(tsk, ns);
@@ -298,9 +298,6 @@  static int timens_install(struct nsset *nsset, struct ns_common *new)
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
-
-	timens_commit(current, ns);
-
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
 	nsproxy->time_ns = ns;
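
As a usage illustration (not part of the patch), here is a hedged sketch
of how userspace might attach to another task's time namespace through a
pidfd once check_setns_flags() accepts CLONE_NEWTIME. The helper name
join_namespaces_of() is hypothetical, and the fallback syscall number 434
for pidfd_open is an assumption that holds on most, but not all,
architectures. The caller is assumed to be single-threaded and to have
the required capabilities.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/*
 * Hypothetical helper: join the namespaces of @pid selected by @nstypes,
 * e.g. CLONE_NEWTIME | CLONE_NEWNS. With a pidfd, setns() either switches
 * all requested namespaces or none of them.
 * Returns 0 on success, or -1 with errno set.
 */
int join_namespaces_of(pid_t pid, int nstypes)
{
	int ret, saved_errno;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;

	ret = setns(pidfd, nstypes);

	/* Keep the errno from setns() across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return ret;
}
```

A call such as join_namespaces_of(target_pid, CLONE_NEWTIME | CLONE_NEWNS)
would then switch the time and mount namespaces in one step. On kernels
without this series the CLONE_NEWTIME flag is expected to be rejected
with EINVAL.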