Message ID | 20200619153559.724863-4-christian.brauner@ubuntu.com (mailing list archive) |
---|---|
State | New, archived |
Series | nsproxy: support CLONE_NEWTIME with setns() |
On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Dmitry Safonov <dima@arista.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---

Andrei,
Dmitry,

A little off-topic since it's not related to the patch here but I've been
going through the current time namespace semantics and I just want to
confirm something with you:

Afaict, unshare(CLONE_NEWTIME) currently works similarly to
unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
but does _not_ change the {pid, time} namespace of the caller itself.
For pid namespaces that makes a lot of sense but I'm not completely
clear why you're doing this for time namespaces, especially since the
setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
pid namespace of the caller itself, it only changes it for its
children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
both the caller's and the children's time namespace is changed, i.e.
unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This also has the consequence that the unshare(CLONE_NEWTIME) +
setns(CLONE_NEWTIME) sequence can be used to change the caller's time
namespace. Is this intended?
Here's some code where you can verify this (please excuse the awful
code I'm using to illustrate this):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* readlink() does not NUL-terminate: zero the buffers and leave
	 * room for the terminator. */
	char buf1[4096] = "", buf2[4096] = "";

	if (unshare(0x00000080)) /* CLONE_NEWTIME */
		exit(1);

	int fd = open("/proc/self/ns/time", O_RDONLY);
	if (fd < 0)
		exit(2);

	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
	printf("unshare(CLONE_NEWTIME): time(%s) ~= time_for_children(%s)\n", buf1, buf2);

	if (setns(fd, 0x00000080)) /* CLONE_NEWTIME */
		exit(3);

	memset(buf1, 0, sizeof(buf1));
	memset(buf2, 0, sizeof(buf2));
	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
	printf("setns(self, CLONE_NEWTIME): time(%s) == time_for_children(%s)\n", buf1, buf2);

	exit(EXIT_SUCCESS);
}

which gives:

root@f2-vm:/# ./test
unshare(CLONE_NEWTIME): time(time:[4026531834]) ~= time_for_children(time:[4026532366])
setns(self, CLONE_NEWTIME): time(time:[4026531834]) == time_for_children(time:[4026531834])

Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
namespace when setns(CLONE_NEWTIME) is allowed to do this?

Christian
On Tue, Jun 23, 2020 at 01:55:21PM +0200, Christian Brauner wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Serge Hallyn <serge@hallyn.com>
> > Cc: Dmitry Safonov <dima@arista.com>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
>
> Andrei,
> Dmitry,
>
> A little off-topic since it's not related to the patch here but I've been
> going through the current time namespace semantics and I just want to
> confirm something with you:
>
> Afaict, unshare(CLONE_NEWTIME) currently works similarly to
> unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
> but does _not_ change the {pid, time} namespace of the caller itself.
> For pid namespaces that makes a lot of sense but I'm not completely
> clear why you're doing this for time namespaces, especially since the
> setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
> Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
> pid namespace of the caller itself, it only changes it for its
> children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
> both the caller's and the children's time namespace is changed, i.e.
> unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This scheme allows setting clock offsets for a namespace before any
processes appear in it. Changing offsets is not allowed once any task has
joined the time namespace. We need this to avoid corner cases with timers,
and so that tasks don't need to be aware of offset changes.
>
> This also has the consequence that the unshare(CLONE_NEWTIME) +
> setns(CLONE_NEWTIME) sequence can be used to change the caller's time
> namespace. Is this intended?
>
> Here's some code where you can verify this (please excuse the awful
> code I'm using to illustrate this):
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> 	/* readlink() does not NUL-terminate: zero the buffers and leave
> 	 * room for the terminator. */
> 	char buf1[4096] = "", buf2[4096] = "";
>
> 	if (unshare(0x00000080)) /* CLONE_NEWTIME */
> 		exit(1);
>
> 	int fd = open("/proc/self/ns/time", O_RDONLY);
> 	if (fd < 0)
> 		exit(2);
>
> 	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
> 	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
> 	printf("unshare(CLONE_NEWTIME): time(%s) ~= time_for_children(%s)\n", buf1, buf2);
>
> 	if (setns(fd, 0x00000080)) /* CLONE_NEWTIME */
> 		exit(3);

And in this example, you use the right sequence of steps: unshare, set
offsets, setns. With clone3, we will be able to do this in one call.

>
> 	memset(buf1, 0, sizeof(buf1));
> 	memset(buf2, 0, sizeof(buf2));
> 	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
> 	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
> 	printf("setns(self, CLONE_NEWTIME): time(%s) == time_for_children(%s)\n", buf1, buf2);
>
> 	exit(EXIT_SUCCESS);
> }
>
> which gives:
>
> root@f2-vm:/# ./test
> unshare(CLONE_NEWTIME): time(time:[4026531834]) ~= time_for_children(time:[4026532366])
> setns(self, CLONE_NEWTIME): time(time:[4026531834]) == time_for_children(time:[4026531834])
>
> Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
> namespace when setns(CLONE_NEWTIME) is allowed to do this?
>
> Christian
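
The three steps named in the reply above (unshare, set offsets, setns) might look roughly like the sketch below. It is only an illustration under several assumptions: the helper names format_offset() and enter_new_timens() are made up for this example, the offsets are written to /proc/self/timens_offsets using the "monotonic <secs> <nsecs>" line format described in time_namespaces(7), the kernel has this series applied so that setns() accepts CLONE_NEWTIME, and the caller is single-threaded with CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* same value the test program in this thread passes */
#endif

/* Hypothetical helper: build one "<clock> <secs> <nsecs>\n" line.
 * Returns the line length, or -1 if it does not fit in buf. */
int format_offset(char *buf, size_t len, const char *clock,
		  long long secs, long nsecs)
{
	int n;

	assert(buf != NULL && clock != NULL);
	n = snprintf(buf, len, "%s %lld %ld\n", clock, secs, nsecs);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: unshare -> set offsets -> setns.
 * Returns 0 on success, otherwise the number of the step that failed. */
int enter_new_timens(long long monotonic_secs)
{
	char line[64];
	int fd, n, step = 0;

	/* 1. Create a new time namespace; only time_for_children changes. */
	if (unshare(CLONE_NEWTIME))
		return 1;

	/* 2. Set the offset while no task has joined the namespace yet. */
	n = format_offset(line, sizeof(line), "monotonic", monotonic_secs, 0);
	if (n < 0)
		return 2;
	fd = open("/proc/self/timens_offsets", O_WRONLY);
	if (fd < 0)
		return 2;
	if (write(fd, line, (size_t)n) != n)
		step = 2;
	close(fd);
	if (step)
		return step;

	/* 3. Move the caller itself into the namespace it just created. */
	fd = open("/proc/self/ns/time_for_children", O_RDONLY);
	if (fd < 0)
		return 3;
	if (setns(fd, CLONE_NEWTIME))
		step = 3;
	close(fd);
	return step;
}

#ifdef TIMENS_DEMO_MAIN
int main(void)
{
	int step = enter_new_timens(86400);	/* shift CLOCK_MONOTONIC by one day */

	if (step) {
		fprintf(stderr, "step %d failed\n", step);
		return step;
	}
	puts("caller is now inside the new time namespace");
	return 0;
}
#endif
```

Building with -DTIMENS_DEMO_MAIN gives a small standalone program. Note that step 3 opens time_for_children rather than /proc/self/ns/time, since after unshare() only the former refers to the new namespace.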
On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Dmitry Safonov <dima@arista.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Hi Christian,

I have reviewed this series and it looks good to me.

We decided to not change the return type of vdso_join_timens to avoid
conflicts with the arm64 timens patchset. With this change, you can add
my Reviewed-by to all patches in this series.

Reviewed-by: Andrei Vagin <avagin@gmail.com>

Thanks,
Andrei
On Thu, Jun 25, 2020 at 02:06:18AM -0700, Andrei Vagin wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Serge Hallyn <serge@hallyn.com>
> > Cc: Dmitry Safonov <dima@arista.com>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>
> Hi Christian,
>
> I have reviewed this series and it looks good to me.
>
> We decided to not change the return type of vdso_join_timens to avoid
> conflicts with the arm64 timens patchset. With this change, you can add
> my Reviewed-by to all patches in this series.
>
> Reviewed-by: Andrei Vagin <avagin@gmail.com>

Thanks!
As discussed in the thread for the arm changes, we'll defer the return
type changes!

Christian
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 4d1768c6f836..d308a3812f79 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -33,6 +33,7 @@ extern struct time_namespace init_time_ns;
 #ifdef CONFIG_TIME_NS
 extern void vdso_join_timens(struct task_struct *task,
 			     struct time_namespace *ns);
+extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);
 
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
@@ -95,6 +96,11 @@ static inline void vdso_join_timens(struct task_struct *task,
 {
 }
 
+static inline void timens_commit(struct task_struct *tsk,
+				 struct time_namespace *ns)
+{
+}
+
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
 	return NULL;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b03df67621d0..f12231c41b69 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -262,8 +262,8 @@ void exit_task_namespaces(struct task_struct *p)
 static int check_setns_flags(unsigned long flags)
 {
 	if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-				 CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID |
-				 CLONE_NEWCGROUP)))
+				 CLONE_NEWNET | CLONE_NEWTIME | CLONE_NEWUSER |
+				 CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return -EINVAL;
 
 #ifndef CONFIG_USER_NS
@@ -290,6 +290,10 @@ static int check_setns_flags(unsigned long flags)
 	if (flags & CLONE_NEWNET)
 		return -EINVAL;
 #endif
+#ifndef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME)
+		return -EINVAL;
+#endif
 
 	return 0;
 }
@@ -464,6 +468,14 @@ static int validate_nsset(struct nsset *nsset, struct pid *pid)
 	}
 #endif
 
+#ifdef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME) {
+		ret = validate_ns(nsset, &nsp->time_ns->ns);
+		if (ret)
+			goto out;
+	}
+#endif
+
 out:
 	if (pid_ns)
 		put_pid_ns(pid_ns);
@@ -507,6 +519,11 @@ static void commit_nsset(struct nsset *nsset)
 		exit_sem(me);
 #endif
 
+#ifdef CONFIG_TIME_NS
+	if (flags & CLONE_NEWTIME)
+		timens_commit(me, nsset->nsproxy->time_ns);
+#endif
+
 	/* transfer ownership */
 	switch_task_namespaces(me, nsset->nsproxy);
 	nsset->nsproxy = NULL;
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index aa7b90aac2a7..afc65e6be33e 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -280,7 +280,7 @@ static void timens_put(struct ns_common *ns)
 	put_time_ns(to_time_ns(ns));
 }
 
-static void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
 {
 	timens_set_vvar_page(tsk, ns);
 	vdso_join_timens(tsk, ns);
@@ -298,9 +298,6 @@ static int timens_install(struct nsset *nsset, struct ns_common *new)
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
-
-	timens_commit(current, ns);
-
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
 	nsproxy->time_ns = ns;
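
To make the effect of the check_setns_flags() hunks easier to see, here is a small user-space model of the flag check. It is not kernel code: the function name check_setns_flags_model() and the time_ns_enabled parameter (standing in for CONFIG_TIME_NS) are inventions for this sketch, the other per-namespace config checks are left out, and the fallback flag values are only there for older libc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdbool.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* Flags accepted after this patch: CLONE_NEWTIME joins the existing set. */
#define SETNS_KNOWN_FLAGS (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |	\
			   CLONE_NEWNET | CLONE_NEWTIME | CLONE_NEWUSER | \
			   CLONE_NEWPID | CLONE_NEWCGROUP)

/* Hypothetical model of the check; time_ns_enabled plays the role of
 * CONFIG_TIME_NS being set at build time. */
int check_setns_flags_model(unsigned long flags, bool time_ns_enabled)
{
	/* No flags at all, or any flag outside the known set, is invalid. */
	if (!flags || (flags & ~SETNS_KNOWN_FLAGS))
		return -EINVAL;

	/* Without time namespace support, CLONE_NEWTIME is rejected. */
	if (!time_ns_enabled && (flags & CLONE_NEWTIME))
		return -EINVAL;

	return 0;
}
```

With time_ns_enabled set, CLONE_NEWTIME passes on its own or combined with other namespace flags; unrelated flags such as CLONE_VM are still rejected.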
So far setns() was missing time namespace support. This was partially due
to it simply not being implemented but also because vdso_join_timens()
could still fail which made switching to multiple namespaces atomically
problematic. This is now fixed so support CLONE_NEWTIME with setns()

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/time_namespace.h |  6 ++++++
 kernel/nsproxy.c               | 21 +++++++++++++++++++--
 kernel/time/namespace.c        |  5 +----
 3 files changed, 26 insertions(+), 6 deletions(-)
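
From user space, the new support could be exercised along the lines of the sketch below, which joins the time namespace of another process. The helper names timens_path() and join_timens_of() are hypothetical, and the sketch assumes a kernel with this patch applied, a single-threaded caller, and CAP_SYS_ADMIN in the relevant user namespaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* same value the test program in this thread passes */
#endif

/* Hypothetical helper: build "/proc/<pid>/ns/time"; 0 on success, -1 if
 * the path does not fit in buf. */
int timens_path(char *buf, size_t len, pid_t pid)
{
	int n = snprintf(buf, len, "/proc/%ld/ns/time", (long)pid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: move the caller into the time namespace of pid.
 * Returns 0 on success and -1 on failure. */
int join_timens_of(pid_t pid)
{
	char path[64];
	int fd, ret;

	if (timens_path(path, sizeof(path), pid))
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Passing CLONE_NEWTIME makes setns() fail unless fd really refers
	 * to a time namespace. */
	ret = setns(fd, CLONE_NEWTIME);
	close(fd);
	return ret;
}

#ifdef TIMENS_DEMO_MAIN
int main(void)
{
	if (join_timens_of(1)) {	/* pid 1 is only an example target */
		fprintf(stderr, "could not join the time namespace\n");
		return 1;
	}
	puts("joined the time namespace");
	return 0;
}
#endif
```

Building with -DTIMENS_DEMO_MAIN gives a standalone program; on a kernel without this patch the setns() call is expected to fail with EINVAL because check_setns_flags() rejects CLONE_NEWTIME.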