[RFC,00/20] ns: Introduce Time Namespace

Message ID	20180919205037.9574-1-dima@arista.com (mailing list archive)
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> From: Dmitry Safonov <dima@arista.com> To: linux-kernel@vger.kernel.org Cc: Dmitry Safonov <0x7f454c46@gmail.com>, Dmitry Safonov <dima@arista.com>, Adrian Reber <adrian@lisas.de>, Andrei Vagin <avagin@openvz.org>, Andy Lutomirski <luto@kernel.org>, Christian Brauner <christian.brauner@ubuntu.com>, Cyrill Gorcunov <gorcunov@openvz.org>, "Eric W. Biederman" <ebiederm@xmission.com>, "H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Emelyanov <xemul@virtuozzo.com>, Shuah Khan <shuah@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, containers@lists.linux-foundation.org, criu@openvz.org, linux-api@vger.kernel.org, x86@kernel.org, Alexey Dobriyan <adobriyan@gmail.com>, linux-kselftest@vger.kernel.org Subject: [RFC 00/20] ns: Introduce Time Namespace Date: Wed, 19 Sep 2018 21:50:17 +0100 Message-Id: <20180919205037.9574-1-dima@arista.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kselftest-owner@vger.kernel.org Precedence: bulk
Series	ns: Introduce Time Namespace \| expand [RFC,00/20] ns: Introduce Time Namespace [RFC,16/20] selftest: Add Time Namespace test for supported clocks [RFC,17/20] selftest/timens: Add test for timerfd [RFC,18/20] selftest/timens: Add test for clock_nanosleep [RFC,19/20] timens/selftest: Add procfs selftest [RFC,20/20] timens/selftest: Add timer offsets test

Dmitry Safonov Sept. 19, 2018, 8:50 p.m. UTC

Discussions around time virtualization are there for a long time.
The first attempt to implement time namespace was in 2006 by Jeff Dike.
From that time, the topic appears on and off in various discussions.

There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.

“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)

The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
start points for them are not defined and are different for each running
system. When a container is migrated from one node to another, all
clocks have to be restored into consistent states; in other words, they
have to continue running from the same points where they have been
dumped.

The main idea behind this patch set is adding per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
time of a clock, a namespace offset is added to the current value of
this clock on a host and the sum is returned.

All offsets are placed on a separate page, this allows up to map it as 
part of vvar into user processes and use offsets from vdso calls.

Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
clocks.

Questions to discuss:

* Clone flags exhaustion. Currently there is only one unused clone flag
bit left, and it may be worth to use it to extend arguments of the clone
system call.

* Realtime clock implementation details:
  Is having a simple offset enough?
  What to do when date and time is changed on the host?
  Is there a need to adjust vfs modification and creation times? 
  Implementation for adjtime() syscall.

Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com> 
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: containers@lists.linux-foundation.org
Cc: criu@openvz.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org

Andrei Vagin (12):
  ns: Introduce Time Namespace
  timens: Add timens_offsets
  timens: Introduce CLOCK_MONOTONIC offsets
  timens: Introduce CLOCK_BOOTTIME offset
  timerfd/timens: Take into account ns clock offsets
  kernel: Take into account timens clock offsets in clock_nanosleep
  x86/vdso/timens: Add offsets page in vvar
  x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
  posix-timers/timens: Take into account clock offsets
  selftest/timens: Add test for timerfd
  selftest/timens: Add test for clock_nanosleep
  timens/selftest: Add timer offsets test

Dmitry Safonov (8):
  timens: Shift /proc/uptime
  x86/vdso: Restrict splitting vvar vma
  x86/vdso: Purge timens page on setns()/unshare()/clone()
  x86/vdso: Look for vvar vma to purge timens page
  timens: Add align for timens_offsets
  timens: Optimize zero-offsets
  selftest: Add Time Namespace test for supported clocks
  timens/selftest: Add procfs selftest

 arch/Kconfig                                     |   5 +
 arch/x86/Kconfig                                 |   1 +
 arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
 arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
 arch/x86/entry/vdso/vdso2c.c                     |   3 +
 arch/x86/entry/vdso/vma.c                        |  67 +++++++
 arch/x86/include/asm/vdso.h                      |   2 +
 fs/proc/namespaces.c                             |   3 +
 fs/proc/uptime.c                                 |   3 +
 fs/timerfd.c                                     |  16 +-
 include/linux/nsproxy.h                          |   1 +
 include/linux/proc_ns.h                          |   1 +
 include/linux/time_namespace.h                   |  72 +++++++
 include/linux/timens_offsets.h                   |  25 +++
 include/linux/user_namespace.h                   |   1 +
 include/uapi/linux/sched.h                       |   1 +
 init/Kconfig                                     |   8 +
 kernel/Makefile                                  |   1 +
 kernel/fork.c                                    |   3 +-
 kernel/nsproxy.c                                 |  19 +-
 kernel/time/hrtimer.c                            |   8 +
 kernel/time/posix-timers.c                       |  89 ++++++++-
 kernel/time/posix-timers.h                       |   2 +
 kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
 tools/testing/selftests/timens/.gitignore        |   5 +
 tools/testing/selftests/timens/Makefile          |   6 +
 tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
 tools/testing/selftests/timens/config            |   1 +
 tools/testing/selftests/timens/log.h             |  21 +++
 tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
 tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
 tools/testing/selftests/timens/timer.c           |  95 ++++++++++
 tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
 33 files changed, 1272 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/time_namespace.h
 create mode 100644 include/linux/timens_offsets.h
 create mode 100644 kernel/time_namespace.c
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/procfs.c
 create mode 100644 tools/testing/selftests/timens/timens.c
 create mode 100644 tools/testing/selftests/timens/timer.c
 create mode 100644 tools/testing/selftests/timens/timerfd.c

Eric W. Biederman Sept. 21, 2018, 12:27 p.m. UTC | #1

Dmitry Safonov <dima@arista.com> writes:

> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows up to map it as 
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
> bit left, and it may be worth to use it to extend arguments of the clone
> system call.
>
> * Realtime clock implementation details:
>   Is having a simple offset enough?
>   What to do when date and time is changed on the host?
>   Is there a need to adjust vfs modification and creation times? 
>   Implementation for adjtime() syscall.

Overall I support this effort.  In my quick skim this code looked good.

My feeling is that we need to be able to support running ntpd and
support one namespace doing googles smoothing of leap seconds while
another namespace takes the leap second.

What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace.  That
structure already keeps offsets for all of the various clocks from
the kerne internal time sources.  What would be needed would be to
pass in an appropriate time namespace pointer.

I could be completely wrong as I have not take the time to completely
trace through the code.  Have you looked at pushing the time namespace
down as far as tk_core?

What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused.  Allowing testing of the kernel's
time code by setting up a new time namespace.  So a person in production
could setup a time namespace with the time set ahead a little  bit and
be able to verify that the kernel handles the upcoming leap second
properly.



I don't know about the vfs.  I think the danger is being able to write
dates in the future or in the past.  It appears that utimes(2) and
utimesnat(2) already allow this except for status change.  So it is
possible we simply don't care.  I seem to remember that what nfs does
is take the time stamp from the host writing to the file.

I think the guide for filesystem timestamps should be to first ensure
we don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.

Given those those two guidlines above I don't think there is a need to
change timestamsp the way the user namespace changes uid when displayed.



As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it.  We might even be able to get
away with leaving the real time clock out of the time namespace.  If not
we need to be very careful how the real time clock is abstracted.  I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.

Eric

> Cc: Dmitry Safonov <0x7f454c46@gmail.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Andrei Vagin <avagin@openvz.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: containers@lists.linux-foundation.org
> Cc: criu@openvz.org
> Cc: linux-api@vger.kernel.org
> Cc: x86@kernel.org
>
> Andrei Vagin (12):
>   ns: Introduce Time Namespace
>   timens: Add timens_offsets
>   timens: Introduce CLOCK_MONOTONIC offsets
>   timens: Introduce CLOCK_BOOTTIME offset
>   timerfd/timens: Take into account ns clock offsets
>   kernel: Take into account timens clock offsets in clock_nanosleep
>   x86/vdso/timens: Add offsets page in vvar
>   x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
>   posix-timers/timens: Take into account clock offsets
>   selftest/timens: Add test for timerfd
>   selftest/timens: Add test for clock_nanosleep
>   timens/selftest: Add timer offsets test
>
> Dmitry Safonov (8):
>   timens: Shift /proc/uptime
>   x86/vdso: Restrict splitting vvar vma
>   x86/vdso: Purge timens page on setns()/unshare()/clone()
>   x86/vdso: Look for vvar vma to purge timens page
>   timens: Add align for timens_offsets
>   timens: Optimize zero-offsets
>   selftest: Add Time Namespace test for supported clocks
>   timens/selftest: Add procfs selftest
>
>  arch/Kconfig                                     |   5 +
>  arch/x86/Kconfig                                 |   1 +
>  arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
>  arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
>  arch/x86/entry/vdso/vdso2c.c                     |   3 +
>  arch/x86/entry/vdso/vma.c                        |  67 +++++++
>  arch/x86/include/asm/vdso.h                      |   2 +
>  fs/proc/namespaces.c                             |   3 +
>  fs/proc/uptime.c                                 |   3 +
>  fs/timerfd.c                                     |  16 +-
>  include/linux/nsproxy.h                          |   1 +
>  include/linux/proc_ns.h                          |   1 +
>  include/linux/time_namespace.h                   |  72 +++++++
>  include/linux/timens_offsets.h                   |  25 +++
>  include/linux/user_namespace.h                   |   1 +
>  include/uapi/linux/sched.h                       |   1 +
>  init/Kconfig                                     |   8 +
>  kernel/Makefile                                  |   1 +
>  kernel/fork.c                                    |   3 +-
>  kernel/nsproxy.c                                 |  19 +-
>  kernel/time/hrtimer.c                            |   8 +
>  kernel/time/posix-timers.c                       |  89 ++++++++-
>  kernel/time/posix-timers.h                       |   2 +
>  kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
>  tools/testing/selftests/timens/.gitignore        |   5 +
>  tools/testing/selftests/timens/Makefile          |   6 +
>  tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
>  tools/testing/selftests/timens/config            |   1 +
>  tools/testing/selftests/timens/log.h             |  21 +++
>  tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
>  tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
>  tools/testing/selftests/timens/timer.c           |  95 ++++++++++
>  tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
>  33 files changed, 1272 insertions(+), 13 deletions(-)
>  create mode 100644 include/linux/time_namespace.h
>  create mode 100644 include/linux/timens_offsets.h
>  create mode 100644 kernel/time_namespace.c
>  create mode 100644 tools/testing/selftests/timens/.gitignore
>  create mode 100644 tools/testing/selftests/timens/Makefile
>  create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
>  create mode 100644 tools/testing/selftests/timens/config
>  create mode 100644 tools/testing/selftests/timens/log.h
>  create mode 100644 tools/testing/selftests/timens/procfs.c
>  create mode 100644 tools/testing/selftests/timens/timens.c
>  create mode 100644 tools/testing/selftests/timens/timer.c
>  create mode 100644 tools/testing/selftests/timens/timerfd.c

Andrey Vagin Sept. 24, 2018, 8:51 p.m. UTC | #2

On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov <dima@arista.com> writes:
> 
> > Discussions around time virtualization are there for a long time.
> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> > From that time, the topic appears on and off in various discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > “It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up.” (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> > start points for them are not defined and are different for each running
> > system. When a container is migrated from one node to another, all
> > clocks have to be restored into consistent states; in other words, they
> > have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, this allows up to map it as 
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> > bit left, and it may be worth to use it to extend arguments of the clone
> > system call.
> >
> > * Realtime clock implementation details:
> >   Is having a simple offset enough?
> >   What to do when date and time is changed on the host?
> >   Is there a need to adjust vfs modification and creation times? 
> >   Implementation for adjtime() syscall.
> 
> Overall I support this effort.  In my quick skim this code looked good.

Hi Eric,

Thank you for the feedback.

> 
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing googles smoothing of leap seconds while
> another namespace takes the leap second.
> 
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace.  That
> structure already keeps offsets for all of the various clocks from
> the kerne internal time sources.  What would be needed would be to
> pass in an appropriate time namespace pointer.
> 
> I could be completely wrong as I have not take the time to completely
> trace through the code.  Have you looked at pushing the time namespace
> down as far as tk_core?
> 
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused.  Allowing testing of the kernel's
> time code by setting up a new time namespace.  So a person in production
> could setup a time namespace with the time set ahead a little  bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>

It is an interesting idea, but I have a few questions:

1. Does it mean that timekeeping_update() will be called for each
namespace? This functions is called periodically, it updates times on the
timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
overhead of this?

2. What will we do with vdso? It looks like we will have to have a
separate vsyscall_gtod_data for each ns and update each of them
separately.

> 
> 
> I don't know about the vfs.  I think the danger is being able to write
> dates in the future or in the past.  It appears that utimes(2) and
> utimesnat(2) already allow this except for status change.  So it is
> possible we simply don't care.  I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
> 
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
> 
> Given those those two guidlines above I don't think there is a need to
> change timestamsp the way the user namespace changes uid when displayed.
> 
> 
> 
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it.  We might even be able to get
> away with leaving the real time clock out of the time namespace.  If not
> we need to be very careful how the real time clock is abstracted.  I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
> 
> Eric
> 
> > Cc: Dmitry Safonov <0x7f454c46@gmail.com>
> > Cc: Adrian Reber <adrian@lisas.de>
> > Cc: Andrei Vagin <avagin@openvz.org>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com> 
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Jeff Dike <jdike@addtoit.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: containers@lists.linux-foundation.org
> > Cc: criu@openvz.org
> > Cc: linux-api@vger.kernel.org
> > Cc: x86@kernel.org
> >
> > Andrei Vagin (12):
> >   ns: Introduce Time Namespace
> >   timens: Add timens_offsets
> >   timens: Introduce CLOCK_MONOTONIC offsets
> >   timens: Introduce CLOCK_BOOTTIME offset
> >   timerfd/timens: Take into account ns clock offsets
> >   kernel: Take into account timens clock offsets in clock_nanosleep
> >   x86/vdso/timens: Add offsets page in vvar
> >   x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
> >   posix-timers/timens: Take into account clock offsets
> >   selftest/timens: Add test for timerfd
> >   selftest/timens: Add test for clock_nanosleep
> >   timens/selftest: Add timer offsets test
> >
> > Dmitry Safonov (8):
> >   timens: Shift /proc/uptime
> >   x86/vdso: Restrict splitting vvar vma
> >   x86/vdso: Purge timens page on setns()/unshare()/clone()
> >   x86/vdso: Look for vvar vma to purge timens page
> >   timens: Add align for timens_offsets
> >   timens: Optimize zero-offsets
> >   selftest: Add Time Namespace test for supported clocks
> >   timens/selftest: Add procfs selftest
> >
> >  arch/Kconfig                                     |   5 +
> >  arch/x86/Kconfig                                 |   1 +
> >  arch/x86/entry/vdso/vclock_gettime.c             |  52 +++++
> >  arch/x86/entry/vdso/vdso-layout.lds.S            |   9 +-
> >  arch/x86/entry/vdso/vdso2c.c                     |   3 +
> >  arch/x86/entry/vdso/vma.c                        |  67 +++++++
> >  arch/x86/include/asm/vdso.h                      |   2 +
> >  fs/proc/namespaces.c                             |   3 +
> >  fs/proc/uptime.c                                 |   3 +
> >  fs/timerfd.c                                     |  16 +-
> >  include/linux/nsproxy.h                          |   1 +
> >  include/linux/proc_ns.h                          |   1 +
> >  include/linux/time_namespace.h                   |  72 +++++++
> >  include/linux/timens_offsets.h                   |  25 +++
> >  include/linux/user_namespace.h                   |   1 +
> >  include/uapi/linux/sched.h                       |   1 +
> >  init/Kconfig                                     |   8 +
> >  kernel/Makefile                                  |   1 +
> >  kernel/fork.c                                    |   3 +-
> >  kernel/nsproxy.c                                 |  19 +-
> >  kernel/time/hrtimer.c                            |   8 +
> >  kernel/time/posix-timers.c                       |  89 ++++++++-
> >  kernel/time/posix-timers.h                       |   2 +
> >  kernel/time_namespace.c                          | 230 +++++++++++++++++++++++
> >  tools/testing/selftests/timens/.gitignore        |   5 +
> >  tools/testing/selftests/timens/Makefile          |   6 +
> >  tools/testing/selftests/timens/clock_nanosleep.c |  98 ++++++++++
> >  tools/testing/selftests/timens/config            |   1 +
> >  tools/testing/selftests/timens/log.h             |  21 +++
> >  tools/testing/selftests/timens/procfs.c          | 145 ++++++++++++++
> >  tools/testing/selftests/timens/timens.c          | 196 +++++++++++++++++++
> >  tools/testing/selftests/timens/timer.c           |  95 ++++++++++
> >  tools/testing/selftests/timens/timerfd.c         |  96 ++++++++++
> >  33 files changed, 1272 insertions(+), 13 deletions(-)
> >  create mode 100644 include/linux/time_namespace.h
> >  create mode 100644 include/linux/timens_offsets.h
> >  create mode 100644 kernel/time_namespace.c
> >  create mode 100644 tools/testing/selftests/timens/.gitignore
> >  create mode 100644 tools/testing/selftests/timens/Makefile
> >  create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
> >  create mode 100644 tools/testing/selftests/timens/config
> >  create mode 100644 tools/testing/selftests/timens/log.h
> >  create mode 100644 tools/testing/selftests/timens/procfs.c
> >  create mode 100644 tools/testing/selftests/timens/timens.c
> >  create mode 100644 tools/testing/selftests/timens/timer.c
> >  create mode 100644 tools/testing/selftests/timens/timerfd.c

Eric W. Biederman Sept. 24, 2018, 10:02 p.m. UTC | #3

Andrey Vagin <avagin@virtuozzo.com> writes:

> On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> Dmitry Safonov <dima@arista.com> writes:
>> 
>> > Discussions around time virtualization are there for a long time.
>> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> > From that time, the topic appears on and off in various discussions.
>> >
>> > There are two main use cases for time namespaces:
>> > 1. change date and time inside a container;
>> > 2. adjust clocks for a container restored from a checkpoint.
>> >
>> > “It seems like this might be one of the last major obstacles keeping
>> > migration from being used in production systems, given that not all
>> > containers and connections can be migrated as long as a time dependency
>> > is capable of messing it up.” (by github.com/dav-ell)
>> >
>> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> > start points for them are not defined and are different for each running
>> > system. When a container is migrated from one node to another, all
>> > clocks have to be restored into consistent states; in other words, they
>> > have to continue running from the same points where they have been
>> > dumped.
>> >
>> > The main idea behind this patch set is adding per-namespace offsets for
>> > system clocks. When a process in a non-root time namespace requests
>> > time of a clock, a namespace offset is added to the current value of
>> > this clock on a host and the sum is returned.
>> >
>> > All offsets are placed on a separate page, this allows up to map it as 
>> > part of vvar into user processes and use offsets from vdso calls.
>> >
>> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> > clocks.
>> >
>> > Questions to discuss:
>> >
>> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> > bit left, and it may be worth to use it to extend arguments of the clone
>> > system call.
>> >
>> > * Realtime clock implementation details:
>> >   Is having a simple offset enough?
>> >   What to do when date and time is changed on the host?
>> >   Is there a need to adjust vfs modification and creation times? 
>> >   Implementation for adjtime() syscall.
>> 
>> Overall I support this effort.  In my quick skim this code looked good.
>
> Hi Eric,
>
> Thank you for the feedback.
>
>> 
>> My feeling is that we need to be able to support running ntpd and
>> support one namespace doing googles smoothing of leap seconds while
>> another namespace takes the leap second.
>> 
>> What I was imagining when I was last thinking about this was one
>> instance of struct timekeeper aka tk_core per time namespace.  That
>> structure already keeps offsets for all of the various clocks from
>> the kerne internal time sources.  What would be needed would be to
>> pass in an appropriate time namespace pointer.
>> 
>> I could be completely wrong as I have not take the time to completely
>> trace through the code.  Have you looked at pushing the time namespace
>> down as far as tk_core?
>> 
>> What I think would be the big advantage (besides ntp working) is that
>> the bulk of the code could be reused.  Allowing testing of the kernel's
>> time code by setting up a new time namespace.  So a person in production
>> could setup a time namespace with the time set ahead a little  bit and
>> be able to verify that the kernel handles the upcoming leap second
>> properly.
>>
>
> It is an interesting idea, but I have a few questions:
>
> 1. Does it mean that timekeeping_update() will be called for each
> namespace? This functions is called periodically, it updates times on the
> timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> overhead of this?

I don't know if periodically is a proper characterization.  There may be
a code path that does that.  But from what I can see timekeeping_update
is the guts of settimeofday (and a few related functions).

So it appears to make sense for timekeeping_update to be per namespace.

Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
look like you would have to periodically update things, but I don't know
big that period would be.  As long as the period is reasonably large,
or the time namespaces were sufficiently deschronized it should not
be a problem.  But that is the class of problem that could make
my ideal impractical if there is measuarable overhead.

Where were you seeing timekeeping_update being called periodically?

> 2. What will we do with vdso? It looks like we will have to have a
> separate vsyscall_gtod_data for each ns and update each of them
> separately.

Yes.  But you don't have to have introduce another variable just make
certain vsyscall_gtod_data is a page aligned thing per time namespace.

If I read the summary of the existing patchset something very similiar
is already going on.

Each process would only map one.  And unshare of the time namespace
would need to act like the pid namespace or be limited to only being
allowed when there is only a single task using the mm.

Eric

Andrey Vagin Sept. 25, 2018, 1:42 a.m. UTC | #4

On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> Andrey Vagin <avagin@virtuozzo.com> writes:
> 
> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> >> Dmitry Safonov <dima@arista.com> writes:
> >> 
> >> > Discussions around time virtualization are there for a long time.
> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> >> > From that time, the topic appears on and off in various discussions.
> >> >
> >> > There are two main use cases for time namespaces:
> >> > 1. change date and time inside a container;
> >> > 2. adjust clocks for a container restored from a checkpoint.
> >> >
> >> > “It seems like this might be one of the last major obstacles keeping
> >> > migration from being used in production systems, given that not all
> >> > containers and connections can be migrated as long as a time dependency
> >> > is capable of messing it up.” (by github.com/dav-ell)
> >> >
> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> >> > start points for them are not defined and are different for each running
> >> > system. When a container is migrated from one node to another, all
> >> > clocks have to be restored into consistent states; in other words, they
> >> > have to continue running from the same points where they have been
> >> > dumped.
> >> >
> >> > The main idea behind this patch set is adding per-namespace offsets for
> >> > system clocks. When a process in a non-root time namespace requests
> >> > time of a clock, a namespace offset is added to the current value of
> >> > this clock on a host and the sum is returned.
> >> >
> >> > All offsets are placed on a separate page, this allows up to map it as 
> >> > part of vvar into user processes and use offsets from vdso calls.
> >> >
> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> >> > clocks.
> >> >
> >> > Questions to discuss:
> >> >
> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >> > bit left, and it may be worth to use it to extend arguments of the clone
> >> > system call.
> >> >
> >> > * Realtime clock implementation details:
> >> >   Is having a simple offset enough?
> >> >   What to do when date and time is changed on the host?
> >> >   Is there a need to adjust vfs modification and creation times? 
> >> >   Implementation for adjtime() syscall.
> >> 
> >> Overall I support this effort.  In my quick skim this code looked good.
> >
> > Hi Eric,
> >
> > Thank you for the feedback.
> >
> >> 
> >> My feeling is that we need to be able to support running ntpd and
> >> support one namespace doing googles smoothing of leap seconds while
> >> another namespace takes the leap second.
> >> 
> >> What I was imagining when I was last thinking about this was one
> >> instance of struct timekeeper aka tk_core per time namespace.  That
> >> structure already keeps offsets for all of the various clocks from
> >> the kerne internal time sources.  What would be needed would be to
> >> pass in an appropriate time namespace pointer.
> >> 
> >> I could be completely wrong as I have not take the time to completely
> >> trace through the code.  Have you looked at pushing the time namespace
> >> down as far as tk_core?
> >> 
> >> What I think would be the big advantage (besides ntp working) is that
> >> the bulk of the code could be reused.  Allowing testing of the kernel's
> >> time code by setting up a new time namespace.  So a person in production
> >> could setup a time namespace with the time set ahead a little  bit and
> >> be able to verify that the kernel handles the upcoming leap second
> >> properly.
> >>
> >
> > It is an interesting idea, but I have a few questions:
> >
> > 1. Does it mean that timekeeping_update() will be called for each
> > namespace? This functions is called periodically, it updates times on the
> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> > overhead of this?
> 
> I don't know if periodically is a proper characterization.  There may be
> a code path that does that.  But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
> 
> So it appears to make sense for timekeeping_update to be per namespace.
> 
> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> big that period would be.  As long as the period is reasonably large,
> or the time namespaces were sufficiently deschronized it should not
> be a problem.  But that is the class of problem that could make
> my ideal impractical if there is measuarable overhead.
> 
> Where were you seeing timekeeping_update being called periodically?

timekeeping_update() is called HZ times per-second:

[   67.912858]  timekeeping_update.cold.26+0x5/0xa
[   67.913332]  timekeeping_advance+0x361/0x5c0
[   67.913857]  ? tick_sched_do_timer+0x55/0x70
[   67.914409]  ? tick_sched_do_timer+0x70/0x70
[   67.914947]  tick_sched_do_timer+0x55/0x70
[   67.915505]  tick_sched_timer+0x27/0x70
[   67.916042]  __hrtimer_run_queues+0x10f/0x440
[   67.916639]  hrtimer_interrupt+0x100/0x220
[   67.917305]  smp_apic_timer_interrupt+0x79/0x220
[   67.918030]  apic_timer_interrupt+0xf/0x20

> 
> > 2. What will we do with vdso? It looks like we will have to have a
> > separate vsyscall_gtod_data for each ns and update each of them
> > separately.
> 
> Yes.  But you don't have to have introduce another variable just make
> certain vsyscall_gtod_data is a page aligned thing per time namespace.
> 
> If I read the summary of the existing patchset something very similiar
> is already going on.

I mean vsyscall_gtod_data has some data which are often updated. There
are timestamps for monotonic and wall clocks. clock_gettime() reads a
time stamp from vsyscall_gtod_data and then use tsc to approximate the
current value of a clock.

Actually, this is not the second question, it is a part of the first
question. update_vsyscall() is called from timekeeping_update().

> 
> Each process would only map one.  And unshare of the time namespace
> would need to act like the pid namespace or be limited to only being
> allowed when there is only a single task using the mm.
> 
> Eric

Eric W. Biederman Sept. 26, 2018, 5:36 p.m. UTC | #5

Andrey Vagin <avagin@virtuozzo.com> writes:

> On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
>> Andrey Vagin <avagin@virtuozzo.com> writes:
>> 
>> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> >> Dmitry Safonov <dima@arista.com> writes:
>> >> 
>> >> > Discussions around time virtualization are there for a long time.
>> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
>> >> > From that time, the topic appears on and off in various discussions.
>> >> >
>> >> > There are two main use cases for time namespaces:
>> >> > 1. change date and time inside a container;
>> >> > 2. adjust clocks for a container restored from a checkpoint.
>> >> >
>> >> > “It seems like this might be one of the last major obstacles keeping
>> >> > migration from being used in production systems, given that not all
>> >> > containers and connections can be migrated as long as a time dependency
>> >> > is capable of messing it up.” (by github.com/dav-ell)
>> >> >
>> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
>> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
>> >> > start points for them are not defined and are different for each running
>> >> > system. When a container is migrated from one node to another, all
>> >> > clocks have to be restored into consistent states; in other words, they
>> >> > have to continue running from the same points where they have been
>> >> > dumped.
>> >> >
>> >> > The main idea behind this patch set is adding per-namespace offsets for
>> >> > system clocks. When a process in a non-root time namespace requests
>> >> > time of a clock, a namespace offset is added to the current value of
>> >> > this clock on a host and the sum is returned.
>> >> >
>> >> > All offsets are placed on a separate page, this allows up to map it as 
>> >> > part of vvar into user processes and use offsets from vdso calls.
>> >> >
>> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
>> >> > clocks.
>> >> >
>> >> > Questions to discuss:
>> >> >
>> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
>> >> > bit left, and it may be worth to use it to extend arguments of the clone
>> >> > system call.
>> >> >
>> >> > * Realtime clock implementation details:
>> >> >   Is having a simple offset enough?
>> >> >   What to do when date and time is changed on the host?
>> >> >   Is there a need to adjust vfs modification and creation times? 
>> >> >   Implementation for adjtime() syscall.
>> >> 
>> >> Overall I support this effort.  In my quick skim this code looked good.
>> >
>> > Hi Eric,
>> >
>> > Thank you for the feedback.
>> >
>> >> 
>> >> My feeling is that we need to be able to support running ntpd and
>> >> support one namespace doing googles smoothing of leap seconds while
>> >> another namespace takes the leap second.
>> >> 
>> >> What I was imagining when I was last thinking about this was one
>> >> instance of struct timekeeper aka tk_core per time namespace.  That
>> >> structure already keeps offsets for all of the various clocks from
>> >> the kerne internal time sources.  What would be needed would be to
>> >> pass in an appropriate time namespace pointer.
>> >> 
>> >> I could be completely wrong as I have not take the time to completely
>> >> trace through the code.  Have you looked at pushing the time namespace
>> >> down as far as tk_core?
>> >> 
>> >> What I think would be the big advantage (besides ntp working) is that
>> >> the bulk of the code could be reused.  Allowing testing of the kernel's
>> >> time code by setting up a new time namespace.  So a person in production
>> >> could setup a time namespace with the time set ahead a little  bit and
>> >> be able to verify that the kernel handles the upcoming leap second
>> >> properly.
>> >>
>> >
>> > It is an interesting idea, but I have a few questions:
>> >
>> > 1. Does it mean that timekeeping_update() will be called for each
>> > namespace? This functions is called periodically, it updates times on the
>> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
>> > overhead of this?
>> 
>> I don't know if periodically is a proper characterization.  There may be
>> a code path that does that.  But from what I can see timekeeping_update
>> is the guts of settimeofday (and a few related functions).
>> 
>> So it appears to make sense for timekeeping_update to be per namespace.
>> 
>> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
>> look like you would have to periodically update things, but I don't know
>> big that period would be.  As long as the period is reasonably large,
>> or the time namespaces were sufficiently deschronized it should not
>> be a problem.  But that is the class of problem that could make
>> my ideal impractical if there is measuarable overhead.
>> 
>> Where were you seeing timekeeping_update being called periodically?
>
> timekeeping_update() is called HZ times per-second:
>
> [   67.912858]  timekeeping_update.cold.26+0x5/0xa
> [   67.913332]  timekeeping_advance+0x361/0x5c0
> [   67.913857]  ? tick_sched_do_timer+0x55/0x70
> [   67.914409]  ? tick_sched_do_timer+0x70/0x70
> [   67.914947]  tick_sched_do_timer+0x55/0x70
> [   67.915505]  tick_sched_timer+0x27/0x70
> [   67.916042]  __hrtimer_run_queues+0x10f/0x440
> [   67.916639]  hrtimer_interrupt+0x100/0x220
> [   67.917305]  smp_apic_timer_interrupt+0x79/0x220
> [   67.918030]  apic_timer_interrupt+0xf/0x20

Interesting.

Reading the code the calling sequence there is:
tick_sched_do_timer
   tick_do_update_jiffies64
      update_wall_time
          timekeeping_advance
             timekeepging_update

If I read that properly under the right nohz circumstances that update
can be delayed indefinitely.

So I think we could prototype a time namespace that was per
timekeeping_update and just had update_wall_time iterate through
all of the time namespaces.

I don't think the naive version would scale to very many time
namespaces.

At the same time using the techniques from the nohz work and a little
smarts I expect we could get the code to scale.

I think this direction is definitely worth exploring.  My experience
with namespaces is that if we don't get the advanced features working
there is little to no interest from the core developers of the code,
and the namespaces don't solve additional problems.  Which makes the
namespace a hard sell.  Especially when it does not solve problems the
developers of the subsystem have.

The advantage of timekeeping_update per time namespace is that it allows
different lengths of seconds per time namespace.  Which allows testing
ntp and the kernel in interesting ways while still having a working
production configuration on the same system.

Eric

Dmitry Safonov Sept. 26, 2018, 5:59 p.m. UTC | #6

2018-09-26 18:36 GMT+01:00 Eric W. Biederman <ebiederm@xmission.com>:
> The advantage of timekeeping_update per time namespace is that it allows
> different lengths of seconds per time namespace.  Which allows testing
> ntp and the kernel in interesting ways while still having a working
> production configuration on the same system.

Just a quick note: the different length of second per namespace sounds
very interesting in my POV, I remember I've seen this article:
http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-A-Smart-Grid-Modeling-Platform-Combining-Electrical-Power-Distributtion-System-Simulation-and-Software-Defined-Networking-Emulation.pdf

And their realisation with a simulation of time going with different speed
per-pid  (with vdso disabled):
https://github.com/littlepretty/VirtualTimeKernel

Thanks,
             Dmitry

Thomas Gleixner Sept. 27, 2018, 9:30 p.m. UTC | #7

On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> Reading the code the calling sequence there is:
> tick_sched_do_timer
>    tick_do_update_jiffies64
>       update_wall_time
>           timekeeping_advance
>              timekeepging_update
> 
> If I read that properly under the right nohz circumstances that update
> can be delayed indefinitely.
> 
> So I think we could prototype a time namespace that was per
> timekeeping_update and just had update_wall_time iterate through
> all of the time namespaces.

Please don't go there. timekeeping_update() is already heavy and walking
through a gazillion of namespaces will just make it horrible,

> I don't think the naive version would scale to very many time
> namespaces.

:)

> At the same time using the techniques from the nohz work and a little
> smarts I expect we could get the code to scale.

You'd need to invoke the update when the namespace is switched in and
hasn't been updated since the last tick happened. That might be doable, but
you also need to take the wraparound constraints of the underlying
clocksources into account, which again can cause walking all name spaces
when they are all idle long enough.

From there it becomes hairy, because it's not only timekeeping,
i.e. reading time, this is also affecting all timers which are armed from a
namespace.

That gets really ugly because when you do settimeofday() or adjtimex() for
a particular namespace, then you have to search for all armed timers of
that namespace and adjust them.

The original posix timer code had the same issue because it mapped the
clock realtime timers to the timer wheel so any setting of the clock caused
a full walk of all armed timers, disarming, adjusting and requeing
them. That's horrible not only performance wise, it's also a locking
nightmare of all sorts.

Add time skew via NTP/PTP into the picture and you might have to adjust
timers as well, because you need to guarantee that they are not expiring
early.

I haven't looked through Dimitry's patches yet, but I don't see how this
can work at all without introducing subtle issues all over the place.

Thanks,

	tglx

Thomas Gleixner Sept. 27, 2018, 9:41 p.m. UTC | #8

On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
> 
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

And just a quick scan tells me that this is broken. Timers will expire
early or late. The latter is acceptible to some extent, but larger delays
might come with surprise. Expiring early is an absolute nono.

Thanks,

	tglx

Eric W. Biederman Sept. 28, 2018, 5:03 p.m. UTC | #9

Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>>    tick_do_update_jiffies64
>>       update_wall_time
>>           timekeeping_advance
>>              timekeepging_update
>> 
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>> 
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.

The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around?  I have not dug
deeply enough into the code to see that yet.

> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dimitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.

Then it sounds like this will take some more digging.

Please pardon me for thinking out load.

There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.

Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't loose time when looking
at that time source.

There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.

I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
   real time.

   The problem is that CLOCK_MONOTONIC between nodes in the cluster
   has not relation ship to each other (except a synchronized length of
   the second).  So applications that migrate can see CLOCK_MONOTONIC
   and CLOCK_BOOTTIME go backwards.

   This is the truly pressing problem and adding some kind of offset
   sounds like it would be the solution.  Possibly by allowing a boot
   time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.

2) Dealing with two separate time management domains.  Say a machine
   that needes to deal with both something inside of google where they
   slew time to avoid leap time seconds and something in the outside
   world proper UTC time is kept as an offset from TAI with the
   occasional leap seconds.

   In the later case it would fundamentally require having seconds of
   different length.

A pure 64bit nanoseond counter is good for 500 years.  So 64bit
variables can be used to hold time, and everything can be converted from
there.

This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.

That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument is probably enough to generate
a 64bit tick counter.

If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited.  Just deal the accounting needed to cope with
tick rollover.

Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over.  With limited precision arithmetic and various
optimizations I don't think it is that simple to implement but it feels
like it should be very little extra work.

For timers my inclination would be to assume no adjustments to the
current time parameters and set the timer to go off then.   If the time
on the appropriate clock has been changed since the timer was set and
the timer is going off early reschedule so the timer fires at the
appropriate time.

With the above I think it is theoretically possible to build a time
namespace that supports multiple lengths of second, and does not have
much overhead.

Not that I think a final implementation would necessary look like what I
have described. I just think it is possible with extreme care to evolve
the current code base into something that can efficiently handle
multiple time domains with slightly different lenghts of second.

Thomas does it sound like I am completely out of touch with reality?

It does though sound like it is going to take some serious digging
through the code to understand how what everything does and how and why
everthing works the way it does.  Not something grafted on top with just
a cursory understanding of how the code works.

Eric

Thomas Gleixner Sept. 28, 2018, 7:32 p.m. UTC | #10

Eric,

On Fri, 28 Sep 2018, Eric W. Biederman wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.

It's done by limiting the NOHZ idle time when all CPUs are going into deep
sleep for a long time, i.e. we make sure that at least one CPU comes back
sufficiently _before_ the wraparound happens and invokes the update
function.

It's not so much a problem for TSC, but not every clocksource the kernel
supports has wraparound times in the range of hundreds of years.

But yes, your idea of keeping track of wraparounds might work. Tricky, but
looks feasible on first sight, but we should be aware of the dragons.

> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.

Yes. That's the readout side. This one is doable. But now look at timers.

If you arm the timer from a name space, then it needs to be converted to
host time in order to sort it into the hrtimer queue and at some point arm
the clockevent device for it. This works as long as host and name space
time have a constant offset and the same skew.

Once the name space time has a different skew this falls apart because the
armed timer will either expire late or early.

Late might be acceptable, early violates the spec. You could do an extra
check for rescheduling it, if it's early, but that requires to store the
name space time accessor in the hrtimer itself because not every timer
expiry happens so that it can be checked in the name space context (think
signal based timers). We need to add this extra magic right into
__hrtimer_run_queues() which is called from the hard and soft interrupt. We
really don't want to touch all relevant callbacks or syscalls. The latter
is not sufficient anyway for signal based timer delivery.

That's going to be interesting in terms of synchronization and might also
cause substantial overhead at least for the timers which belong to name
spaces.

But that also means that anything which is early can and probably will
cause rearming of the timer hardware possibly for a very short delta. We
need to think about whether this can be abused to create interrupt storms.

Now if you accept a bit late, which I'm not really happy about, then you
surely won't accept very late, i.e. hours, days. But that can happen when
settimeofday() comes into play. Right now with a single time domain, this
is easy. When settimeofday() or adjtimex() makes time jump, we just go and
reprogramm the hardware timers accordingly, which might also result in
immediate expiry of timers.

But this does not help for time jumps in name spaces because the timer is
enqueued on the host time base.

And no, we should not think about creating per name space hrtimer queues
and then have to walk through all of them for finding the first expiring
timer in order to arm the hardware. That cannot scale.

Walking all hrtimer bases on all CPUs and check all queued timers whether
they belong to the affected name space does not scale either.

So we'd need to keep track of queued timers belonging to a name space and
then just handle them. Interesting locking problem and also a scalability
issue because this might need to be done on all online CPUs. Haven't
thought it through, but it makes me shudder.

> I see two fundamental driving cases for a time namespace.

<SNIP>

I completely understand the problem you are trying to solve and yes, the
read out of time should be a solvable problem.

> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then.   If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.

See above.

> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to evolve
> the current code base into something that can efficiently handle
> multiple time domains with slightly different lenghts of second.

Yes, it really needs some serious thoughts and timekeeping is a really
complex place especially with NTP/PTP in play. We had quite some quality
time to make it work correctly and reliably, now you come along and want to
transform it into a multidimensional puzzle. :)

> Thomas does it sound like I am completely out of touch with reality?

Which reality are you talking about? :)

> It does though sound like it is going to take some serious digging
> through the code to understand how what everything does and how and why
> everthing works the way it does.  Not something grafted on top with just
> a cursory understanding of how the code works.

I fully agree and I'm happy to help with explanations and ideas and being
the one who shoots holes into yours.

Thanks,

	tglx

Eric W. Biederman Oct. 1, 2018, 9:05 a.m. UTC | #11

Thomas Gleixner <tglx@linutronix.de> writes:

> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>> 
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around?  I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.

Oh.  Yes.  Definitely.  A key enabler of any namespace implementation is
figuring out how to tame the dragons.

>> Please pardon me for thinking out load.
>> 
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>> 
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>> 
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> thought it through, but it makes me shudder.

Yes.  I can see how this is a dragon that we need to figure out how to
tame.  It already exist somewhat for CLOCK_MONOTONIC vs CLOCK_REALTIME
but still.

>> I see two fundamental driving cases for a time namespace.
>
> <SNIP>
>
> I completely understand the problem you are trying to solve and yes, the
> read out of time should be a solvable problem.

There is simplified subproblem that I want to ask about but I will reply
separately for that.

>> Not that I think a final implementation would necessary look like what I
>> have described. I just think it is possible with extreme care to evolve
>> the current code base into something that can efficiently handle
>> multiple time domains with slightly different lenghts of second.
>
> Yes, it really needs some serious thoughts and timekeeping is a really
> complex place especially with NTP/PTP in play. We had quite some quality
> time to make it work correctly and reliably, now you come along and want to
> transform it into a multidimensional puzzle. :)

I thought it was Einstein who pointed out what a puzzle timekeeping is,
with the rest of us just playing catch up. ;-)

>> It does though sound like it is going to take some serious digging
>> through the code to understand how what everything does and how and why
>> everthing works the way it does.  Not something grafted on top with just
>> a cursory understanding of how the code works.
>
> I fully agree and I'm happy to help with explanations and ideas and being
> the one who shoots holes into yours.

Sounds good.

Eric

Eric W. Biederman Oct. 1, 2018, 9:15 a.m. UTC | #12

In the context of process migration there is a simpler subproblem that I
think it is worth exploring if we can do something about.

For a cluster of machines all running with synchronized
clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
machines.   Not having a matching CLOCK_MONOTONIC prevents successful
process migration between nodes in that cluster.

Would it be possible to allow setting CLOCK_MONOTONIC at the very
beginning of time?  So that all of the nodes in a cluster can be in
sync?

No change in skew just in offset for CLOCK_MONOTONIC.

There are also dragons involved in coordinating things so that
CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
know if allowing CLOCK_MONOTONIC to be set would be practical but it
seems work exploring all on it's own.

Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
your problem that is you are looking at a time namespace to solve?

Eric

Thomas Gleixner Oct. 1, 2018, 6:52 p.m. UTC | #13

On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> In the context of process migration there is a simpler subproblem that I
> think it is worth exploring if we can do something about.
> 
> For a cluster of machines all running with synchronized
> clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> machines.   Not having a matching CLOCK_MONOTONIC prevents successful
> process migration between nodes in that cluster.
> 
> Would it be possible to allow setting CLOCK_MONOTONIC at the very
> beginning of time?  So that all of the nodes in a cluster can be in
> sync?
> 
> No change in skew just in offset for CLOCK_MONOTONIC.
> 
> There are also dragons involved in coordinating things so that
> CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
> know if allowing CLOCK_MONOTONIC to be set would be practical but it
> seems work exploring all on it's own.

It's used very early on in the kernel, so that would be a major surprise
for many things including user space which has expectations on clock
monotonic.

It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
the way you described and then in name spaces make it possible to magically
map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.

It still wouldn't allow to have different NTP/PTP time domains, but might
be a good start to address the main migration headaches.

Thanks,

	tglx

Andrey Vagin Oct. 1, 2018, 8:51 p.m. UTC | #14

On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
> 
> In the context of process migration there is a simpler subproblem that I
> think it is worth exploring if we can do something about.
> 
> For a cluster of machines all running with synchronized
> clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> machines.   Not having a matching CLOCK_MONOTONIC prevents successful
> process migration between nodes in that cluster.
> 
> Would it be possible to allow setting CLOCK_MONOTONIC at the very
> beginning of time?  So that all of the nodes in a cluster can be in
> sync?

Here is a question about how to synchronize clocks between nodes. It
looks like we will need to have a working network for this, but a
network configuration may be non-trivial and it can require to run a few
processes which can use CLOCK_MONOTNIC...

> 
> No change in skew just in offset for CLOCK_MONOTONIC.
> 
> There are also dragons involved in coordinating things so that
> CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
> know if allowing CLOCK_MONOTONIC to be set would be practical but it
> seems work exploring all on it's own.
> 
> Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
> your problem that is you are looking at a time namespace to solve?

Process migration is only one of use-cases. Another use-case is
restoring from snapshots. It may be even more popular than process
migration. We can't guarantee that all snapshots will be done in one
cluster. For example, a user meets a bug, does a container snapshot and
attaches it to a bug report.

> 
> Eric

Andrey Vagin Oct. 1, 2018, 11:20 p.m. UTC | #15

On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> > 
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> And just a quick scan tells me that this is broken. Timers will expire
> early or late. The latter is acceptible to some extent, but larger delays
> might come with surprise. Expiring early is an absolute nono.

Do you mean that we have to adjust all timers after changing offset for
CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
monotonic and boot times will be set immediately after creating a time
namespace before using any timers.

It is interesting to think what a use-case for changing these offsets
after creating timers. It may be useful for testing needs. A user sets a
timer in an hour and then change a clock offset forward and check that a
test application handles the timer properly.

> 
> Thanks,
> 
> 	tglx
>

Thomas Gleixner Oct. 2, 2018, 6:15 a.m. UTC | #16

On Mon, 1 Oct 2018, Andrey Vagin wrote:

> On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > > 
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > And just a quick scan tells me that this is broken. Timers will expire
> > early or late. The latter is acceptible to some extent, but larger delays
> > might come with surprise. Expiring early is an absolute nono.
> 
> Do you mean that we have to adjust all timers after changing offset for
> CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> monotonic and boot times will be set immediately after creating a time
> namespace before using any timers.

I explained that in detail in this thread, but it's not about the initial
setting of clock mono/boot before any timers have been armed.

It's about setting the offset or clock realtime (via settimeofday) when
timers are already armed. Also having a entirely different time domain,
e.g. separate NTP adjustments, makes that necessary.

Thanks,

	tglx

Thomas Gleixner Oct. 2, 2018, 6:16 a.m. UTC | #17

On Mon, 1 Oct 2018, Andrey Vagin wrote:
> On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
> > 
> > In the context of process migration there is a simpler subproblem that I
> > think it is worth exploring if we can do something about.
> > 
> > For a cluster of machines all running with synchronized
> > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
> > process migration between nodes in that cluster.
> > 
> > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > beginning of time?  So that all of the nodes in a cluster can be in
> > sync?
> 
> Here is a question about how to synchronize clocks between nodes. It
> looks like we will need to have a working network for this, but a
> network configuration may be non-trivial and it can require to run a few
> processes which can use CLOCK_MONOTNIC...
> 
> > 
> > No change in skew just in offset for CLOCK_MONOTONIC.
> > 
> > There are also dragons involved in coordinating things so that
> > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
> > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > seems work exploring all on it's own.
> > 
> > Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
> > your problem that is you are looking at a time namespace to solve?
> 
> Process migration is only one of use-cases. Another use-case is
> restoring from snapshots. It may be even more popular than process
> migration. We can't guarantee that all snapshots will be done in one
> cluster. For example, a user meets a bug, does a container snapshot and
> attaches it to a bug report.

Sure, but see my reply to Eric. That could be solved with that extra clock
id, which then gets mapped to monotonic for name spaces.

Thanks,

	tglx

Arnd Bergmann Oct. 2, 2018, 8 p.m. UTC | #18

On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> > In the context of process migration there is a simpler subproblem that I
> > think it is worth exploring if we can do something about.
> >
> > For a cluster of machines all running with synchronized
> > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
> > process migration between nodes in that cluster.
> >
> > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > beginning of time?  So that all of the nodes in a cluster can be in
> > sync?
> >
> > No change in skew just in offset for CLOCK_MONOTONIC.
> >
> > There are also dragons involved in coordinating things so that
> > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
> > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > seems work exploring all on it's own.
>
> It's used very early on in the kernel, so that would be a major surprise
> for many things including user space which has expectations on clock
> monotonic.
>
> It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
> the way you described and then in name spaces make it possible to magically
> map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>
> It still wouldn't allow to have different NTP/PTP time domains, but might
> be a good start to address the main migration headaches.

If we make CLOCK_MONOTONIC settable this way in a namespace,
do you think that should include device drivers that report timestamps
in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
interfaces?

Examples for drivers that can report timestamps are input, sound, v4l,
and drm. I think most of these can report stamps in either monotonic
or realtime base, while socket timestamps notably are always in
realtime.

We can probably get away with not setting the timebase for those
device drivers as long as the checkpoint/restart and migration features
are not expected to restore the state of an open character device
in that way. I don't know if that is a reasonable assumption to make
for the examples I listed.

      Arnd

Thomas Gleixner Oct. 2, 2018, 8:06 p.m. UTC | #19

On Tue, 2 Oct 2018, Arnd Bergmann wrote:
> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> > > In the context of process migration there is a simpler subproblem that I
> > > think it is worth exploring if we can do something about.
> > >
> > > For a cluster of machines all running with synchronized
> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
> > > process migration between nodes in that cluster.
> > >
> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > > beginning of time?  So that all of the nodes in a cluster can be in
> > > sync?
> > >
> > > No change in skew just in offset for CLOCK_MONOTONIC.
> > >
> > > There are also dragons involved in coordinating things so that
> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > > seems work exploring all on it's own.
> >
> > It's used very early on in the kernel, so that would be a major surprise
> > for many things including user space which has expectations on clock
> > monotonic.
> >
> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
> > the way you described and then in name spaces make it possible to magically
> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
> >
> > It still wouldn't allow to have different NTP/PTP time domains, but might
> > be a good start to address the main migration headaches.
> 
> If we make CLOCK_MONOTONIC settable this way in a namespace,
> do you think that should include device drivers that report timestamps
> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
> interfaces?

Uurgh. That gets messy very fast.

> Examples for drivers that can report timestamps are input, sound, v4l,
> and drm. I think most of these can report stamps in either monotonic
> or realtime base, while socket timestamps notably are always in
> realtime.
> 
> We can probably get away with not setting the timebase for those
> device drivers as long as the checkpoint/restart and migration features
> are not expected to restore the state of an open character device
> in that way. I don't know if that is a reasonable assumption to make
> for the examples I listed.

No idea. I'm not a container migration wizard.

Thanks,

	tglx

Dmitry Safonov Oct. 2, 2018, 9:05 p.m. UTC | #20

Hi Thomas, Andrei, Eric,

On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 1 Oct 2018, Andrey Vagin wrote:
>
> > On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > > timers as well, because you need to guarantee that they are not expiring
> > > > early.
> > > >
> > > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > > can work at all without introducing subtle issues all over the place.
> > >
> > > And just a quick scan tells me that this is broken. Timers will expire
> > > early or late. The latter is acceptible to some extent, but larger delays
> > > might come with surprise. Expiring early is an absolute nono.
> >
> > Do you mean that we have to adjust all timers after changing offset for
> > CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> > monotonic and boot times will be set immediately after creating a time
> > namespace before using any timers.
>
> I explained that in detail in this thread, but it's not about the initial
> setting of clock mono/boot before any timers have been armed.
>
> It's about setting the offset or clock realtime (via settimeofday) when
> timers are already armed. Also having a entirely different time domain,
> e.g. separate NTP adjustments, makes that necessary.

It looks like, there is a bit of misunderstanding each other:
Andrei was talking about the current RFC version, where we haven't
introduced offsets for clock realtime. While Thomas IIUC, is looking
how-to expand time namespace over realtime.

As CLOCK_REALTIME virtualization raises so many complex questions
like a different length of the second or list of realtime timers in ns we
haven't added any realization for it.

It seems like an initial introduction for timens can be expanded after to cover
realtime clocks too. While it may seem incomplete, it solves issues for
restoring/migration of real-world applications like nodejs, Oracle DB server
which fails after being restored if there is a leap in monotonic time.

While solving the mentioned issues, it doesn't bring overhead.
(well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
which will be done in v1).

Thomas, thanks much for your input - now we know that we'll need to
introduce list for timers in namespace when we'll add realtime clocks.
Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
concept than offsets per-namespace?

Thanks,
             Dmitry

Thomas Gleixner Oct. 2, 2018, 9:26 p.m. UTC | #21

Dmitry,

On Tue, 2 Oct 2018, Dmitry Safonov wrote:
> On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner <tglx@linutronix.de> wrote:
> > I explained that in detail in this thread, but it's not about the initial
> > setting of clock mono/boot before any timers have been armed.
> >
> > It's about setting the offset or clock realtime (via settimeofday) when
> > timers are already armed. Also having a entirely different time domain,
> > e.g. separate NTP adjustments, makes that necessary.
> 
> It looks like, there is a bit of misunderstanding each other:
> Andrei was talking about the current RFC version, where we haven't
> introduced offsets for clock realtime. While Thomas IIUC, is looking
> how-to expand time namespace over realtime.
>
> As CLOCK_REALTIME virtualization raises so many complex questions
> like a different length of the second or list of realtime timers in ns we
> haven't added any realization for it.
> 
> It seems like an initial introduction for timens can be expanded after to cover
> realtime clocks too. While it may seem incomplete, it solves issues for
> restoring/migration of real-world applications like nodejs, Oracle DB server
> which fails after being restored if there is a leap in monotonic time.

Well, yes. But you really have to think about the full picture. Just adding
part of the overall solution right now, just because it can be glued into
the code easily, is not the best approach IMO as it might result in
substantial rework of the whole thing sooner than later. I really don't
want to end up with something which is not extensible and has to be
supported forever.

Just for the record, the current approach with name space offsets for
monotonic is also prone to malfunction vs. timers, unless you can prevent
changing the offset _after_ the namespace has been set up and timers have
been armed. I admit, that I did not look close enough to verify that.

> While solving the mentioned issues, it doesn't bring overhead.
> (well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
> which will be done in v1).
> 
> Thomas, thanks much for your input - now we know that we'll need to
> introduce list for timers in namespace when we'll add realtime clocks.
> Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
> concept than offsets per-namespace?

Haven't thought it through. This was just an idea in reaction to Eric's
question whether setting clock monotonic might be feasible. But yes, it
might be worth to think about it.

I think you should really define the long term requirements for time
namespaces and perhaps set some limitations in functionality upfront.

Thanks,

	tglx

Eric W. Biederman Oct. 3, 2018, 4:50 a.m. UTC | #22

Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, 2 Oct 2018, Arnd Bergmann wrote:
>> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
>> > > In the context of process migration there is a simpler subproblem that I
>> > > think it is worth exploring if we can do something about.
>> > >
>> > > For a cluster of machines all running with synchronized
>> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
>> > > machines.   Not having a matching CLOCK_MONOTONIC prevents successful
>> > > process migration between nodes in that cluster.
>> > >
>> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
>> > > beginning of time?  So that all of the nodes in a cluster can be in
>> > > sync?
>> > >
>> > > No change in skew just in offset for CLOCK_MONOTONIC.
>> > >
>> > > There are also dragons involved in coordinating things so that
>> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used.  So I don't
>> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
>> > > seems work exploring all on it's own.
>> >
>> > It's used very early on in the kernel, so that would be a major surprise
>> > for many things including user space which has expectations on clock
>> > monotonic.
>> >
>> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
>> > the way you described and then in name spaces make it possible to magically
>> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>> >
>> > It still wouldn't allow to have different NTP/PTP time domains, but might
>> > be a good start to address the main migration headaches.
>> 
>> If we make CLOCK_MONOTONIC settable this way in a namespace,
>> do you think that should include device drivers that report timestamps
>> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
>> interfaces?
>
> Uurgh. That gets messy very fast.
>
>> Examples for drivers that can report timestamps are input, sound, v4l,
>> and drm. I think most of these can report stamps in either monotonic
>> or realtime base, while socket timestamps notably are always in
>> realtime.
>> 
>> We can probably get away with not setting the timebase for those
>> device drivers as long as the checkpoint/restart and migration features
>> are not expected to restore the state of an open character device
>> in that way. I don't know if that is a reasonable assumption to make
>> for the examples I listed.
>
> No idea. I'm not a container migration wizard.

Direct access to hardware/drivers and not through an abstraction like
the vfs (an abstraction over block devices) can legitimately be handled
by hotplug events.  I unplug one keyboard I plug in another.

I don't know if the input layer is more of a general abstraction
or more of a hardware device.  I have not dug into it but my guess
is abstraction from what I have heard.

The scary difficulty here is if after restart input is reporting times
in CLOCK_MONOTONIC and the applications in the namespace are talking
about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
with a fixed offset the times don't match up.

So a time namespace absolutely needs to do is figure out how to deal
with all of the kernel interfaces reporting times and figure out how to
report them in the current time namespace.

Eric

Thomas Gleixner Oct. 3, 2018, 5:25 a.m. UTC | #23

On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> Direct access to hardware/drivers and not through an abstraction like
> the vfs (an abstraction over block devices) can legitimately be handled
> by hotplug events.  I unplug one keyboard I plug in another.
> 
> I don't know if the input layer is more of a general abstraction
> or more of a hardware device.  I have not dug into it but my guess
> is abstraction from what I have heard.
> 
> The scary difficulty here is if after restart input is reporting times
> in CLOCK_MONOTONIC and the applications in the namespace are talking
> about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
> with a fixed offset the times don't match up.
> 
> So a time namespace absolutely needs to do is figure out how to deal
> with all of the kernel interfaces reporting times and figure out how to
> report them in the current time namespace.

So you want to talk to Arnd who is leading the y2038 effort. He knowns how
many and which interfaces are involved aside of the obvious core timer
ones. It's quite an amount and the problem is that you really need to do
that at the interface level, because many of those time stamps are taken in
contexts which are completely oblivious of name spaces. Ditto for timeouts
and similar things which are handed in through these interfaces.

Thanks,

	tglx

Eric W. Biederman Oct. 3, 2018, 6:14 a.m. UTC | #24

Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
>> Direct access to hardware/drivers and not through an abstraction like
>> the vfs (an abstraction over block devices) can legitimately be handled
>> by hotplug events.  I unplug one keyboard I plug in another.
>> 
>> I don't know if the input layer is more of a general abstraction
>> or more of a hardware device.  I have not dug into it but my guess
>> is abstraction from what I have heard.
>> 
>> The scary difficulty here is if after restart input is reporting times
>> in CLOCK_MONOTONIC and the applications in the namespace are talking
>> about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
>> with a fixed offset the times don't match up.
>> 
>> So a time namespace absolutely needs to do is figure out how to deal
>> with all of the kernel interfaces reporting times and figure out how to
>> report them in the current time namespace.
>
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.

Yep.  That sounds right.

Eric

Thomas Gleixner Oct. 3, 2018, 6:14 a.m. UTC | #25

On Wed, 3 Oct 2018, Thomas Gleixner wrote:
> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> > Direct access to hardware/drivers and not through an abstraction like
> > the vfs (an abstraction over block devices) can legitimately be handled
> > by hotplug events.  I unplug one keyboard I plug in another.
> > 
> > I don't know if the input layer is more of a general abstraction
> > or more of a hardware device.  I have not dug into it but my guess
> > is abstraction from what I have heard.
> > 
> > The scary difficulty here is if after restart input is reporting times
> > in CLOCK_MONOTONIC and the applications in the namespace are talking
> > about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
> > with a fixed offset the times don't match up.
> > 
> > So a time namespace absolutely needs to do is figure out how to deal
> > with all of the kernel interfaces reporting times and figure out how to
> > report them in the current time namespace.
> 
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.

Plus you have to make sure, that any new interface will have that
treatment. For y2038 that's easy as we just require to use timespec64 for
new ones. For your problem that's not so trivial.

Thanks,

	tglx

Arnd Bergmann Oct. 3, 2018, 7:02 a.m. UTC | #26

On Wed, Oct 3, 2018 at 8:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Thomas Gleixner <tglx@linutronix.de> writes:
>
> > On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> >> Direct access to hardware/drivers and not through an abstraction like
> >> the vfs (an abstraction over block devices) can legitimately be handled
> >> by hotplug events.  I unplug one keyboard I plug in another.
> >>
> >> I don't know if the input layer is more of a general abstraction
> >> or more of a hardware device.  I have not dug into it but my guess
> >> is abstraction from what I have heard.
> >>
> >> The scary difficulty here is if after restart input is reporting times
> >> in CLOCK_MONOTONIC and the applications in the namespace are talking
> >> about times in CLOCK_MONOTONIC_SYNC.  Then there is an issue.  As even
> >> with a fixed offset the times don't match up.
> >>
> >> So a time namespace absolutely needs to do is figure out how to deal
> >> with all of the kernel interfaces reporting times and figure out how to
> >> report them in the current time namespace.
> >
> > So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> > many and which interfaces are involved aside of the obvious core timer
> > ones. It's quite an amount and the problem is that you really need to do
> > that at the interface level, because many of those time stamps are taken in
> > contexts which are completely oblivious of name spaces. Ditto for timeouts
> > and similar things which are handed in through these interfaces.
>
> Yep.  That sounds right.

Let's stay with the input event example for the moment: Here, we have a
character device, and a user calls read() to retrieve one or more records
of type 'struct input_event' using the evdev_read() function. The original
timestamp gets put there using this logic:

        ktime_t time;
        struct timespec64 ts;
        time = client->clk_type == EV_CLK_REAL ?
                        ktime_get_real() :
                        client->clk_type == EV_CLK_MONO ?
                                ktime_get() :
                                ktime_get_boottime();
        ts = ktime_to_timespec64(time);
        ev.input_event_sec = ts.tv_sec;
        ev.input_event_usec = ts.tv_nsec / NSEC_PER_USEC;

clk_type can get set using an ioctl() to real, monotonic or
boottime. We have to stop using EV_CLK_REAL in the
future because that breaks in y2038, but I guess EV_CLK_MONO
and EV_CLK_BOOK should stay.

If we want this to work correctly in a namespace that has a
user defined CLOCK_MONOTONIC timebase, one way to
do it might be to always call ktime_get() when we record
the timestamp in the kernel-internal CLOCK_MONOTONIC
base, but then convert it to the correct base when copying to
user space.

Note that AFAIU practically all users of evdev do /not/ actually
care about the time base, they only care about the elapsed
time between intervals, e.g. to track how fast a pointer should
move based on input from a trackpad. I don't see any reason
why one would compare this timestamp to a clock_gettime()
value, but of course at the moment this has well-defined
behavior that would break if we change clock_gettime(), and
we have a process in the namespace that opens
/dev/input/eventX and relies on meaningful timestamps
relative to a particular base.

       Arnd

Andrei Vagin Oct. 21, 2018, 1:41 a.m. UTC | #27

On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>    tick_do_update_jiffies64
> >>       update_wall_time
> >>           timekeeping_advance
> >>              timekeepging_update
> >> 
> >> If I read that properly under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >> 
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible,
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
> 
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around?  I have not dug
> deeply enough into the code to see that yet.
> 
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time, this is also affecting all timers which are armed from a
> > namespace.
> >
> > That gets really ugly because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel so any setting of the clock caused
> > a full walk of all armed timers, disarming, adjusting and requeing
> > them. That's horrible not only performance wise, it's also a locking
> > nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
> 
> Then it sounds like this will take some more digging.
> 
> Please pardon me for thinking out load.
> 
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
> 
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
> 
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
> 
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>    real time.
> 
>    The problem is that CLOCK_MONOTONIC between nodes in the cluster
>    has not relation ship to each other (except a synchronized length of
>    the second).  So applications that migrate can see CLOCK_MONOTONIC
>    and CLOCK_BOOTTIME go backwards.
> 
>    This is the truly pressing problem and adding some kind of offset
>    sounds like it would be the solution.  Possibly by allowing a boot
>    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> 
> 2) Dealing with two separate time management domains.  Say a machine
>    that needes to deal with both something inside of google where they
>    slew time to avoid leap time seconds and something in the outside
>    world proper UTC time is kept as an offset from TAI with the
>    occasional leap seconds.
> 
>    In the later case it would fundamentally require having seconds of
>    different length.
> 

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before stating this series, I was thinking about this, I decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Tomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this
series to make it ready to be merged?

I know that we need to:

* look at device drivers that report timestamps in CLOCK_MONOTONIC base.
* forbid changing offsets after creating timers

Anything else?

Thanks,
Andrei

> 
> A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> variables can be used to hold time, and everything can be converted from
> there.
> 
> This suggests we can for ticks have two values.
> - The number of ticks from the time source.
> - The number of times the ticks would have rolled over.
> 
> That sounds like it may be a little simplistic as it would require being
> very diligent about firing a timer exactly at rollover and not losing
> that, but for a handwaving argument is probably enough to generate
> a 64bit tick counter.
> 
> If the focus is on a 64bit tick counter then what update_wall_time
> has to do is very limited.  Just deal the accounting needed to cope with
> tick rollover.
> 
> Getting the actual time looks like it would be as simple as now, with
> perhaps an extra addition to account for the number of times the tick
> counter has rolled over.  With limited precision arithmetic and various
> optimizations I don't think it is that simple to implement but it feels
> like it should be very little extra work.
> 
> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then.   If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.
> 
> With the above I think it is theoretically possible to build a time
> namespace that supports multiple lengths of second, and does not have
> much overhead.
> 
> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to evolve
> the current code base into something that can efficiently handle
> multiple time domains with slightly different lenghts of second.
> 
> Thomas does it sound like I am completely out of touch with reality?
> 
> It does though sound like it is going to take some serious digging
> through the code to understand how what everything does and how and why
> everthing works the way it does.  Not something grafted on top with just
> a cursory understanding of how the code works.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Andrei Vagin Oct. 21, 2018, 3:54 a.m. UTC | #28

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

Thomas Gleixner Oct. 29, 2018, 8:33 p.m. UTC | #29

Andrei,

On Sat, 20 Oct 2018, Andrei Vagin wrote:
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.

and CLOCK_BOOTTIME and that's quite a few.

> * forbid changing offsets after creating timers

There are more things to think about. What about interfaces which expose
boot time or monotonic time in /proc?

Aside of that (I finally came around to look at the series in more detail)
I'm really unhappy about the unconditional overhead once the Time namespace
config switch is enabled. This applies especially to the VDSO. We spent
quite some time recently to squeeze a few cycles out of those functions and
it would be a pity to pointlessly waste cycles for the !namespace case.

I can see the urge for this, but please let us think it through properly
before rushing anything in which we are going to regret once we want to do
more sophisticated time domain management, e.g. support for isolated clock
real time. I'm worried, that without a clear plan about the overall
picture, we end up with duct tape which is hard to distangle after the
fact.

There have been a few other things brought up versus time management in
general, like the TSN folks utilizing grand clock masters which expose
random time instead of proper TAI. Plus some requirements for exposing some
sort of 'monotonic' clocks which are derived from external synchronization
mechanisms, but should not affect the regular time keeping clocks.

While different issues, these all fall into the category of separate time
domains, so taking a step back to the drawing board is probably the best
thing what we can do now.

There are certainly a few things which can be looked at independently,
e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
kernel with these name space functions applying offsets left and right. I
rather have dedicated core functionality which replaces/amends existing
timer functions to become time namespace aware.

I'll try to find some time in the next weeks to look deeper into that, but
I can't promise anything before returning from LPC. Btw, LPC would be a
great opportunity to discuss that. Are you and the other name space wizards
there by any chance?

Thanks,

	tglx

Eric W. Biederman Oct. 29, 2018, 9:21 p.m. UTC | #30

Thomas Gleixner <tglx@linutronix.de> writes:

> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>> 
>> Before stating this series, I was thinking about this, I decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>> 
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>> 
>> I know that we need to:
>> 
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

I will be and there are going to be both container and CRIU
mini-conferences.  So there should at least some of us around.

Eric

Thomas Gleixner Oct. 29, 2018, 9:36 p.m. UTC | #31

Eric,

On Mon, 29 Oct 2018, Eric W. Biederman wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> >
> > I'll try to find some time in the next weeks to look deeper into that, but
> > I can't promise anything before returning from LPC. Btw, LPC would be a
> > great opportunity to discuss that. Are you and the other name space wizards
> > there by any chance?
> 
> I will be and there are going to be both container and CRIU
> mini-conferences.  So there should at least some of us around.

So let's try to find a slot for a BOF or similar (there might be still
slots for the kernel summit available, i'll ask).

Thanks,

	tglx

Andrei Vagin Oct. 31, 2018, 4:26 p.m. UTC | #32

On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
> Andrei,
> 
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
> > When a container is migrated to another host, we have to restore its
> > monotonic and boottime clocks, but we still expect that the container
> > will continue using the host real-time clock.
> > 
> > Before stating this series, I was thinking about this, I decided that
> > these cases can be solved independently. Probably, the full isolation of
> > the time sub-system will have much higher overhead than just offsets for
> > a few clocks. And the idea that isolation of the real-time clock should
> > be optional gives us another hint that offsets for monotonic and
> > boot-time clocks can be implemented independently.
> > 
> > Eric and Tomas, what do you think about this? If you agree that these
> > two cases can be implemented separately, what should we do with this
> > series to make it ready to be merged?
> > 
> > I know that we need to:
> > 
> > * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> 
> and CLOCK_BOOTTIME and that's quite a few.
> 
> > * forbid changing offsets after creating timers
> 
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?

We didn't find any proc files where boot or monotonic time is reported,
but we will double check this.

> 
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.

It is a good point. We will work on it.

> 
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried, that without a clear plan about the overall
> picture, we end up with duct tape which is hard to distangle after the
> fact.

Thomas, there is no rush at all. This functionality is critical for
CRUI, but we have enough time to solve it properly.

The only thing what I want is that this functionality continues moving
forward and will not be put in the back burner.

> 
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
> 
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
> 
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
> 
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?

Dmitry and I are going to be there.

Thanks!
Andrei

> 
> Thanks,
> 
> 	tglx
> 
>

[RFC,00/20] ns: Introduce Time Namespace

Message

Comments