mbox series

[v5,0/6] arm64: add the time namespace support

Message ID 20200624083321.144975-1-avagin@gmail.com (mailing list archive)
Headers show
Series arm64: add the time namespace support | expand

Message

Andrei Vagin June 24, 2020, 8:33 a.m. UTC
Allocate the time namespace page among VVAR pages and add the logic
to handle faults on VVAR properly.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

v2: Code cleanups suggested by Vincenzo.
v3: add a comment in __arch_get_timens_vdso_data.
v4: - fix an issue reported by the lkp robot.
    - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
      timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
      simplifies criu/vdso migration between different kernel configs.
v5: - Code cleanups suggested by Mark Rutland.
    - In vdso_join_timens, mmap_write_lock is downgraded to
      mmap_read_lock. The VMA list isn't changed there, zap_page_range
      doesn't require mmap_write_lock.

Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Safonov <dima@arista.com>

v5 on github (if someone prefers `git pull` to `git am`):
https://github.com/avagin/linux-task-diag/tree/arm64/timens-v5

Andrei Vagin (6):
  arm64/vdso: use the fault callback to map vvar pages
  arm64/vdso: Zap vvar pages when switching to a time namespace
  arm64/vdso: Add time namespace page
  arm64/vdso: Handle faults on timens page
  arm64/vdso: Restrict splitting VVAR VMA
  arm64: enable time namespace support

--
2.24.1

Comments

Andrei Vagin July 5, 2020, 6:40 a.m. UTC | #1
On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> Allocate the time namespace page among VVAR pages and add the logic
> to handle faults on VVAR properly.
> 
> If a task belongs to a time namespace then the VVAR page which contains
> the system wide VDSO data is replaced with a namespace specific page
> which has the same layout as the VVAR page. That page has vdso_data->seq
> set to 1 to enforce the slow path and vdso_data->clock_mode set to
> VCLOCK_TIMENS to enforce the time namespace handling path.
> 
> The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> update of the VDSO data is in progress, is not really affecting regular
> tasks which are not part of a time namespace as the task is spin waiting
> for the update to finish and vdso_data->seq to become even again.
> 
> If a time namespace task hits that code path, it invokes the corresponding
> time getter function which retrieves the real VVAR page, reads host time
> and then adds the offset for the requested clock which is stored in the
> special VVAR page.
> 

> v2: Code cleanups suggested by Vincenzo.
> v3: add a comment in __arch_get_timens_vdso_data.
> v4: - fix an issue reported by the lkp robot.
>     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
>       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
>       simplifies criu/vdso migration between different kernel configs.
> v5: - Code cleanups suggested by Mark Rutland.
>     - In vdso_join_timens, mmap_write_lock is downgraded to
>       mmap_read_lock. The VMA list isn't changed there, zap_page_range
>       doesn't require mmap_write_lock.
> 
> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Reviewed-by: Dmitry Safonov <dima@arista.com>

Hello Will and Catalin,

Have you had a chance to look at this patch set? I think it is ready to be
merged. Let me know if you have any questions.

Thanks,
Andrei
Andrei Vagin July 14, 2020, 1:57 a.m. UTC | #2
On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > Allocate the time namespace page among VVAR pages and add the logic
> > to handle faults on VVAR properly.
> > 
> > If a task belongs to a time namespace then the VVAR page which contains
> > the system wide VDSO data is replaced with a namespace specific page
> > which has the same layout as the VVAR page. That page has vdso_data->seq
> > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > VCLOCK_TIMENS to enforce the time namespace handling path.
> > 
> > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > update of the VDSO data is in progress, is not really affecting regular
> > tasks which are not part of a time namespace as the task is spin waiting
> > for the update to finish and vdso_data->seq to become even again.
> > 
> > If a time namespace task hits that code path, it invokes the corresponding
> > time getter function which retrieves the real VVAR page, reads host time
> > and then adds the offset for the requested clock which is stored in the
> > special VVAR page.
> > 
> 
> > v2: Code cleanups suggested by Vincenzo.
> > v3: add a comment in __arch_get_timens_vdso_data.
> > v4: - fix an issue reported by the lkp robot.
> >     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> >       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> >       simplifies criu/vdso migration between different kernel configs.
> > v5: - Code cleanups suggested by Mark Rutland.
> >     - In vdso_join_timens, mmap_write_lock is downgraded to
> >       mmap_read_lock. The VMA list isn't changed there, zap_page_range
> >       doesn't require mmap_write_lock.
> > 
> > Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > Reviewed-by: Dmitry Safonov <dima@arista.com>
> 
> Hello Will and Catalin,
> 
> Have you had a chance to look at this patch set? I think it is ready to be
> merged. Let me know if you have any questions.

*friendly ping*

If I am doing something wrong, let me know.

> 
> Thanks,
> Andrei
>
Catalin Marinas July 22, 2020, 6:15 p.m. UTC | #3
On Mon, Jul 13, 2020 at 06:57:43PM -0700, Andrei Vagin wrote:
> On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> > On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > > Allocate the time namespace page among VVAR pages and add the logic
> > > to handle faults on VVAR properly.
> > > 
> > > If a task belongs to a time namespace then the VVAR page which contains
> > > the system wide VDSO data is replaced with a namespace specific page
> > > which has the same layout as the VVAR page. That page has vdso_data->seq
> > > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > > VCLOCK_TIMENS to enforce the time namespace handling path.
> > > 
> > > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > > update of the VDSO data is in progress, is not really affecting regular
> > > tasks which are not part of a time namespace as the task is spin waiting
> > > for the update to finish and vdso_data->seq to become even again.
> > > 
> > > If a time namespace task hits that code path, it invokes the corresponding
> > > time getter function which retrieves the real VVAR page, reads host time
> > > and then adds the offset for the requested clock which is stored in the
> > > special VVAR page.
> > > 
> > 
> > > v2: Code cleanups suggested by Vincenzo.
> > > v3: add a comment in __arch_get_timens_vdso_data.
> > > v4: - fix an issue reported by the lkp robot.
> > >     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> > >       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> > >       simplifies criu/vdso migration between different kernel configs.
> > > v5: - Code cleanups suggested by Mark Rutland.
> > >     - In vdso_join_timens, mmap_write_lock is downgraded to
> > >       mmap_read_lock. The VMA list isn't changed there, zap_page_range
> > >       doesn't require mmap_write_lock.
> > > 
> > > Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > > Reviewed-by: Dmitry Safonov <dima@arista.com>
> > 
> > Hello Will and Catalin,
> > 
> > Have you had a chance to look at this patch set? I think it is ready to be
> > merged. Let me know if you have any questions.
> 
> *friendly ping*
> 
> If I am doing something wrong, let me know.

Not really, just haven't got around to looking into it. Mark Rutland
raised a concern (in private) about the safety of multithreaded apps
but I think you already replied that timens_install() checks for this
already [1].

Maybe a similar atomicity issue to the one raised by Mark but for
single-threaded processes: the thread is executing vdso code, gets
interrupted and a signal handler invokes setns(). Would resuming the
execution in the vdso code on sigreturn cause any issues?

[1] https://lore.kernel.org/linux-arm-kernel/d16b5cd1-bdb1-5667-fbda-c622cc795389@arista.com/
Andrei Vagin July 23, 2020, 5:41 p.m. UTC | #4
On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2020 at 06:57:43PM -0700, Andrei Vagin wrote:
> > On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> > > On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > > > Allocate the time namespace page among VVAR pages and add the logic
> > > > to handle faults on VVAR properly.
> > > > 
> > > > If a task belongs to a time namespace then the VVAR page which contains
> > > > the system wide VDSO data is replaced with a namespace specific page
> > > > which has the same layout as the VVAR page. That page has vdso_data->seq
> > > > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > > > VCLOCK_TIMENS to enforce the time namespace handling path.
> > > > 
> > > > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > > > update of the VDSO data is in progress, is not really affecting regular
> > > > tasks which are not part of a time namespace as the task is spin waiting
> > > > for the update to finish and vdso_data->seq to become even again.
> > > > 
> > > > If a time namespace task hits that code path, it invokes the corresponding
> > > > time getter function which retrieves the real VVAR page, reads host time
> > > > and then adds the offset for the requested clock which is stored in the
> > > > special VVAR page.
> > > > 
> > > 
> > > > v2: Code cleanups suggested by Vincenzo.
> > > > v3: add a comment in __arch_get_timens_vdso_data.
> > > > v4: - fix an issue reported by the lkp robot.
> > > >     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> > > >       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> > > >       simplifies criu/vdso migration between different kernel configs.
> > > > v5: - Code cleanups suggested by Mark Rutland.
> > > >     - In vdso_join_timens, mmap_write_lock is downgraded to
> > > >       mmap_read_lock. The VMA list isn't changed there, zap_page_range
> > > >       doesn't require mmap_write_lock.
> > > > 
> > > > Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > > > Reviewed-by: Dmitry Safonov <dima@arista.com>
> > > 
> > > Hello Will and Catalin,
> > > 
> > > Have you had a chance to look at this patch set? I think it is ready to be
> > > merged. Let me know if you have any questions.
> > 
> > *friendly ping*
> > 
> > If I am doing something wrong, let me know.
> 
> Not really, just haven't got around to looking into it. Mark Rutland
> raised a concern (in private) about the safety of multithreaded apps
> but I think you already replied that timens_install() checks for this
> already [1].
> 
> Maybe a similar atomicity issue to the one raised by Mark but for
> single-threaded processes: the thread is executing vdso code, gets
> interrupted and a signal handler invokes setns(). Would resuming the
> execution in the vdso code on sigreturn cause any issues?

It will not cause any issues in the kernel. In the userspace,
clock_gettime() can return a clock value with an inconsistent offset, if
a process switches between two non-root namespaces. And it can triggers
SIGSEGV if it switches from a non-root to the root time namespace,
because a time namespace isn't mapped in the root time namespace.

I don't think that we need to handle this case in the kernel. Users
must understand what they are doing and have to write code so that avoid
these sort of situations. In general, I would say that in most cases it
is a bad idea to call setns from a signal handler.

Thanks,
Andrei

> 
> [1] https://lore.kernel.org/linux-arm-kernel/d16b5cd1-bdb1-5667-fbda-c622cc795389@arista.com/
> 
> -- 
> Catalin
Thomas Gleixner July 23, 2020, 8:25 p.m. UTC | #5
Andrei Vagin <avagin@gmail.com> writes:
> On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
>
> I don't think that we need to handle this case in the kernel. Users
> must understand what they are doing and have to write code so that avoid
> these sort of situations. In general, I would say that in most cases it
> is a bad idea to call setns from a signal handler.

This should not be supported in the first place and just let the
offender die right there.

Thanks,

        tglx
Catalin Marinas July 24, 2020, 11:01 a.m. UTC | #6
On Thu, Jul 23, 2020 at 10:25:35PM +0200, Thomas Gleixner wrote:
> Andrei Vagin <avagin@gmail.com> writes:
> > On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
> >
> > I don't think that we need to handle this case in the kernel. Users
> > must understand what they are doing and have to write code so that avoid
> > these sort of situations. In general, I would say that in most cases it
> > is a bad idea to call setns from a signal handler.
> 
> This should not be supported in the first place and just let the
> offender die right there.

It would have been nice if we caught the offender easily but since
signal handling doesn't have to be paired with sigreturn(), the kernel
can't tell whether setns() is called in the wrong context. I guess we
just have to live with this (maybe document the restriction in
time_namespaces(7) or setns(2)).
Thomas Gleixner July 24, 2020, 1:22 p.m. UTC | #7
Catalin Marinas <catalin.marinas@arm.com> writes:
> On Thu, Jul 23, 2020 at 10:25:35PM +0200, Thomas Gleixner wrote:
>> Andrei Vagin <avagin@gmail.com> writes:
>> > On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
>> >
>> > I don't think that we need to handle this case in the kernel. Users
>> > must understand what they are doing and have to write code so that avoid
>> > these sort of situations. In general, I would say that in most cases it
>> > is a bad idea to call setns from a signal handler.
>> 
>> This should not be supported in the first place and just let the
>> offender die right there.
>
> It would have been nice if we caught the offender easily but since
> signal handling doesn't have to be paired with sigreturn(), the kernel
> can't tell whether setns() is called in the wrong context. I guess we
> just have to live with this (maybe document the restriction in
> time_namespaces(7) or setns(2)).

Yes, I know that it's more or less impossible. The 'should' was just
wishful thinking :)

But yes, proper documentation is required.

Thanks,

        tglx
Christian Brauner July 24, 2020, 1:30 p.m. UTC | #8
On Thu, Jul 23, 2020 at 10:41:40AM -0700, Andrei Vagin wrote:
> On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
> > On Mon, Jul 13, 2020 at 06:57:43PM -0700, Andrei Vagin wrote:
> > > On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> > > > On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > > > > Allocate the time namespace page among VVAR pages and add the logic
> > > > > to handle faults on VVAR properly.
> > > > > 
> > > > > If a task belongs to a time namespace then the VVAR page which contains
> > > > > the system wide VDSO data is replaced with a namespace specific page
> > > > > which has the same layout as the VVAR page. That page has vdso_data->seq
> > > > > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > > > > VCLOCK_TIMENS to enforce the time namespace handling path.
> > > > > 
> > > > > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > > > > update of the VDSO data is in progress, is not really affecting regular
> > > > > tasks which are not part of a time namespace as the task is spin waiting
> > > > > for the update to finish and vdso_data->seq to become even again.
> > > > > 
> > > > > If a time namespace task hits that code path, it invokes the corresponding
> > > > > time getter function which retrieves the real VVAR page, reads host time
> > > > > and then adds the offset for the requested clock which is stored in the
> > > > > special VVAR page.
> > > > > 
> > > > 
> > > > > v2: Code cleanups suggested by Vincenzo.
> > > > > v3: add a comment in __arch_get_timens_vdso_data.
> > > > > v4: - fix an issue reported by the lkp robot.
> > > > >     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> > > > >       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> > > > >       simplifies criu/vdso migration between different kernel configs.
> > > > > v5: - Code cleanups suggested by Mark Rutland.
> > > > >     - In vdso_join_timens, mmap_write_lock is downgraded to
> > > > >       mmap_read_lock. The VMA list isn't changed there, zap_page_range
> > > > >       doesn't require mmap_write_lock.
> > > > > 
> > > > > Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > > > > Reviewed-by: Dmitry Safonov <dima@arista.com>
> > > > 
> > > > Hello Will and Catalin,
> > > > 
> > > > Have you had a chance to look at this patch set? I think it is ready to be
> > > > merged. Let me know if you have any questions.
> > > 
> > > *friendly ping*
> > > 
> > > If I am doing something wrong, let me know.
> > 
> > Not really, just haven't got around to looking into it. Mark Rutland
> > raised a concern (in private) about the safety of multithreaded apps
> > but I think you already replied that timens_install() checks for this
> > already [1].
> > 
> > Maybe a similar atomicity issue to the one raised by Mark but for
> > single-threaded processes: the thread is executing vdso code, gets
> > interrupted and a signal handler invokes setns(). Would resuming the
> > execution in the vdso code on sigreturn cause any issues?
> 
> It will not cause any issues in the kernel. In the userspace,
> clock_gettime() can return a clock value with an inconsistent offset, if
> a process switches between two non-root namespaces. And it can triggers
> SIGSEGV if it switches from a non-root to the root time namespace,
> because a time namespace isn't mapped in the root time namespace.
> 
> I don't think that we need to handle this case in the kernel. Users
> must understand what they are doing and have to write code so that avoid
> these sort of situations. In general, I would say that in most cases it
> is a bad idea to call setns from a signal handler.

I would argue that calling any function not in the list of
man 7 signal-safety
without checking the kernel implementation is "you get to keep the
pieces territory". There's a whole range of syscalls that are not safe
in signal handlers and we don't have any special precautions for them so
I'm not sure we'd need one for setns(). But maybe I'm missing the bigger
picture here.

Thanks!
Christian
Catalin Marinas July 24, 2020, 2:40 p.m. UTC | #9
On Fri, Jul 24, 2020 at 03:30:39PM +0200, Christian Brauner wrote:
> On Thu, Jul 23, 2020 at 10:41:40AM -0700, Andrei Vagin wrote:
> > On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
> > > On Mon, Jul 13, 2020 at 06:57:43PM -0700, Andrei Vagin wrote:
> > > > On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> > > > > On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > > > > > Allocate the time namespace page among VVAR pages and add the logic
> > > > > > to handle faults on VVAR properly.
> > > > > > 
> > > > > > If a task belongs to a time namespace then the VVAR page which contains
> > > > > > the system wide VDSO data is replaced with a namespace specific page
> > > > > > which has the same layout as the VVAR page. That page has vdso_data->seq
> > > > > > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > > > > > VCLOCK_TIMENS to enforce the time namespace handling path.
> > > > > > 
> > > > > > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > > > > > update of the VDSO data is in progress, is not really affecting regular
> > > > > > tasks which are not part of a time namespace as the task is spin waiting
> > > > > > for the update to finish and vdso_data->seq to become even again.
> > > > > > 
> > > > > > If a time namespace task hits that code path, it invokes the corresponding
> > > > > > time getter function which retrieves the real VVAR page, reads host time
> > > > > > and then adds the offset for the requested clock which is stored in the
> > > > > > special VVAR page.
> > > > > > 
> > > > > 
> > > > > > v2: Code cleanups suggested by Vincenzo.
> > > > > > v3: add a comment in __arch_get_timens_vdso_data.
> > > > > > v4: - fix an issue reported by the lkp robot.
> > > > > >     - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> > > > > >       timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> > > > > >       simplifies criu/vdso migration between different kernel configs.
> > > > > > v5: - Code cleanups suggested by Mark Rutland.
> > > > > >     - In vdso_join_timens, mmap_write_lock is downgraded to
> > > > > >       mmap_read_lock. The VMA list isn't changed there, zap_page_range
> > > > > >       doesn't require mmap_write_lock.
> > > > > > 
> > > > > > Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> > > > > > Reviewed-by: Dmitry Safonov <dima@arista.com>
> > > > > 
> > > > > Hello Will and Catalin,
> > > > > 
> > > > > Have you had a chance to look at this patch set? I think it is ready to be
> > > > > merged. Let me know if you have any questions.
> > > > 
> > > > *friendly ping*
> > > > 
> > > > If I am doing something wrong, let me know.
> > > 
> > > Not really, just haven't got around to looking into it. Mark Rutland
> > > raised a concern (in private) about the safety of multithreaded apps
> > > but I think you already replied that timens_install() checks for this
> > > already [1].
> > > 
> > > Maybe a similar atomicity issue to the one raised by Mark but for
> > > single-threaded processes: the thread is executing vdso code, gets
> > > interrupted and a signal handler invokes setns(). Would resuming the
> > > execution in the vdso code on sigreturn cause any issues?
> > 
> > It will not cause any issues in the kernel. In the userspace,
> > clock_gettime() can return a clock value with an inconsistent offset, if
> > a process switches between two non-root namespaces. And it can triggers
> > SIGSEGV if it switches from a non-root to the root time namespace,
> > because a time namespace isn't mapped in the root time namespace.
> > 
> > I don't think that we need to handle this case in the kernel. Users
> > must understand what they are doing and have to write code so that avoid
> > these sort of situations. In general, I would say that in most cases it
> > is a bad idea to call setns from a signal handler.
> 
> I would argue that calling any function not in the list of
> man 7 signal-safety
> without checking the kernel implementation is "you get to keep the
> pieces territory". There's a whole range of syscalls that are not safe
> in signal handlers and we don't have any special precautions for them so
> I'm not sure we'd need one for setns(). But maybe I'm missing the bigger
> picture here.

Good point (I don't read man pages very often ;)). Thanks for
clarifying.