Message ID: 20200408050323.4237-1-ankur.a.arora@oracle.com
Series: Runtime paravirt patching
On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:

> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> to switch pv-ops based on that.

So what, the paravirt spinlock stuff works just fine when you're not
oversubscribed.

> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
> but PV-TLB ops are also good candidates)

The PV-TLB ops also work just fine when not oversubscribed. IIRC
kvm_flush_tlb_others() is pretty much the same in that case.

> in .parainstructions.runtime, while discarding the .parainstructions
> as usual at init. This is then used for switching back and forth
> between native and paravirt mode. ([1] lists some representative
> numbers of the increased memory footprint.)
>
> Mechanism: the patching itself is done using stop_machine(). That is
> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
> via text_poke_bp(), but I'm using this to address two issues:
> 1) emulation in text_poke() can only easily handle a small set
> of instructions and this is problematic for inlined pv-ops (and see
> a possible alternatives use-case below.)
> 2) paravirt patching might have inter-dependent ops (ex.
> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
> need to be updated atomically.)

And then you hope that the spinlock state transfers.. That is, that
both implementations agree what an unlocked spinlock looks like.

Suppose the native one was a ticket spinlock, where unlocked means
'head == tail', while the paravirt one is a test-and-set spinlock,
where unlocked means 'val == 0'.

That just happens to not be the case now, but it was for a fair while.

> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.

The whole late-microcode loading stuff is crazy already; you're making
it take double bonghits.

> Also, there are points of similarity with the ongoing static_call work
> which does rewriting of indirect calls.

Only insofar as code patching is involved. An analogy would be
comparing having a beer with shooting dope. They're both 'drugs'.

> The difference here is that we need to switch a group of calls
> atomically and, given that some of them can be inlined, need to
> handle a wider variety of opcodes.
>
> To patch safely we need to satisfy these constraints:
>
> - No references to insn sequences under replacement on any kernel
>   stack once replacement is in progress. Without this constraint we
>   might end up returning to an address that is in the middle of an
>   instruction.

Both ftrace and optprobes have that issue, and neither of them is
quite as crazy as this.

> - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
>   lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
>   a good example.

While I'm sure this is a fun problem, why are we solving it?

> - handle a broader set of insns than CALL and JMP: some pv-ops end up
>   getting inlined. Alternatives can contain arbitrary instructions.

So can optprobes.
> - locking operations can be called from interrupt handlers which
>   means we cannot trivially use IPIs for flushing.

Heck, some NMI handlers use locks..

> Handling these necessitates that target pv-ops not be preemptible.

I don't think that is a correct inference.

> Once that is a given (for safety these need to be explicitly
> whitelisted in runtime_patch()), use a state-machine with the primary
> CPU doing the patching and secondary CPUs in a sync_core() loop.
>
> In case we hit an INT3/BP (in NMI or thread-context) we make forward
> progress by continuing the patching instead of emulating.
>
> One remaining issue is inter-dependent pv-ops which are also executed
> in the NMI handler -- patching can potentially deadlock in case of
> multiple NMIs. Handle these by pushing some of this work into the NMI
> handler where we know it will be uninterrupted.

I'm just seeing a lot of bonghits without sane rationale. Why is any of
this important?
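[ To make the state-transfer concern above concrete: a minimal sketch,
using hypothetical simplified lock layouts rather than the kernel's
real arch_spinlock_t definitions, of why both implementations must
agree on what an unlocked lock looks like. ]

#include <stdint.h>
#include <stdbool.h>

union demo_lock {
	struct {			/* ticket-lock view */
		uint16_t head;
		uint16_t tail;
	} ticket;
	uint32_t val;			/* test-and-set view */
};

/* Ticket lock: unlocked means head == tail. */
static bool ticket_is_unlocked(union demo_lock *l)
{
	return l->ticket.head == l->ticket.tail;
}

/* Test-and-set lock: unlocked means val == 0. */
static bool tas_is_unlocked(union demo_lock *l)
{
	return l->val == 0;
}

int main(void)
{
	/* A ticket lock that has cycled a few times: head == tail == 5. */
	union demo_lock l = { .ticket = { .head = 5, .tail = 5 } };

	/*
	 * Unlocked by the ticket convention, but val == 0x00050005, so
	 * *locked* by the test-and-set convention: patching from one
	 * implementation to the other while such a lock exists would
	 * deadlock the next acquirer.
	 */
	return (ticket_is_unlocked(&l) && !tas_is_unlocked(&l)) ? 0 : 1;
}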
On 08.04.20 07:02, Ankur Arora wrote:

> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the

Then this hint is wrong if it can't be guaranteed.

> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> to switch pv-ops based on that.

I think using pvops for such a feature change is just wrong. What comes
next? Using pvops for being able to migrate a guest from an Intel to an
AMD machine?

...

> There are four main sets of patches in this series:
>
> 1. PV-ops management (patches 1-10, 20): mostly infrastructure and
> refactoring pieces to make paravirt patching usable at runtime. For
> the most part scoped under CONFIG_PARAVIRT_RUNTIME.
>
> Patches 1-7, to persist part of parainstructions in memory:
>   "x86/paravirt: Specify subsection in PVOP macros"
>   "x86/paravirt: Allow paravirt patching post-init"
>   "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
>   "x86/alternatives: Refactor alternatives_smp_module*"
>   "x86/alternatives: Rename alternatives_smp*, smp_alt_module"
>   "x86/alternatives: Remove stale symbols"
>   "x86/paravirt: Persist .parainstructions.runtime"
>
> Patches 8-10, develop the interfaces to safely switch pv-ops:
>   "x86/paravirt: Stash native pv-ops"
>   "x86/paravirt: Add runtime_patch()"
>   "x86/paravirt: Add primitives to stage pv-ops"
>
> Patch 20 enables switching of pv_lock_ops:
>   "x86/paravirt: Enable pv-spinlocks in runtime_patch()"
>
> 2. Non-emulated text poking (patches 11-19)
>
> Patches 11-13 are mostly refactoring to split __text_poke() into map,
> unmap and poke/memcpy phases with the poke portion being re-entrant:
>   "x86/alternatives: Remove return value of text_poke*()"
>   "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
>   "x86/alternatives: Split __text_poke()"
>
> Patches 15, 17 add the actual poking state-machine:
>   "x86/alternatives: Non-emulated text poking"
>   "x86/alternatives: Add patching logic in text_poke_site()"
>
> with patches 14 and 18 containing the pieces for BP handling:
>   "x86/alternatives: Handle native insns in text_poke_loc*()"
>   "x86/alternatives: Handle BP in non-emulated text poking"
>
> and patch 19 provides the ability to use the state-machine above in
> an NMI context (fixes some potential deadlocks when handling inter-
> dependent operations and multiple NMIs):
>   "x86/alternatives: NMI safe runtime patching".
>
> Patch 16 provides the interface (paravirt_runtime_patch()) to use the
> poking mechanism developed above and patch 21 adds a selftest:
>   "x86/alternatives: Add paravirt patching at runtime"
>   "x86/alternatives: Paravirt runtime selftest"
>
> 3. KVM guest changes to be able to use this (patches 22-23, 25-26):
>   "kvm/paravirt: Encapsulate KVM pv switching logic"
>   "x86/kvm: Add worker to trigger runtime patching"
>   "x86/kvm: Guest support for dynamic hints"
>   "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".
>
> 4. KVM host changes to notify the guest of a change (patch 24):
>   "x86/kvm: Support dynamic CPUID hints"
>
> Testing:
> With paravirt patching, the code is mostly stable on Intel and AMD
> systems under kernbench and locktorture with paravirt toggling (with,
> without synthetic NMIs) in the background.
> Queued spinlock performance for locktorture is also on expected lines:
>
>  [ 1533.221563] Writes: Total: 1048759000 Max/Min: 0/0 Fail: 0
>  # toggle PV spinlocks
>
>  [ 1594.713699] Writes: Total: 1111660545 Max/Min: 0/0 Fail: 0
>  # PV spinlocks (in ~60 seconds) = 62,901,545
>
>  # toggle native spinlocks
>  [ 1656.117175] Writes: Total: 1113888840 Max/Min: 0/0 Fail: 0
>  # native spinlocks (in ~60 seconds) = 2,228,295
>
> The alternatives testing is more limited with it being used to rewrite
> mostly harmless X86_FEATUREs with load in the background.
>
> Patches also at:
>
> ssh://git@github.com/terminus/linux.git alternatives-rfc-upstream-v1
>
> Please review.
>
> Thanks
> Ankur
>
> [1] The precise change in memory footprint depends on config options
> but the following example inlines queued_spin_unlock() (which forms
> the bulk of the added state). The added footprint is the size of the
> .parainstructions.runtime section:
>
>  $ objdump -h vmlinux | grep .parainstructions
>  Idx Name                       Size     VMA              LMA              File-off Algn
>   27 .parainstructions          0001013c ffffffff82895000 0000000002895000 01c95000 2**3
>   28 .parainstructions.runtime  0000cd2c ffffffff828a5140 00000000028a5140 01ca5140 2**3
>
>  $ size vmlinux
>      text     data      bss      dec     hex filename
>  13726196 12302814 14094336 40123346 2643bd2 vmlinux
>
> Ankur Arora (26):
>   x86/paravirt: Specify subsection in PVOP macros
>   x86/paravirt: Allow paravirt patching post-init
>   x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
>   x86/alternatives: Refactor alternatives_smp_module*
>   x86/alternatives: Rename alternatives_smp*, smp_alt_module
>   x86/alternatives: Remove stale symbols
>   x86/paravirt: Persist .parainstructions.runtime
>   x86/paravirt: Stash native pv-ops
>   x86/paravirt: Add runtime_patch()
>   x86/paravirt: Add primitives to stage pv-ops
>   x86/alternatives: Remove return value of text_poke*()
>   x86/alternatives: Use __get_unlocked_pte() in text_poke()
>   x86/alternatives: Split __text_poke()
>   x86/alternatives: Handle native insns in text_poke_loc*()
>   x86/alternatives: Non-emulated text poking
>   x86/alternatives: Add paravirt patching at runtime
>   x86/alternatives: Add patching logic in text_poke_site()
>   x86/alternatives: Handle BP in non-emulated text poking
>   x86/alternatives: NMI safe runtime patching
>   x86/paravirt: Enable pv-spinlocks in runtime_patch()
>   x86/alternatives: Paravirt runtime selftest
>   kvm/paravirt: Encapsulate KVM pv switching logic
>   x86/kvm: Add worker to trigger runtime patching
>   x86/kvm: Support dynamic CPUID hints
>   x86/kvm: Guest support for dynamic hints
>   x86/kvm: Add hint change notifier for KVM_HINT_REALTIME
>
>  Documentation/virt/kvm/api.rst        |  17 +
>  Documentation/virt/kvm/cpuid.rst      |   9 +-
>  arch/x86/Kconfig                      |  14 +
>  arch/x86/Kconfig.debug                |  13 +
>  arch/x86/entry/entry_64.S             |   5 +
>  arch/x86/include/asm/alternative.h    |  20 +-
>  arch/x86/include/asm/kvm_host.h       |   6 +
>  arch/x86/include/asm/kvm_para.h       |  17 +
>  arch/x86/include/asm/paravirt.h       |  10 +-
>  arch/x86/include/asm/paravirt_types.h | 230 ++++--
>  arch/x86/include/asm/text-patching.h  |  18 +-
>  arch/x86/include/uapi/asm/kvm_para.h  |   2 +
>  arch/x86/kernel/Makefile              |   1 +
>  arch/x86/kernel/alternative.c         | 987 +++++++++++++++++++++++---
>  arch/x86/kernel/kvm.c                 | 191 ++++-
>  arch/x86/kernel/module.c              |  42 +-
>  arch/x86/kernel/paravirt.c            |  16 +-
>  arch/x86/kernel/paravirt_patch.c      |  61 ++
>  arch/x86/kernel/pv_selftest.c         | 264 +++++++
>  arch/x86/kernel/pv_selftest.h         |  15 +
>  arch/x86/kernel/setup.c               |   2 +
>  arch/x86/kernel/vmlinux.lds.S         |  16 +
>  arch/x86/kvm/cpuid.c                  |   3 +-
>  arch/x86/kvm/x86.c                    |  39 +
>  include/asm-generic/kvm_para.h        |  12 +
>  include/asm-generic/vmlinux.lds.h     |   8 +
>  include/linux/kvm_para.h              |   5 +
>  include/linux/mm.h                    |  16 +-
>  include/linux/preempt.h               |  17 +
>  include/uapi/linux/kvm.h              |   4 +
>  kernel/locking/lock_events.c          |   2 +-
>  mm/memory.c                           |   9 +-
>  32 files changed, 1850 insertions(+), 221 deletions(-)
>  create mode 100644 arch/x86/kernel/pv_selftest.c
>  create mode 100644 arch/x86/kernel/pv_selftest.h

Quite a lot of code churn and hacks for a problem which should not
occur on a well-administered machine. Especially the NMI dependencies
make me not want to Ack this series.


Juergen
On 08.04.20 14:08, Peter Zijlstra wrote:

> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>> 1) emulation in text_poke() can only easily handle a small set
>> of instructions and this is problematic for inlined pv-ops (and see
>> a possible alternatives use-case below.)
>> 2) paravirt patching might have inter-dependent ops (ex.
>> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>> need to be updated atomically.)
>
> And then you hope that the spinlock state transfers.. That is, that
> both implementations agree what an unlocked spinlock looks like.
>
> Suppose the native one was a ticket spinlock, where unlocked means
> 'head == tail', while the paravirt one is a test-and-set spinlock,
> where unlocked means 'val == 0'.
>
> That just happens to not be the case now, but it was for a fair while.

Sure? This would mean that before the spinlock pv-ops are set no lock
would be allowed to be used in the kernel, because that would block the
boot-time transition to the lock variant to be used.

Another problem I'm seeing is that runtime pvops patching would rely on
the fact that stop_machine() isn't guarded by a spinlock.


Juergen
Ankur Arora <ankur.a.arora@oracle.com> writes:

> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> to switch pv-ops based on that.

If your host changes its advertised behaviour then you want to fix the
host setup or find a competent admin.

> This locktorture splat that I saw on the guest while testing this is
> indicative of the problem:
>
>  [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
>  [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
>  [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>
> (Caused by an oversubscribed host but using mismatched native
> pv_lock_ops on the guest.)

And this illustrates what? The fact that you used a misconfigured setup.

> This series addresses the problem by doing paravirt switching at
> runtime.

You're not addressing the problem. You're fixing the symptom, which is
wrong to begin with.

> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.

This has been discussed to death before and there is no safe subset as
long as this hasn't been resolved:

https://lore.kernel.org/lkml/alpine.DEB.2.21.1909062237580.1902@nanos.tec.linutronix.de/

Thanks,

tglx
On Wed, Apr 08, 2020 at 03:33:52PM +0200, Jürgen Groß wrote:

> On 08.04.20 14:08, Peter Zijlstra wrote:
>> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>>> Mechanism: the patching itself is done using stop_machine(). That is
>>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>>> via text_poke_bp(), but I'm using this to address two issues:
>>> 1) emulation in text_poke() can only easily handle a small set
>>> of instructions and this is problematic for inlined pv-ops (and see
>>> a possible alternatives use-case below.)
>>> 2) paravirt patching might have inter-dependent ops (ex.
>>> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>>> need to be updated atomically.)
>>
>> And then you hope that the spinlock state transfers.. That is, that
>> both implementations agree what an unlocked spinlock looks like.
>>
>> Suppose the native one was a ticket spinlock, where unlocked means
>> 'head == tail', while the paravirt one is a test-and-set spinlock,
>> where unlocked means 'val == 0'.
>>
>> That just happens to not be the case now, but it was for a fair while.
>
> Sure? This would mean that before the spinlock pv-ops are set no lock
> would be allowed to be used in the kernel, because that would block the
> boot-time transition to the lock variant to be used.

Hurm.. true. I suppose I completely forgot how paravirt spinlocks
looked before it got rewritten.

> Another problem I'm seeing is that runtime pvops patching would rely on
> the fact that stop_machine() isn't guarded by a spinlock.

It can't be; stop_machine() relies on scheduling. But yes, that's
another variation of 'stuff uses spinlocks'.
So, first, thanks for the quick comments even though some of my choices
were straight NAKs (or maybe because of that!)

Second, I clearly did a bad job of motivating the series. Let me try to
address the motivation comments first and then I can address the
technical concerns separately.

[ I'm collating all the motivation comments below. ]

>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the

Thomas> If your host changes its advertised behaviour then you want to
Thomas> fix the host setup or find a competent admin.

Juergen> Then this hint is wrong if it can't be guaranteed.

I agree, the hint behaviour is wrong and the host shouldn't be giving
hints it can only temporarily honor.

The host problem is hard to fix though: the behaviour change is either
because of a guest migration or, in case of a hosted guest, cloud
economics -- customers want to go to a 2-1 or worse VCPU-CPU ratio at
times of low load.

I had an offline discussion with Paolo Bonzini where he agreed that it
makes sense to make KVM_HINTS_REALTIME a dynamic hint rather than
static as it is now. (That was really the starting point for this
series.)

>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> to switch pv-ops based on that.

Juergen> I think using pvops for such a feature change is just wrong.
Juergen> What comes next? Using pvops for being able to migrate a guest
Juergen> from an Intel to an AMD machine?

My statement about switching pv-ops was too broadly worded. What I
meant to say was that KVM guests choose pv_lock_ops to be native or
paravirt based on the undersubscribed/oversubscribed hint at boot, and
this choice should be available at run-time as well.

KVM chooses between native/paravirt spinlocks at boot based on this
reasoning (from commit b2798ba0b8):

 "Waiman Long mentioned that:
  > Generally speaking, unfair lock performs well for VMs with a small
  > number of vCPUs. Native qspinlock may perform better than
  > pvqspinlock if there is vCPU pinning and there is no vCPU
  > over-commitment."

PeterZ> So what, the paravirt spinlock stuff works just fine when
PeterZ> you're not oversubscribed.

Yeah, the paravirt spinlocks work fine for both under- and
oversubscribed hosts, but they are more expensive and that extra cost
provides no benefits when CPUs are pinned. For instance, pv-queued
spin_unlock() is a CALL plus a LOCK'd CMPXCHG as opposed to just a
movb $0, (%rdi).

This difference shows up in kernbench running on a KVM guest with
native and paravirt spinlocks. I ran with 8 and 64 CPU guests with
CPUs pinned. The native version performs the same or better:

 8 CPU        Native (std-dev)    Paravirt (std-dev)
              -----------------   ------------------
 -j 4:   sys  151.89  ( 0.2462)   160.14  ( 4.8366)   +5.4%
 -j 32:  sys  162.715 (11.4129)   170.225 (11.1138)   +4.6%
 -j 0:   sys  164.193 ( 9.4063)   170.843 ( 8.9651)   +4.0%

 64 CPU       Native (std-dev)    Paravirt (std-dev)
              -----------------   ------------------
 -j 32:  sys  209.448 ( 0.37009)  210.976 ( 0.4245)   +0.7%
 -j 256: sys  267.401 (61.0928)   285.73  (78.8021)   +6.8%
 -j 0:   sys  286.313 (56.5978)   307.721 (70.9758)   +7.4%

In all cases the pv_kick, pv_wait numbers were minimal as expected.
The lock_slowpath counts were higher with PV, but AFAICS the native
and paravirt lock_slowpath are not directly comparable.

Detailed kernbench numbers attached.
Thanks
Ankur


8-cpu-pinned, native
====================

Average Half load -j 4 Run (std deviation):
  Elapsed Time     303.686  (0.737652)
  User Time        1032.24  (2.8133)
  System Time      151.89   (0.246272)
  Percent CPU      389.2    (0.447214)
  Context Switches 19350.4  (82.1785)
  Sleeps           125885   (148.338)

Average Optimal load -j 32 Run (std deviation):
  Elapsed Time     187.068  (0.358427)
  User Time        1130.33  (103.405)
  System Time      162.715  (11.4129)
  Percent CPU      569.1    (189.633)
  Context Switches 143301   (130656)
  Sleeps           126938   (1132.83)

Average Maximal load -j Run (std deviation):
  Elapsed Time     189.098  (0.316812)
  User Time        1166.59  (98.4454)
  System Time      164.193  (9.4063)
  Percent CPU      627.133  (174.169)
  Context Switches 222270   (156005)
  Sleeps           122562   (6470.93)

8-cpu-pinned, pv
================

Average Half load -j 4 Run (std deviation):
  Elapsed Time     309.872  (5.882)
  User Time        1045.8   (18.5295)
  System Time      160.14   (4.83669)
  Percent CPU      388.8    (0.447214)
  Context Switches 41215.4  (679.522)
  Sleeps           122369   (477.593)

Average Optimal load -j 32 Run (std deviation):
  Elapsed Time     190.1    (0.377823)
  User Time        1144     (104.248)
  System Time      170.225  (11.1138)
  Percent CPU      568.2    (189.107)

Average Maximal load -j Run (std deviation):
  Elapsed Time     191.606  (0.108305)
  User Time        1178.83  (97.908)
  System Time      170.843  (8.9651)
  Percent CPU      625.8    (173.49)
  Context Switches 234878   (149479)
  Sleeps           120542   (6073.79)

64-cpu-pinned, native
=====================

Average Half load -j 32 Run (std deviation):
  Elapsed Time     54.306   (0.134833)
  User Time        1072.75  (1.34598)
  System Time      209.448  (0.370095)
  Percent CPU      2360.4   (4.03733)
  Context Switches 26999    (99.5414)
  Sleeps           122408   (184.87)

Average Optimal load -j 256 Run (std deviation):
  Elapsed Time     39.424   (0.150599)
  User Time        1140.91  (71.8722)
  System Time      267.401  (61.0928)
  Percent CPU      3125.9   (806.96)
  Context Switches 129662   (108217)
  Sleeps           121767   (699.198)

Average Maximal load -j Run (std deviation):
  Elapsed Time     41.562   (0.206083)
  User Time        1174.68  (75.9342)
  System Time      286.313  (56.5978)
  Percent CPU      3339.87  (719.062)
  Context Switches 203428   (138536)
  Sleeps           119066   (3993.58)

64-cpu-pinned, pv
=================

Average Half load -j 32 Run (std deviation):
  Elapsed Time     55.14    (0.0894427)
  User Time        1071.99  (1.43335)
  System Time      210.976  (0.424594)
  Percent CPU      2326     (4.52769)
  Context Switches 37544.8  (220.969)
  Sleeps           115527   (94.7138)

Average Optimal load -j 256 Run (std deviation):
  Elapsed Time     40.54    (0.246779)
  User Time        1137.41  (68.9773)
  System Time      285.73   (78.8021)
  Percent CPU      3090.7   (806.218)
  Context Switches 139059   (107006)
  Sleeps           116962   (1518.56)

Average Maximal load -j Run (std deviation):
  Elapsed Time     42.682   (0.170939)
  User Time        1171.64  (74.6663)
  System Time      307.721  (70.9758)
  Percent CPU      3303.27  (717.418)
  Context Switches 213430   (138616)
  Sleeps           115143   (2930.03)
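[ For context on the boot-time choice referenced above: a condensed
sketch of the decision kvm_spinlock_init() makes, based on the logic
described around commit b2798ba0b8 -- simplified, not the verbatim
kernel source. The series is essentially asking for this decision to
be re-evaluated when KVM_HINTS_REALTIME changes. ]

/*
 * Condensed sketch of the KVM guest's boot-time PV spinlock decision
 * (simplified from the logic around commit b2798ba0b8; not verbatim).
 */
static void __init kvm_spinlock_init(void)
{
	/* Host can't do PV unhalt? Then stay native. */
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return;

	/*
	 * KVM_HINTS_REALTIME: the host promises pinned vCPUs with no
	 * over-commit, so the cheaper native qspinlocks win. This is
	 * the hint that can later go stale.
	 */
	if (kvm_para_has_hint(KVM_HINTS_REALTIME))
		return;

	/* A single-vCPU guest never contends. */
	if (num_possible_cpus() == 1)
		return;

	__pv_init_lock_hash();
	pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
	pv_ops.lock.queued_spin_unlock =
		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
	pv_ops.lock.wait = kvm_wait;
	pv_ops.lock.kick = kvm_kick_cpu;
}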
On 2020-04-08 5:08 a.m., Peter Zijlstra wrote:

> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> to switch pv-ops based on that.
>
> So what, the paravirt spinlock stuff works just fine when you're not
> oversubscribed.
>
>> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
>> but PV-TLB ops are also good candidates)
>
> The PV-TLB ops also work just fine when not oversubscribed. IIRC
> kvm_flush_tlb_others() is pretty much the same in that case.
>
>> in .parainstructions.runtime, while discarding the .parainstructions
>> as usual at init. This is then used for switching back and forth
>> between native and paravirt mode. ([1] lists some representative
>> numbers of the increased memory footprint.)
>>
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>> 1) emulation in text_poke() can only easily handle a small set
>> of instructions and this is problematic for inlined pv-ops (and see
>> a possible alternatives use-case below.)
>> 2) paravirt patching might have inter-dependent ops (ex.
>> lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>> need to be updated atomically.)
>
> And then you hope that the spinlock state transfers.. That is, that
> both implementations agree what an unlocked spinlock looks like.
>
> Suppose the native one was a ticket spinlock, where unlocked means
> 'head == tail', while the paravirt one is a test-and-set spinlock,
> where unlocked means 'val == 0'.
>
> That just happens to not be the case now, but it was for a fair while.
>
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
>
> The whole late-microcode loading stuff is crazy already; you're making
> it take double bonghits.

That's fair. I was talking in a fairly limited sense, ex. making
static_cpu_has() catch up with boot_cpu_has() after a microcode update,
but I should have specified that.

>> Also, there are points of similarity with the ongoing static_call work
>> which does rewriting of indirect calls.
>
> Only insofar as code patching is involved. An analogy would be
> comparing having a beer with shooting dope. They're both 'drugs'.

I meant closer to updating indirect pointers, like static_call_update()
semantics. But of course I don't know the static_call code well enough.

>> The difference here is that we need to switch a group of calls
>> atomically and, given that some of them can be inlined, need to
>> handle a wider variety of opcodes.
>>
>> To patch safely we need to satisfy these constraints:
>>
>> - No references to insn sequences under replacement on any kernel
>>   stack once replacement is in progress. Without this constraint we
>>   might end up returning to an address that is in the middle of an
>>   instruction.
> Both ftrace and optprobes have that issue, and neither of them is
> quite as crazy as this.

I did look at ftrace. Will look at optprobes. Thanks.

>> - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
>>   lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
>>   a good example.
>
> While I'm sure this is a fun problem, why are we solving it?
>
>> - handle a broader set of insns than CALL and JMP: some pv-ops end up
>>   getting inlined. Alternatives can contain arbitrary instructions.
>
> So can optprobes.
>
>> - locking operations can be called from interrupt handlers which
>>   means we cannot trivially use IPIs for flushing.
>
> Heck, some NMI handlers use locks..

This does handle the NMI locking problem. The solution -- doing it in
the NMI handler -- was of course pretty ugly.

>> Handling these necessitates that target pv-ops not be preemptible.
>
> I don't think that is a correct inference.

The non-preemptibility requirement was to ensure that any pv-op under
replacement not be under execution after it is patched out. (Not a
concern for pv_lock_ops.)

Ensuring that we don't return to an address in the middle of an
instruction could be done by moving the NOPs to the prefix, but I
couldn't think of any other way to ensure that a function not be under
execution.

Thanks
Ankur

>> Once that is a given (for safety these need to be explicitly
>> whitelisted in runtime_patch()), use a state-machine with the primary
>> CPU doing the patching and secondary CPUs in a sync_core() loop.
>>
>> In case we hit an INT3/BP (in NMI or thread-context) we make forward
>> progress by continuing the patching instead of emulating.
>>
>> One remaining issue is inter-dependent pv-ops which are also executed
>> in the NMI handler -- patching can potentially deadlock in case of
>> multiple NMIs. Handle these by pushing some of this work into the NMI
>> handler where we know it will be uninterrupted.
>
> I'm just seeing a lot of bonghits without sane rationale. Why is any
> of this important?
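[ A rough sketch of the patching state machine the cover letter
describes -- hypothetical and condensed from the prose, not taken from
the actual patches; all names here (patch_state, text_patch_site, the
poke_* and this_cpu_is_primary() helpers) are invented for
illustration. The primary CPU rewrites a call site INT3-first while
the secondaries serialize in a sync_core() loop, so no CPU can execute
a half-written instruction. ]

/* Invented illustration of the described mechanism, kernel-style C. */
enum patch_state {
	PS_INT3,	/* primary installs an INT3 over the first byte */
	PS_TAIL,	/* primary writes the new instruction's tail bytes */
	PS_HEAD,	/* primary replaces the INT3 with the new first byte */
	PS_DONE,
};

static atomic_t pstate = ATOMIC_INIT(PS_INT3);

/* Runs on all CPUs under stop_machine(); exactly one is the primary. */
static int patch_worker(void *info)
{
	struct text_patch_site *site = info;	/* hypothetical type */

	if (this_cpu_is_primary()) {		/* hypothetical helper */
		poke_int3(site);		/* step 1: breakpoint */
		atomic_set(&pstate, PS_TAIL);
		poke_tail_bytes(site);		/* step 2: new tail */
		atomic_set(&pstate, PS_HEAD);
		poke_first_byte(site);		/* step 3: new opcode */
		atomic_set(&pstate, PS_DONE);
	} else {
		/*
		 * Secondaries can't make progress anyway; spinning in
		 * sync_core() ensures they never execute stale
		 * instruction bytes once patching completes.
		 */
		while (atomic_read(&pstate) != PS_DONE)
			sync_core();
	}
	return 0;
}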
On 2020-04-08 5:28 a.m., Jürgen Groß wrote:
> On 08.04.20 07:02, Ankur Arora wrote:

[ snip ]

> Quite a lot of code churn and hacks for a problem which should not
> occur on a well-administered machine.

Yeah, I agree the patch set is pretty large and clearly the NMI or the
stop_machine() are completely out. That said, as I wrote in my other
mail, I think the problem is still worth solving.

> Especially the NMI dependencies make me not want to Ack this series.

The NMI solution did turn out to be pretty ugly. I was using it to
solve two problems: to avoid a deadlock where an NMI handler could use
a lock while the stop_machine() thread is trying to rewrite the
corresponding call-sites, and to ensure that we don't lock and unlock
using mismatched primitives.

Thanks
Ankur

> Juergen
On 2020-04-08 7:12 a.m., Thomas Gleixner wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> to switch pv-ops based on that.
>
> If your host changes its advertised behaviour then you want to fix the
> host setup or find a competent admin.
>
>> This locktorture splat that I saw on the guest while testing this is
>> indicative of the problem:
>>
>>  [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
>>  [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
>>  [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>>
>> (Caused by an oversubscribed host but using mismatched native
>> pv_lock_ops on the guest.)
>
> And this illustrates what? The fact that you used a misconfigured setup.
>
>> This series addresses the problem by doing paravirt switching at
>> runtime.
>
> You're not addressing the problem. You're fixing the symptom, which is
> wrong to begin with.
>
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
>
> This has been discussed to death before and there is no safe subset as
> long as this hasn't been resolved:
>
> https://lore.kernel.org/lkml/alpine.DEB.2.21.1909062237580.1902@nanos.tec.linutronix.de/

Thanks. I was thinking of a fairly limited subset, ex. re-evaluating
X86_FEATURE_ALWAYS to make sure static_cpu_has() reflects reality, but
I guess that has second-order effects here.

Ankur

> Thanks,
>
> tglx
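[ On the static_cpu_has()/boot_cpu_has() distinction above: a minimal
sketch of the two shapes, simplified from the real
arch/x86/include/asm/cpufeature.h implementation (which uses asm goto
plus the alternatives machinery); boot_cpu_has_demo() is an invented
name. ]

#include <stdbool.h>

/*
 * boot_cpu_has(): an ordinary bitmap load and test, re-evaluated on
 * every call -- so it naturally picks up capability bits flipped by a
 * late microcode update.
 */
static inline bool boot_cpu_has_demo(const unsigned long *caps, int feature)
{
	int bits = 8 * sizeof(unsigned long);

	return (caps[feature / bits] >> (feature % bits)) & 1UL;
}

/*
 * static_cpu_has(), by contrast, starts life as a test-and-branch but
 * is rewritten by apply_alternatives() at boot into an unconditional
 * jump to the "yes" or "no" side. After that there is no runtime test
 * left to re-evaluate: a late microcode update that changes a feature
 * bit leaves every such patched call site stale unless something --
 * like the runtime apply_alternatives() floated in this thread --
 * re-patches it.
 */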