diff mbox

[v9,12/27] xsplice: Implement support for applying/reverting/replacing patches.

Message ID 20160427033952.GF26540@localhost.localdomain (mailing list archive)
State New, archived
Headers show

Commit Message

Konrad Rzeszutek Wilk April 27, 2016, 3:39 a.m. UTC
. snip..
> 
> Thinking about it again, even more stack conserving would be a
> bitmap...

Heheh
> 
> > +static int apply_payload(struct payload *data)
> > +{
> > +    unsigned int i;
> > +
> > +    printk(XENLOG_INFO XSPLICE "%s: Applying %u functions\n",
> > +            data->name, data->nfuncs);
> > +
> > +    arch_xsplice_patching_enter();
> > +
> > +    for ( i = 0; i < data->nfuncs; i++ )
> > +        arch_xsplice_apply_jmp(&data->funcs[i]);
> > +
> > +    arch_xsplice_patching_leave();
> > +
> > +    list_add_tail_rcu(&data->applied_list, &applied_list);
> 
> Neither in the comment earlier on nor here it becomes clear that this
> is more of an abuse than a use of RCU.

I added more comments and..
> 
> > +struct xsplice_patch_func {
> > +    const char *name;       /* Name of function to be patched. */
> > +    void *new_addr;
> > +    void *old_addr;
> > +    uint32_t new_size;
> > +    uint32_t old_size;
> > +    uint8_t version;        /* MUST be XSPLICE_PAYLOAD_VERSION. */
> > +    uint8_t opaque[31];     /* MUST be zero filled. */
> 
> I don't see the zero filling being a requirement, nor it being enforced.

.. removed this.

From adeadf8babcc5ef6d512cdc28899b4d1de34c60e Mon Sep 17 00:00:00 2001
From: Ross Lagerwall <ross.lagerwall@citrix.com>
Date: Thu, 21 Apr 2016 06:14:29 -0400
Subject: [PATCH] xsplice: Implement support for applying/reverting/replacing
 patches.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement support for the apply, revert and replace actions.

To perform and action on a payload, the hypercall sets up a data
structure to schedule the work.  A hook is added in the reset_stack_and_jump
to check for work and execute it if needed (specifically we check an
per-cpu flag to make this as quick as possible).

In this way, patches can be applied with all CPUs idle and without
stacks.  The first CPU to run check_for_xsplice_work() becomes the
master and triggers a reschedule softirq to trigger all the other CPUs
to enter check_for_xsplice_work() with no stack.  Once all CPUs
have rendezvoused, all CPUs disable their IRQs and NMIs are ignored.
The system is then quiscient and the master performs the action.
After this, all CPUs enable IRQs and NMIs are re-enabled.

Note that it is unsafe to patch do_nmi and the xSplice internal functions.
Patching functions on NMI/MCE path is liable to end in disaster on x86.
This is not addressed in this patch and is mentioned in the
design doc as a further TODO.

The action to perform is one of:
- APPLY: For each function in the module, store the first arch-specific
  number bytes of the old function and replace it with a jump to the
  new function. (on x86 it is 5 bytes, on ARM it will likey be 4 bytes).
- REVERT: Copy the previously stored bytes into the first arch-specific
  number of bytes of the old function (again, 5 bytes on x86).
- REPLACE: Revert each applied module and then apply the new module.

To prevent a deadlock with any other barrier in the system, the master
will wait for up to 30ms before timing out.
Measurements found that the patch application to take about 100 ?s on a
72 CPU system, whether idle or fully loaded.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Julien Grall <julien.grall@arm.com>

--
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

v2: - Pluck the 'struct xsplice_patch_func' in this patch.
    - Modify code per review comments.
    - Add more data in the keyboard handler.
    - Redo the patching code, split it in functions.
v3: - Add return_ macro for debug builds.
    - Move s/payload_list_lock/payload_list/ to earlier patch
    - Remove const and use ELF types for xsplice_patch_func
     - Add check routine to do simple sanity checks for various
      sections.
    - s/%p/PRIx64/ as ARM builds complain.
    - Move code around. Add more dprintk. Add XSPLICE in front of all
      printks/dprintk.
      Put the NMIs back if we fail patching.
      Add per-cpu to lessen contention for global structure.
      Extract from xsplice_do_single patching code into xsplice_do_action
      Squash xsplice_do_single and check_for_xsplice_work together to
      have all rendezvous in one place.
      Made XSPLICE_ACTION_REPLACE work again (wrong list iterator)
      s/find_special_sections/prepare_payload/
      Use list_del_init and INIT_LIST_HEAD for applied_list
v4:
   - Add comment, adjust spacing for "Timed out on CPU semaphore"
   - Added CR0.WP manipulations when altering the .text of hypervisor.
   - Added fix from Andrew for CR0.WP manipulation.
v5: - Made xsplice_patch_func use uintXX_t instead of ELF_ types to easy
      making it work under ARM (32bit). Add more BUILD-BUG-ON checks.
    - Add more BUILD_ON checks. Sprinkle newlines.
v6: Rebase on "arm/x86: Alter nmi_callback_t typedef"
   - Drop the recursive spinlock usage.
   - Move NMI callbacks in arch specific.
   - Fold the 'check_for_xsplice_work' in reset_stack_and_jump
   - Add arch specific check for .xsplice.funcs.
   - Seperate external and internal structure of .xsplice.funcs.
   - Changed per Jan's review
   - Modified the .xsplice.funcs checks
v7:
   - Modified old_ptr to void* instead of uint8_t*
   - Modified the xsplice_patch_func_internal for ARM32 to have padding.
   - Used #if BITS_PER_LONG == 64 for the xsplice_patch_func_internal along
     with ifndef CONFIG_ARM for the undo (which may be different size on ARM64)
v8:
  - Add "is empty" if special sections are in fact empty.
  - Added Andrew's Reviewed-by:
  - Rebase on v7.2 of  x86/mm: Introduce modify_xen_mappings()
  - Change some of printk to dprintk and some of the dprintk to printk.
  - Make the xsplice_patch_func (and the internal) structure have uint32_t
    (instead of uint64_t) if BITS_PER_LONG==32. This makes the size and
    offset different so note that in the design and common code.
  - Add #undef ACTION
  - Guard struct xsplice_patch_func in sysctl.h with __XEN__ as toolstacks
    will fail to compile. We do have BITS_PER_LONG defined in xc_bitops.h but
    that will go away (and also that macro uses sizeof and the pre-processor
    will choke on that).
  - Dropped Julien's Acked as I replaced BITS_PER_LONG/CONFIG_ARM_32.
    (Stefano is OK with it, but would prefer BITS_PER_LONG, Jan does not want
    BITS_PER_LONG).
v9: Expose the struct xsplice_patch_func old_addr and new_addr as void
    instead of uint32_t or uint64_t.
  - Added Julien' Ack back.
  - Rename pad to opaque.
  - Added comment in aidle_loop.
  - Squash internal and public of 'xsplice_patch_func'
  - Fixed remainig sizeof use.
  - Removed reference to MCE
  - Fixed comment styles.
  - Use bool_t in check_special_sections
  - Add a #define for .xsplice.funcs.
  - Remove full stops from printk
  - Fix xsplice_do_action per Jan's punchlist
  - Use spin_lock_try in keyhandler
  - Remove leading underscores from __CHECK_FOR_XSPLICE_WORK
  - Don't fail compilation on GCC5 - we MUST have rc set.
  - Don't bail out if finding !sh_type as those are for .rela or .debug
    and while we don't need to allocate it (as we had already done
    the relocation), do continue.
  - Make applied_list be an RCU type to guard against infinite loops
    when searching the applied_list.
  - Dropped the irq_semaphore and are re-using the semaphore atomic when
    CPUs have rendezvoused and are ready to go in IRQ disable phase.
v10: Drop Reviewed-by
  - Use bitmap for in check_special_sections to check for sections
    appearing twice.
  - Add comment about us abusing the list RCU for our safety reasons.
  - And remove MUST comment about opaque having to be zero filled.
---
 xen/arch/arm/xsplice.c        |  33 +++
 xen/arch/x86/domain.c         |   6 +
 xen/arch/x86/xsplice.c        |  76 +++++++
 xen/common/xsplice.c          | 480 +++++++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/current.h |  10 +-
 xen/include/public/sysctl.h   |  20 ++
 xen/include/xen/xsplice.h     |  21 ++
 7 files changed, 634 insertions(+), 12 deletions(-)

Comments

Jan Beulich April 27, 2016, 8:36 a.m. UTC | #1
>>> On 27.04.16 at 05:39, <konrad.wilk@oracle.com> wrote:
> +static int check_special_sections(const struct xsplice_elf *elf)
> +{
> +    unsigned int i;
> +    static const char *const names[] = { ELF_XSPLICE_FUNC };
> +    DECLARE_BITMAP(count, ARRAY_SIZE(names)) = { 0 };

Perhaps better "seen" or "found" or some such, now that this
isn't an array of counters anymore?

> +    for ( i = 0; i < ARRAY_SIZE(names); i++ )
> +    {
> +        const struct xsplice_elf_sec *sec;
> +
> +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> +        if ( !sec )
> +        {
> +            dprintk(XENLOG_ERR, XSPLICE "%s: %s is missing!\n",
> +                    elf->name, names[i]);
> +            return -EINVAL;
> +        }
> +
> +        if ( !sec->sec->sh_size )
> +        {
> +            dprintk(XENLOG_ERR, XSPLICE "%s: %s is empty!\n",
> +                    elf->name, names[i]);
> +            return -EINVAL;
> +        }
> +
> +        if ( test_bit(i, count) )
> +        {
> +            dprintk(XENLOG_ERR, XSPLICE "%s: %s was seen more than once!\n",
> +                    elf->name, names[i]);
> +            return -EINVAL;
> +        }
> +
> +        __set_bit(i, count);

__test_and_set_bit() to fold the two?

In any event, feel free to add my ack.

Jan
Martin Pohlack May 11, 2016, 9:51 a.m. UTC | #2
On 27.04.2016 05:39, Konrad Rzeszutek Wilk wrote:
[...]
> +        /* "Mask" NMIs. */
> +        arch_xsplice_mask();

You mask here ...

> +        barrier(); /* MUST do it after get_cpu_maps. */
> +        cpus = num_online_cpus() - 1;
> +
> +        if ( cpus )
> +        {
> +            dprintk(XENLOG_DEBUG, XSPLICE "%s: CPU%u - IPIing the other %u CPUs\n",
> +                    p->name, cpu, cpus);
> +            smp_call_function(reschedule_fn, NULL, 0);
> +        }
> +
> +        timeout = xsplice_work.timeout + NOW();
> +        if ( xsplice_spin(&xsplice_work.semaphore, timeout, cpus, "CPU") )
> +            goto abort;

... and potentially abort here, but the abort path does not unmask, so
you lose the NMI handler.

> +
> +        /* All CPUs are waiting, now signal to disable IRQs. */
> +        atomic_set(&xsplice_work.semaphore, 0);
> +        /*
> +         * MUST have a barrier after semaphore so that the other CPUs don't
> +         * leak out of the 'Wait for all CPUs to rendezvous' loop and increment
> +         * 'semaphore' before we set it to zero.
> +         */
> +        smp_wmb();
> +        xsplice_work.ready = 1;
> +
> +        if ( !xsplice_spin(&xsplice_work.semaphore, timeout, cpus, "IRQ") )
> +        {
> +            local_irq_save(flags);
> +            /* Do the patching. */
> +            xsplice_do_action();
> +            /* Serialize and flush out the CPU via CPUID instruction (on x86). */
> +            arch_xsplice_post_action();
> +            local_irq_restore(flags);
> +        }
> +        arch_xsplice_unmask();
> +
> + abort:
> +        per_cpu(work_to_do, cpu) = 0;
> +        xsplice_work.do_work = 0;
> +
> +        /* put_cpu_maps has an barrier(). */
> +        put_cpu_maps();
> +
> +        printk(XENLOG_INFO XSPLICE "%s finished %s with rc=%d\n",
> +               p->name, names[xsplice_work.cmd], p->rc);
> +    }
> +    else
[...]

Martin
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Konrad Rzeszutek Wilk May 11, 2016, 1:56 p.m. UTC | #3
On Wed, May 11, 2016 at 11:51:53AM +0200, Martin Pohlack wrote:
> On 27.04.2016 05:39, Konrad Rzeszutek Wilk wrote:
> [...]
> > +        /* "Mask" NMIs. */
> > +        arch_xsplice_mask();
> 
> You mask here ...
> 
> > +        barrier(); /* MUST do it after get_cpu_maps. */
> > +        cpus = num_online_cpus() - 1;
> > +
> > +        if ( cpus )
> > +        {
> > +            dprintk(XENLOG_DEBUG, XSPLICE "%s: CPU%u - IPIing the other %u CPUs\n",
> > +                    p->name, cpu, cpus);
> > +            smp_call_function(reschedule_fn, NULL, 0);
> > +        }
> > +
> > +        timeout = xsplice_work.timeout + NOW();
> > +        if ( xsplice_spin(&xsplice_work.semaphore, timeout, cpus, "CPU") )
> > +            goto abort;
> 
> ... and potentially abort here, but the abort path does not unmask, so
> you lose the NMI handler.

Ouch! Sending a patch shortly out! Thanks for spotting that
diff mbox

Patch

diff --git a/xen/arch/arm/xsplice.c b/xen/arch/arm/xsplice.c
index 8cb7767..db0dce2 100644
--- a/xen/arch/arm/xsplice.c
+++ b/xen/arch/arm/xsplice.c
@@ -7,6 +7,39 @@ 
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
+void arch_xsplice_patching_enter(void)
+{
+}
+
+void arch_xsplice_patching_leave(void)
+{
+}
+
+int arch_xsplice_verify_func(const struct xsplice_patch_func *func)
+{
+    return -ENOSYS;
+}
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
+{
+}
+
+void arch_xsplice_revert_jmp(const struct xsplice_patch_func *func)
+{
+}
+
+void arch_xsplice_post_action(void)
+{
+}
+
+void arch_xsplice_mask(void)
+{
+}
+
+void arch_xsplice_unmask(void)
+{
+}
+
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf)
 {
     return -ENOSYS;
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index e93ff20..d13b272 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -36,6 +36,7 @@ 
 #include <xen/cpu.h>
 #include <xen/wait.h>
 #include <xen/guest_access.h>
+#include <xen/xsplice.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
 #include <asm/regs.h>
@@ -120,6 +121,11 @@  static void idle_loop(void)
         (*pm_idle)();
         do_tasklet();
         do_softirq();
+        /*
+         * We MUST be last (or before pm_idle). Otherwise after we get the
+         * softirq we would execute pm_idle (and sleep) and not patch.
+         */
+        check_for_xsplice_work();
     }
 }
 
diff --git a/xen/arch/x86/xsplice.c b/xen/arch/x86/xsplice.c
index e50cbc0..fc1338e 100644
--- a/xen/arch/x86/xsplice.c
+++ b/xen/arch/x86/xsplice.c
@@ -11,6 +11,82 @@ 
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
+#include <asm/nmi.h>
+
+#define PATCH_INSN_SIZE 5
+
+void arch_xsplice_patching_enter(void)
+{
+    /* Disable WP to allow changes to read-only pages. */
+    write_cr0(read_cr0() & ~X86_CR0_WP);
+}
+
+void arch_xsplice_patching_leave(void)
+{
+    /* Reinstate WP. */
+    write_cr0(read_cr0() | X86_CR0_WP);
+}
+
+int arch_xsplice_verify_func(const struct xsplice_patch_func *func)
+{
+    /* No NOP patching yet. */
+    if ( !func->new_size )
+        return -EOPNOTSUPP;
+
+    if ( func->old_size < PATCH_INSN_SIZE )
+        return -EINVAL;
+
+    return 0;
+}
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
+{
+    int32_t val;
+    uint8_t *old_ptr;
+
+    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->opaque));
+    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof(val)));
+
+    old_ptr = func->old_addr;
+    memcpy(func->opaque, old_ptr, PATCH_INSN_SIZE);
+
+    *old_ptr++ = 0xe9; /* Relative jump */
+    val = func->new_addr - func->old_addr - PATCH_INSN_SIZE;
+    memcpy(old_ptr, &val, sizeof(val));
+}
+
+void arch_xsplice_revert_jmp(const struct xsplice_patch_func *func)
+{
+    memcpy(func->old_addr, func->opaque, PATCH_INSN_SIZE);
+}
+
+/* Serialise the CPU pipeline. */
+void arch_xsplice_post_action(void)
+{
+    cpuid_eax(0);
+}
+
+static nmi_callback_t *saved_nmi_callback;
+/*
+ * Note that because of this NOP code the do_nmi is not safely patchable.
+ * Also if we do receive 'real' NMIs we have lost them.
+ */
+static int mask_nmi_callback(const struct cpu_user_regs *regs, int cpu)
+{
+    /* TODO: Handle missing NMI/MCE.*/
+    return 1;
+}
+
+void arch_xsplice_mask(void)
+{
+    saved_nmi_callback = set_nmi_callback(mask_nmi_callback);
+}
+
+void arch_xsplice_unmask(void)
+{
+    set_nmi_callback(saved_nmi_callback);
+}
+
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf)
 {
 
diff --git a/xen/common/xsplice.c b/xen/common/xsplice.c
index 3f3aacc..6d94023 100644
--- a/xen/common/xsplice.c
+++ b/xen/common/xsplice.c
@@ -3,6 +3,7 @@ 
  *
  */
 
+#include <xen/cpu.h>
 #include <xen/err.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
@@ -11,18 +12,30 @@ 
 #include <xen/mm.h>
 #include <xen/sched.h>
 #include <xen/smp.h>
+#include <xen/softirq.h>
 #include <xen/spinlock.h>
 #include <xen/vmap.h>
+#include <xen/wait.h>
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
 #include <asm/event.h>
-#include <public/sysctl.h>
 
-/* Protects against payload_list operations. */
+/*
+ * Protects against payload_list operations and also allows only one
+ * caller in schedule_work.
+ */
 static DEFINE_SPINLOCK(payload_lock);
 static LIST_HEAD(payload_list);
 
+/*
+ * Patches which have been applied. Need RCU in case we crash (and then
+ * traps code would iterate via applied_list) when adding entries on the list.
+ *
+ * Note: There are no 'rcu_applied_lock' as we don't iterate yet the list.
+ */
+static LIST_HEAD(applied_list);
+
 static unsigned int payload_cnt;
 static unsigned int payload_version = 1;
 
@@ -37,9 +50,35 @@  struct payload {
     const void *ro_addr;                 /* Virtual address of .rodata. */
     size_t ro_size;                      /* .. and its size (if any). */
     unsigned int pages;                  /* Total pages for [text,rw,ro]_addr */
+    struct list_head applied_list;       /* Linked to 'applied_list'. */
+    struct xsplice_patch_func *funcs;    /* The array of functions to patch. */
+    unsigned int nfuncs;                 /* Nr of functions to patch. */
     char name[XEN_XSPLICE_NAME_SIZE];    /* Name of it. */
 };
 
+/* Defines an outstanding patching action. */
+struct xsplice_work
+{
+    atomic_t semaphore;          /* Used to rendezvous CPUs in
+                                    check_for_xsplice_work. */
+    uint32_t timeout;            /* Timeout to do the operation. */
+    struct payload *data;        /* The payload on which to act. */
+    volatile bool_t do_work;     /* Signals work to do. */
+    volatile bool_t ready;       /* Signals all CPUs synchronized. */
+    unsigned int cmd;            /* Action request: XSPLICE_ACTION_* */
+};
+
+/* There can be only one outstanding patching action. */
+static struct xsplice_work xsplice_work;
+
+/*
+ * Indicate whether the CPU needs to consult xsplice_work structure.
+ * We want an per-cpu data structure otherwise the check_for_xsplice_work
+ * would hammer a global xsplice_work structure on every guest VMEXIT.
+ * Having an per-cpu lessens the load.
+ */
+static DEFINE_PER_CPU(bool_t, work_to_do);
+
 static int get_name(const xen_xsplice_name_t *name, char *n)
 {
     if ( !name->size || name->size > XEN_XSPLICE_NAME_SIZE )
@@ -268,6 +307,91 @@  static int secure_payload(struct payload *payload, struct xsplice_elf *elf)
     return rc;
 }
 
+static int check_special_sections(const struct xsplice_elf *elf)
+{
+    unsigned int i;
+    static const char *const names[] = { ELF_XSPLICE_FUNC };
+    DECLARE_BITMAP(count, ARRAY_SIZE(names)) = { 0 };
+
+    for ( i = 0; i < ARRAY_SIZE(names); i++ )
+    {
+        const struct xsplice_elf_sec *sec;
+
+        sec = xsplice_elf_sec_by_name(elf, names[i]);
+        if ( !sec )
+        {
+            dprintk(XENLOG_ERR, XSPLICE "%s: %s is missing!\n",
+                    elf->name, names[i]);
+            return -EINVAL;
+        }
+
+        if ( !sec->sec->sh_size )
+        {
+            dprintk(XENLOG_ERR, XSPLICE "%s: %s is empty!\n",
+                    elf->name, names[i]);
+            return -EINVAL;
+        }
+
+        if ( test_bit(i, count) )
+        {
+            dprintk(XENLOG_ERR, XSPLICE "%s: %s was seen more than once!\n",
+                    elf->name, names[i]);
+            return -EINVAL;
+        }
+
+        __set_bit(i, count);
+    }
+
+    return 0;
+}
+
+static int prepare_payload(struct payload *payload,
+                           struct xsplice_elf *elf)
+{
+    const struct xsplice_elf_sec *sec;
+    unsigned int i;
+    struct xsplice_patch_func *f;
+
+    sec = xsplice_elf_sec_by_name(elf, ELF_XSPLICE_FUNC);
+    ASSERT(sec);
+    if ( sec->sec->sh_size % sizeof(*payload->funcs) )
+    {
+        dprintk(XENLOG_ERR, XSPLICE "%s: Wrong size of "ELF_XSPLICE_FUNC"!\n",
+                elf->name);
+        return -EINVAL;
+    }
+
+    payload->funcs = sec->load_addr;
+    payload->nfuncs = sec->sec->sh_size / sizeof(*payload->funcs);
+
+    for ( i = 0; i < payload->nfuncs; i++ )
+    {
+        int rc;
+
+        f = &(payload->funcs[i]);
+
+        if ( f->version != XSPLICE_PAYLOAD_VERSION )
+        {
+            dprintk(XENLOG_ERR, XSPLICE "%s: Wrong version (%u). Expected %d!\n",
+                    elf->name, f->version, XSPLICE_PAYLOAD_VERSION);
+            return -EOPNOTSUPP;
+        }
+
+        if ( !f->new_addr || !f->new_size )
+        {
+            dprintk(XENLOG_ERR, XSPLICE "%s: Address or size fields are zero!\n",
+                    elf->name);
+            return -EINVAL;
+        }
+
+        rc = arch_xsplice_verify_func(f);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static void free_payload(struct payload *data)
 {
     ASSERT(spin_is_locked(&payload_lock));
@@ -299,6 +423,14 @@  static int load_payload_data(struct payload *payload, void *raw, size_t len)
     if ( rc )
         goto out;
 
+    rc = check_special_sections(&elf);
+    if ( rc )
+        goto out;
+
+    rc = prepare_payload(payload, &elf);
+    if ( rc )
+        goto out;
+
     rc = secure_payload(payload, &elf);
 
  out:
@@ -354,6 +486,7 @@  static int xsplice_upload(xen_sysctl_xsplice_upload_t *upload)
 
     data->state = XSPLICE_STATE_CHECKED;
     INIT_LIST_HEAD(&data->list);
+    INIT_LIST_HEAD(&data->applied_list);
 
     list_add_tail(&data->list, &payload_list);
     payload_cnt++;
@@ -464,6 +597,313 @@  static int xsplice_list(xen_sysctl_xsplice_list_t *list)
     return rc ? : idx;
 }
 
+/*
+ * The following functions get the CPUs into an appropriate state and
+ * apply (or revert) each of the payload's functions. This is needed
+ * for XEN_SYSCTL_XSPLICE_ACTION operation (see xsplice_action).
+ */
+
+static int apply_payload(struct payload *data)
+{
+    unsigned int i;
+
+    printk(XENLOG_INFO XSPLICE "%s: Applying %u functions\n",
+            data->name, data->nfuncs);
+
+    arch_xsplice_patching_enter();
+
+    for ( i = 0; i < data->nfuncs; i++ )
+        arch_xsplice_apply_jmp(&data->funcs[i]);
+
+    arch_xsplice_patching_leave();
+
+    /*
+     * We need RCU variant (which has barriers) in case we crash here.
+     * The applied_list is iterated by the trap code.
+     */
+    list_add_tail_rcu(&data->applied_list, &applied_list);
+
+    return 0;
+}
+
+static int revert_payload(struct payload *data)
+{
+    unsigned int i;
+
+    printk(XENLOG_INFO XSPLICE "%s: Reverting\n", data->name);
+
+    arch_xsplice_patching_enter();
+
+    for ( i = 0; i < data->nfuncs; i++ )
+        arch_xsplice_revert_jmp(&data->funcs[i]);
+
+    arch_xsplice_patching_leave();
+
+    /*
+     * We need RCU variant (which has barriers) in case we crash here.
+     * The applied_list is iterated by the trap code.
+     */
+    list_del_rcu(&data->applied_list);
+
+    return 0;
+}
+
+/*
+ * This function is executed having all other CPUs with no deep stack (we may
+ * have cpu_idle on it) and IRQs disabled.
+ */
+static void xsplice_do_action(void)
+{
+    int rc;
+    struct payload *data, *other, *tmp;
+
+    data = xsplice_work.data;
+    /*
+     * This function and the transition from asm to C code should be the only
+     * one on any stack. No need to lock the payload list or applied list.
+     */
+    switch ( xsplice_work.cmd )
+    {
+    case XSPLICE_ACTION_APPLY:
+        rc = apply_payload(data);
+        if ( rc == 0 )
+            data->state = XSPLICE_STATE_APPLIED;
+        break;
+
+    case XSPLICE_ACTION_REVERT:
+        rc = revert_payload(data);
+        if ( rc == 0 )
+            data->state = XSPLICE_STATE_CHECKED;
+        break;
+
+    case XSPLICE_ACTION_REPLACE:
+        rc = 0;
+        /*
+	 * N.B: Use 'applied_list' member, not 'list'. We also abuse the
+	 * the 'normal' list iterator as the list is an RCU one.
+	 */
+        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, applied_list )
+        {
+            other->rc = revert_payload(other);
+            if ( other->rc == 0 )
+                other->state = XSPLICE_STATE_CHECKED;
+            else
+            {
+                rc = -EINVAL;
+                break;
+            }
+        }
+
+        if ( rc == 0 )
+        {
+            rc = apply_payload(data);
+            if ( rc == 0 )
+                data->state = XSPLICE_STATE_APPLIED;
+        }
+        break;
+
+    default:
+        rc = -EINVAL; /* Make GCC5 happy. */
+        ASSERT_UNREACHABLE();
+        break;
+    }
+
+    /* We must set rc as xsplice_action sets it to -EAGAIN when kicking of. */
+    data->rc = rc;
+}
+
+static int schedule_work(struct payload *data, uint32_t cmd, uint32_t timeout)
+{
+    ASSERT(spin_is_locked(&payload_lock));
+
+    /* Fail if an operation is already scheduled. */
+    if ( xsplice_work.do_work )
+        return -EBUSY;
+
+    if ( !get_cpu_maps() )
+    {
+        printk(XENLOG_ERR XSPLICE "%s: unable to get cpu_maps lock!\n",
+               data->name);
+        return -EBUSY;
+    }
+
+    xsplice_work.cmd = cmd;
+    xsplice_work.data = data;
+    xsplice_work.timeout = timeout ?: MILLISECS(30);
+
+    dprintk(XENLOG_DEBUG, XSPLICE "%s: timeout is %"PRI_stime"ms\n",
+            data->name, xsplice_work.timeout / MILLISECS(1));
+
+    atomic_set(&xsplice_work.semaphore, -1);
+
+    xsplice_work.ready = 0;
+
+    smp_wmb();
+
+    xsplice_work.do_work = 1;
+    this_cpu(work_to_do) = 1;
+
+    put_cpu_maps();
+
+    return 0;
+}
+
+static void reschedule_fn(void *unused)
+{
+    this_cpu(work_to_do) = 1;
+    raise_softirq(SCHEDULE_SOFTIRQ);
+}
+
+static int xsplice_spin(atomic_t *counter, s_time_t timeout,
+                           unsigned int cpus, const char *s)
+{
+    int rc = 0;
+
+    while ( atomic_read(counter) != cpus && NOW() < timeout )
+        cpu_relax();
+
+    /* Log & abort. */
+    if ( atomic_read(counter) != cpus )
+    {
+        printk(XENLOG_ERR XSPLICE "%s: Timed out on semaphore in %s quiesce phase %u/%u\n",
+               xsplice_work.data->name, s, atomic_read(counter), cpus);
+        rc = -EBUSY;
+        xsplice_work.data->rc = rc;
+        smp_wmb();
+        xsplice_work.do_work = 0;
+    }
+
+    return rc;
+}
+
+/*
+ * The main function which manages the work of quiescing the system and
+ * patching code.
+ */
+void check_for_xsplice_work(void)
+{
+#define ACTION(x) [XSPLICE_ACTION_##x] = #x
+    static const char *const names[] = {
+            ACTION(APPLY),
+            ACTION(REVERT),
+            ACTION(REPLACE),
+    };
+#undef ACTION
+    unsigned int cpu = smp_processor_id();
+    s_time_t timeout;
+    unsigned long flags;
+
+    /* Fast path: no work to do. */
+    if ( !per_cpu(work_to_do, cpu ) )
+        return;
+
+    smp_rmb();
+    /* In case we aborted, other CPUs can skip right away. */
+    if ( !xsplice_work.do_work )
+    {
+        per_cpu(work_to_do, cpu) = 0;
+        return;
+    }
+
+    ASSERT(local_irq_is_enabled());
+
+    /* Set at -1, so will go up to num_online_cpus - 1. */
+    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
+    {
+        struct payload *p;
+        unsigned int cpus;
+
+        p = xsplice_work.data;
+        if ( !get_cpu_maps() )
+        {
+            printk(XENLOG_ERR XSPLICE "%s: CPU%u - unable to get cpu_maps lock!\n",
+                   p->name, cpu);
+            per_cpu(work_to_do, cpu) = 0;
+            xsplice_work.data->rc = -EBUSY;
+            smp_wmb();
+            xsplice_work.do_work = 0;
+            /*
+             * Do NOT decrement xsplice_work.semaphore down - as that may cause
+             * the other CPU (which may be at this point ready to increment it)
+             * to assume the role of master and then needlessly time out
+             * out (as do_work is zero).
+             */
+            return;
+        }
+        /* "Mask" NMIs. */
+        arch_xsplice_mask();
+
+        barrier(); /* MUST do it after get_cpu_maps. */
+        cpus = num_online_cpus() - 1;
+
+        if ( cpus )
+        {
+            dprintk(XENLOG_DEBUG, XSPLICE "%s: CPU%u - IPIing the other %u CPUs\n",
+                    p->name, cpu, cpus);
+            smp_call_function(reschedule_fn, NULL, 0);
+        }
+
+        timeout = xsplice_work.timeout + NOW();
+        if ( xsplice_spin(&xsplice_work.semaphore, timeout, cpus, "CPU") )
+            goto abort;
+
+        /* All CPUs are waiting, now signal to disable IRQs. */
+        atomic_set(&xsplice_work.semaphore, 0);
+        /*
+         * MUST have a barrier after semaphore so that the other CPUs don't
+         * leak out of the 'Wait for all CPUs to rendezvous' loop and increment
+         * 'semaphore' before we set it to zero.
+         */
+        smp_wmb();
+        xsplice_work.ready = 1;
+
+        if ( !xsplice_spin(&xsplice_work.semaphore, timeout, cpus, "IRQ") )
+        {
+            local_irq_save(flags);
+            /* Do the patching. */
+            xsplice_do_action();
+            /* Serialize and flush out the CPU via CPUID instruction (on x86). */
+            arch_xsplice_post_action();
+            local_irq_restore(flags);
+        }
+        arch_xsplice_unmask();
+
+ abort:
+        per_cpu(work_to_do, cpu) = 0;
+        xsplice_work.do_work = 0;
+
+        /* put_cpu_maps has an barrier(). */
+        put_cpu_maps();
+
+        printk(XENLOG_INFO XSPLICE "%s finished %s with rc=%d\n",
+               p->name, names[xsplice_work.cmd], p->rc);
+    }
+    else
+    {
+        /* Wait for all CPUs to rendezvous. */
+        while ( xsplice_work.do_work && !xsplice_work.ready )
+            cpu_relax();
+
+        /* Disable IRQs and signal. */
+        local_irq_save(flags);
+        /*
+         * We re-use the sempahore, so MUST have it reset by master before
+         * we exit the loop above.
+         */
+        atomic_inc(&xsplice_work.semaphore);
+
+        /* Wait for patching to complete. */
+        while ( xsplice_work.do_work )
+            cpu_relax();
+
+        /* To flush out pipeline. */
+        arch_xsplice_post_action();
+        local_irq_restore(flags);
+
+        per_cpu(work_to_do, cpu) = 0;
+    }
+}
+
 static int xsplice_action(xen_sysctl_xsplice_action_t *action)
 {
     struct payload *data;
@@ -503,27 +943,24 @@  static int xsplice_action(xen_sysctl_xsplice_action_t *action)
     case XSPLICE_ACTION_REVERT:
         if ( data->state == XSPLICE_STATE_APPLIED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_CHECKED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
     case XSPLICE_ACTION_APPLY:
         if ( data->state == XSPLICE_STATE_CHECKED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_APPLIED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
     case XSPLICE_ACTION_REPLACE:
         if ( data->state == XSPLICE_STATE_CHECKED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_CHECKED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
@@ -588,6 +1025,7 @@  static const char *state2str(unsigned int state)
 static void xsplice_printall(unsigned char key)
 {
     struct payload *data;
+    unsigned int i;
 
     printk("'%c' pressed - Dumping all xsplice patches\n", key);
 
@@ -598,10 +1036,30 @@  static void xsplice_printall(unsigned char key)
     }
 
     list_for_each_entry ( data, &payload_list, list )
+    {
         printk(" name=%s state=%s(%d) %p (.data=%p, .rodata=%p) using %u pages.\n",
                data->name, state2str(data->state), data->state, data->text_addr,
                data->rw_addr, data->ro_addr, data->pages);
 
+        for ( i = 0; i < data->nfuncs; i++ )
+        {
+            struct xsplice_patch_func *f = &(data->funcs[i]);
+            printk("    %s patch %p(%u) with %p (%u)\n",
+                   f->name, f->old_addr, f->old_size, f->new_addr, f->new_size);
+
+            if ( i && !(i % 64) )
+            {
+                spin_unlock(&payload_lock);
+                process_pending_softirqs();
+                if ( spin_trylock(&payload_lock) )
+                {
+                    printk("Couldn't reacquire lock. Try again.\n");
+                    return;
+                }
+            }
+        }
+    }
+
     spin_unlock(&payload_lock);
 }
 
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 4083261..73a7209 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -86,10 +86,18 @@  static inline struct cpu_info *get_cpu_info(void)
 unsigned long get_stack_trace_bottom(unsigned long sp);
 unsigned long get_stack_dump_bottom (unsigned long sp);
 
+#ifdef CONFIG_XSPLICE
+# define CHECK_FOR_XSPLICE_WORK "call check_for_xsplice_work;"
+#else
+# define CHECK_FOR_XSPLICE_WORK ""
+#endif
+
 #define reset_stack_and_jump(__fn)                                      \
     ({                                                                  \
         __asm__ __volatile__ (                                          \
-            "mov %0,%%"__OP"sp; jmp %c1"                                \
+            "mov %0,%%"__OP"sp;"                                        \
+            CHECK_FOR_XSPLICE_WORK                                      \
+             "jmp %c1"                                                  \
             : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
         unreachable();                                                  \
     })
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 3fa1fe7..b2b312b 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -869,6 +869,26 @@  DEFINE_XEN_GUEST_HANDLE(xen_sysctl_featureset_t);
  *     If zero exit with success.
  */
 
+#define XSPLICE_PAYLOAD_VERSION 1
+/*
+ * .xsplice.funcs structure layout defined in the `Payload format`
+ * section in the xSplice design document.
+ *
+ * We guard this with __XEN__ as toolstacks SHOULD not use it.
+ */
+#ifdef __XEN__
+struct xsplice_patch_func {
+    const char *name;       /* Name of function to be patched. */
+    void *new_addr;
+    void *old_addr;
+    uint32_t new_size;
+    uint32_t old_size;
+    uint8_t version;        /* MUST be XSPLICE_PAYLOAD_VERSION. */
+    uint8_t opaque[31];
+};
+typedef struct xsplice_patch_func xsplice_patch_func_t;
+#endif
+
 /*
  * Structure describing an ELF payload. Uniquely identifies the
  * payload. Should be human readable.
diff --git a/xen/include/xen/xsplice.h b/xen/include/xen/xsplice.h
index 857c264..c9723e4 100644
--- a/xen/include/xen/xsplice.h
+++ b/xen/include/xen/xsplice.h
@@ -11,12 +11,16 @@  struct xsplice_elf_sec;
 struct xsplice_elf_sym;
 struct xen_sysctl_xsplice_op;
 
+#include <xen/elfstructs.h>
 #ifdef CONFIG_XSPLICE
 
 /* Convenience define for printk. */
 #define XSPLICE             "xsplice: "
+/* ELF payload special section names. */
+#define ELF_XSPLICE_FUNC    ".xsplice.funcs"
 
 int xsplice_op(struct xen_sysctl_xsplice_op *);
+void check_for_xsplice_work(void);
 
 /* Arch hooks. */
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf);
@@ -39,6 +43,22 @@  enum va_type {
 int arch_xsplice_secure(const void *va, unsigned int pages, enum va_type types);
 
 void arch_xsplice_init(void);
+
+#include <public/sysctl.h> /* For struct xsplice_patch_func. */
+int arch_xsplice_verify_func(const struct xsplice_patch_func *func);
+/*
+ * These functions are called around the critical region patching live code,
+ * for an architecture to take make appropratie global state adjustments.
+ */
+void arch_xsplice_patching_enter(void);
+void arch_xsplice_patching_leave(void);
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func);
+void arch_xsplice_revert_jmp(const struct xsplice_patch_func *func);
+void arch_xsplice_post_action(void);
+
+void arch_xsplice_mask(void);
+void arch_xsplice_unmask(void);
 #else
 
 #include <xen/errno.h> /* For -ENOSYS */
@@ -47,6 +67,7 @@  static inline int xsplice_op(struct xen_sysctl_xsplice_op *op)
     return -ENOSYS;
 }
 
+static inline void check_for_xsplice_work(void) { };
 #endif /* CONFIG_XSPLICE */
 
 #endif /* __XEN_XSPLICE_H__ */