
[v5,11/28] xsplice: Implement support for applying/reverting/replacing patches.

Message ID 1458849640-22588-12-git-send-email-konrad.wilk@oracle.com (mailing list archive)
State New, archived

Commit Message

Konrad Rzeszutek Wilk March 24, 2016, 8 p.m. UTC
From: Ross Lagerwall <ross.lagerwall@citrix.com>

Implement support for the apply, revert and replace actions.

To perform an action on a payload, the hypercall sets up a data
structure to schedule the work.  A hook is added in all the
return-to-guest paths to check for work to do and execute it if needed.
In this way, patches can be applied with all CPUs idle and without
stacks.  The first CPU to do_xsplice() becomes the master and triggers a
reschedule softirq to trigger all the other CPUs to enter do_xsplice()
with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
and NMIs are ignored. The system is then quiescent and the master
performs the action.  After this, all CPUs enable IRQs and NMIs are
re-enabled.

Note that it is unsafe to patch do_nmi and the xSplice internal functions.
Patching functions on the NMI/MCE path is liable to end in disaster.

The action to perform is one of:
- APPLY: For each function in the module, store the first 5 bytes of the
  old function and replace it with a jump to the new function.
- REVERT: Copy the previously stored bytes into the first 5 bytes of the
  old function.
- REPLACE: Revert each applied module and then apply the new module.

To prevent a deadlock with any other barrier in the system, the master
will wait for up to 30ms before timing out.
Measurements found that patch application takes about 100 µs on a
72 CPU system, whether idle or fully loaded.

We also add a BUILD_BUG_ON to make sure that the size of the payload
structure is not inadvertently changed.

Lastly we unroll the 'vmap_to_page' on x86, as inside the macro there
is a possibility of a NULL pointer. Hence we unroll it with extra
ASSERTs. Note that asserts on non-debug builds are compiled out, hence
the extra checks that will just return (and leak memory).

We also add BUILD_BUG_ON checks for the structure, to make sure that the
offsets are correct on both 32- and 64-bit hypervisors (ARM32 and ARM64).

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

--
Cc: Stefano Stabellini <stefano.stabellini@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

v2: - Pluck the 'struct xsplice_patch_func' in this patch.
    - Modify code per review comments.
    - Add more data in the keyboard handler.
    - Redo the patching code, split it in functions.
v3: - Add return_ macro for debug builds.
    - Move s/payload_list_lock/payload_list/ to earlier patch
    - Remove const and use ELF types for xsplice_patch_func
    - Add check routine to do simple sanity checks for various
      sections.
    - s/%p/PRIx64/ as ARM builds complain.
    - Move code around. Add more dprintk. Add XSPLICE in front of all
      printks/dprintk.
      Put the NMIs back if we fail patching.
      Add per-cpu to lessen contention for global structure.
      Extract from xsplice_do_single patching code into xsplice_do_action
      Squash xsplice_do_single and check_for_xsplice_work together to
      have all rendezvous in one place.
      Made XSPLICE_ACTION_REPLACE work again (wrong list iterator)
      s/find_special_sections/prepare_payload/
      Use list_del_init and INIT_LIST_HEAD for applied_list
v4:
   - Add comment, adjust spacing for "Timed out on CPU semaphore"
   - Added CR0.WP manipulations when altering the .text of hypervisor.
   - Added fix from Andrew for CR0.WP manipulation.
v5: - Made xsplice_patch_func use uintXX_t instead of ELF_ types to ease
      making it work under ARM (32-bit). Add more BUILD_BUG_ON checks.
    - Add more BUILD_BUG_ON checks. Sprinkle newlines.
---
 docs/misc/xsplice.markdown  |   3 +-
 xen/arch/arm/xsplice.c      |  20 ++
 xen/arch/x86/domain.c       |   4 +
 xen/arch/x86/hvm/svm/svm.c  |   2 +
 xen/arch/x86/hvm/vmx/vmcs.c |   2 +
 xen/arch/x86/xsplice.c      |  40 ++++
 xen/common/xsplice.c        | 461 +++++++++++++++++++++++++++++++++++++++++++-
 xen/include/asm-arm/nmi.h   |  13 ++
 xen/include/xen/xsplice.h   |  31 ++-
 9 files changed, 565 insertions(+), 11 deletions(-)

Comments

Jan Beulich April 1, 2016, 1:28 p.m. UTC | #1
>>> On 24.03.16 at 21:00, <konrad.wilk@oracle.com> wrote:
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> Implement support for the apply, revert and replace actions.
> 
> To perform and action on a payload, the hypercall sets up a data
> structure to schedule the work.  A hook is added in all the
> return-to-guest paths to check for work to do and execute it if needed.

Looking at the diffstat alone I cannot see this being the case.
Perhaps it's just the description here which needs to become
more precise.

> In this way, patches can be applied with all CPUs idle and without
> stacks.  The first CPU to do_xsplice() becomes the master and triggers a
> reschedule softirq to trigger all the other CPUs to enter do_xsplice()
> with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
> and NMIs are ignored. The system is then quiscient and the master
> performs the action.  After this, all CPUs enable IRQs and NMIs are
> re-enabled.
> 
> Note that it is unsafe to patch do_nmi and the xSplice internal functions.
> Patching functions on NMI/MCE path is liable to end in disaster.

So what measures are (planned to be) taken to make sure this
doesn't happen by accident?

> The action to perform is one of:
> - APPLY: For each function in the module, store the first 5 bytes of the
>   old function and replace it with a jump to the new function.
> - REVERT: Copy the previously stored bytes into the first 5 bytes of the
>   old function.
> - REPLACE: Revert each applied module and then apply the new module.
> 
> To prevent a deadlock with any other barrier in the system, the master
> will wait for up to 30ms before timing out.
> Measurements found that the patch application to take about 100 µs on a
> 72 CPU system, whether idle or fully loaded.

That's for an individual patch I suppose? What if REPLACE has to
revert dozens or hundreds of patches?

> We also add an BUILD_ON to make sure that the size of the structure
> of the payload is not inadvertly changed.
> 
> Lastly we unroll the 'vmap_to_page' on x86 as inside the macro there
> is a posibility of a NULL pointer. Hence we unroll it with extra
> ASSERTS. Note that asserts on non-debug builds are compiled out hence
> the extra checks that will just return (and leak memory).

I'm afraid I can't really make sense of this: What does "unroll" mean
here? There's no loop involved. And where's that potential for NULL?

> --- a/docs/misc/xsplice.markdown
> +++ b/docs/misc/xsplice.markdown
> @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
>   * Be able to lookup in the Xen hypervisor the symbol names of functions from the ELF payload.
>   * Be able to patch .rodata, .bss, and .data sections.
>   * Further safety checks (blacklist of which functions cannot be patched, check
> -   the stack, make sure the payload is built with same compiler as hypervisor).
> +   the stack, make sure the payload is built with same compiler as hypervisor,
> +   and NMI/MCE handlers and do_nmi for right now - until an safe solution is found).

The whole thing doesn't parse anymore for me with the addition,
so I can't really conclude what you mean to say here (and hence
whether that addresses the earlier question).

> @@ -120,6 +121,7 @@ static void idle_loop(void)
>          (*pm_idle)();
>          do_tasklet();
>          do_softirq();
> +        check_for_xsplice_work(); /* Must be last. */
>      }
>  }
>  
> @@ -136,6 +138,7 @@ void startup_cpu_idle_loop(void)
>  
>  static void noreturn continue_idle_domain(struct vcpu *v)
>  {
> +    check_for_xsplice_work();
>      reset_stack_and_jump(idle_loop);
>  }

The placement here kind of contradicts the comment right above.
And anyway - why in both places? And why here and not in
common code, e.g. in do_softirq(), or - considering the further
addition below - inside reset_stack_and_jump() _after_ having
reset the stack (after all by doing it up front you do not really
meet your goal of doing the action "without stacks")?

> +void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
> +{
> +    uint32_t val;

The way it's being used below, it clearly should be int32_t.

> +    uint8_t *old_ptr;
> +
> +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> +
> +    old_ptr = (uint8_t *)func->old_addr;

(Considering this cast, the "old_addr" member should be
unsigned long (or void *), not uint64_t: The latest on ARM32
such would otherwise cause problems.)

Also - where is the verification that
func->old_size >= PATCH_INSN_SIZE?

> +void arch_xsplice_revert_jmp(struct xsplice_patch_func *func)

const

> --- a/xen/common/xsplice.c
> +++ b/xen/common/xsplice.c
> @@ -3,6 +3,7 @@
>   *
>   */
>  
> +#include <xen/cpu.h>
>  #include <xen/err.h>
>  #include <xen/guest_access.h>
>  #include <xen/keyhandler.h>
> @@ -11,17 +12,29 @@
>  #include <xen/mm.h>
>  #include <xen/sched.h>
>  #include <xen/smp.h>
> +#include <xen/softirq.h>
>  #include <xen/spinlock.h>
>  #include <xen/vmap.h>
> +#include <xen/wait.h>
>  #include <xen/xsplice_elf.h>
>  #include <xen/xsplice.h>
>  
>  #include <asm/event.h>
> +#include <asm/nmi.h>

Urgh?

>  #include <public/sysctl.h>
>  
> +/*
> + * Protects against payload_list operations and also allows only one
> + * caller in schedule_work.
> + */
>  static DEFINE_SPINLOCK(payload_lock);
>  static LIST_HEAD(payload_list);
>  
> +/*
> + * Patches which have been applied.
> + */

Comment style.

> +struct xsplice_work
> +{
> +    atomic_t semaphore;          /* Used for rendezvous. First to grab it will
> +                                    do the patching. */

"First to grab it" doesn't seem to make sense for an atomic_t variable.

> +    atomic_t irq_semaphore;      /* Used to signal all IRQs disabled. */
> +    uint32_t timeout;                    /* Timeout to do the operation. */
> +    struct payload *data;        /* The payload on which to act. */
> +    volatile bool_t do_work;     /* Signals work to do. */
> +    volatile bool_t ready;       /* Signals all CPUs synchronized. */
> +    uint32_t cmd;                /* Action request: XSPLICE_ACTION_* */
> +};

Please can you do without fixed size types when the fixed size
isn't really a requirement?

> +/* There can be only one outstanding patching action. */
> +static struct xsplice_work xsplice_work;
> +
> +/* Indicate whether the CPU needs to consult xsplice_work structure. */
> +static DEFINE_PER_CPU(bool_t, work_to_do);

Peeking ahead to the uses of this, I can't see why this is needed
alongside xsplice_work.do_work.

> @@ -296,6 +331,77 @@ static int secure_payload(struct payload *payload, struct xsplice_elf *elf)
>      return rc;
>  }
>  
> +static int check_special_sections(struct payload *payload,
> +                                  struct xsplice_elf *elf)

const (twice, but the first parameter appears to be unused anyway)

> +{
> +    unsigned int i;
> +    static const char *const names[] = { ".xsplice.funcs" };
> +
> +    for ( i = 0; i < ARRAY_SIZE(names); i++ )
> +    {
> +        struct xsplice_elf_sec *sec;

const

> +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> +        if ( !sec )
> +        {
> +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
> +                   XSPLICE, elf->name, names[i]);
> +            return -EINVAL;
> +        }
> +
> +        if ( !sec->sec->sh_size )
> +            return -EINVAL;
> +    }
> +
> +    return 0;
> +}

So you check for there being one such section. Is having multiple
of them okay / meaningful?

> +static int prepare_payload(struct payload *payload,
> +                           struct xsplice_elf *elf)
> +{
> +    struct xsplice_elf_sec *sec;
> +    unsigned int i;
> +    struct xsplice_patch_func *f;

const (multiple times, and I'm not going to further make this remark:
Please deal with this throughout the series)

> +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
> +    if ( sec )

Assuming check_special_sections() got invoked before you get here,
why the conditional?

> +    {
> +        if ( sec->sec->sh_size % sizeof *payload->funcs )
> +        {
> +            dprintk(XENLOG_DEBUG, "%s%s: Wrong size of .xsplice.funcs!\n",
> +                    XSPLICE, elf->name);
> +            return -EINVAL;
> +        }
> +
> +        payload->funcs = (struct xsplice_patch_func *)sec->load_addr;
> +        payload->nfuncs = sec->sec->sh_size / (sizeof *payload->funcs);

Since this repeats (so far I thought this was more like a mistake),
and despite being a matter of taste to some degree - may I ask
that the canonical sizeof() be used, instead of (sizeof ...) or
whatever else variants?

> +    }
> +
> +    for ( i = 0; i < payload->nfuncs; i++ )
> +    {
> +        unsigned int j;
> +
> +        f = &(payload->funcs[i]);
> +
> +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )

Isn't new_size == 0 a particularly easy to deal with case?

> @@ -499,6 +614,321 @@ static int xsplice_list(xen_sysctl_xsplice_list_t *list)
>      return rc ? : idx;
>  }
>  
> +/*
> + * The following functions get the CPUs into an appropriate state and
> + * apply (or revert) each of the payload's functions. This is needed
> + * for XEN_SYSCTL_XSPLICE_ACTION operation (see xsplice_action).
> + */
> +
> +static int apply_payload(struct payload *data)
> +{
> +    unsigned int i;
> +
> +    dprintk(XENLOG_DEBUG, "%s%s: Applying %u functions.\n", XSPLICE,
> +            data->name, data->nfuncs);
> +
> +    arch_xsplice_patching_enter();
> +
> +    for ( i = 0; i < data->nfuncs; i++ )
> +        arch_xsplice_apply_jmp(&data->funcs[i]);
> +
> +    arch_xsplice_patching_leave();
> +
> +    list_add_tail(&data->applied_list, &applied_list);
> +
> +    return 0;
> +}
> +
> +/*
> + * This function is executed having all other CPUs with no stack (we may
> + * have cpu_idle on it) and IRQs disabled.
> + */
> +static int revert_payload(struct payload *data)

Wouldn't the same comment apply to apply_payload()? I.e. is it
useful to have here but not there (i.e. isn't the comment ahead
of xsplice_do_action() sufficient)?

> +static void xsplice_do_action(void)
> +{
> +    int rc;
> +    struct payload *data, *other, *tmp;
> +
> +    data = xsplice_work.data;
> +    /* Now this function should be the only one on any stack.
> +     * No need to lock the payload list or applied list. */

Comment style.

> +    switch ( xsplice_work.cmd )
> +    {
> +    case XSPLICE_ACTION_APPLY:
> +        rc = apply_payload(data);
> +        if ( rc == 0 )
> +            data->state = XSPLICE_STATE_APPLIED;
> +        break;
> +
> +    case XSPLICE_ACTION_REVERT:
> +        rc = revert_payload(data);
> +        if ( rc == 0 )
> +            data->state = XSPLICE_STATE_CHECKED;
> +        break;
> +
> +    case XSPLICE_ACTION_REPLACE:
> +        rc = 0;
> +        /* N.B: Use 'applied_list' member, not 'list'. */
> +        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, applied_list )
> +        {
> +            other->rc = revert_payload(other);

Why does this set ->rc, but the two earlier ones only set the local
variable?

> +            if ( other->rc == 0 )
> +                other->state = XSPLICE_STATE_CHECKED;
> +            else
> +            {
> +                rc = -EINVAL;
> +                break;
> +            }
> +        }
> +
> +        if ( rc != -EINVAL )

Perhaps better "rc == 0"?

> +        {
> +            rc = apply_payload(data);
> +            if ( rc == 0 )
> +                data->state = XSPLICE_STATE_APPLIED;
> +        }
> +        break;
> +
> +    default:
> +        rc = -EINVAL;
> +        break;
> +    }
> +
> +    data->rc = rc;

Oh, here it is. But why would an xsplice_work.cmd (which probably
shouldn't make it here anyway) mark the payload having an error?
It didn't change state or anything after all.

> +/*
> + * MUST be holding the payload_lock.
> + */

comment style (another comment I'm not going to repeat any further)

> +static int schedule_work(struct payload *data, uint32_t cmd, uint32_t timeout)
> +{
> +    unsigned int cpu;
> +
> +    ASSERT(spin_is_locked(&payload_lock));

Also the comment above is clearly redundant with this ASSERT().

> +    /* Fail if an operation is already scheduled. */
> +    if ( xsplice_work.do_work )
> +        return -EBUSY;
> +
> +    if ( !get_cpu_maps() )
> +    {
> +        printk(XENLOG_ERR "%s%s: unable to get cpu_maps lock!\n",
> +               XSPLICE, data->name);
> +        return -EBUSY;
> +    }
> +
> +    xsplice_work.cmd = cmd;
> +    xsplice_work.data = data;
> +    xsplice_work.timeout = timeout ?: MILLISECS(30);
> +
> +    dprintk(XENLOG_DEBUG, "%s%s: timeout is %"PRI_stime"ms\n",
> +            XSPLICE, data->name, xsplice_work.timeout / MILLISECS(1));
> +
> +    /*
> +     * Once the patching has been completed, the semaphore value will
> +     * be num_online_cpus()-1.
> +     */

Is this comment really useful? I.e. is that final value meaningful
for anything?

> +    atomic_set(&xsplice_work.semaphore, -1);
> +    atomic_set(&xsplice_work.irq_semaphore, -1);
> +
> +    xsplice_work.ready = 0;
> +    smp_wmb();
> +    xsplice_work.do_work = 1;
> +    smp_wmb();
> +    /*
> +     * Above smp_wmb() gives us an compiler barrier, as we MUST do this
> +     * after setting the global structure.
> +     */

"a compiler barrier"

> +static int mask_nmi_callback(const struct cpu_user_regs *regs, int cpu)
> +{
> +    /* TODO: Handle missing NMI/MCE.*/
> +    return 1;
> +}

Aren't we in common code here?

> +void check_for_xsplice_work(void)
> +{
> +    unsigned int cpu = smp_processor_id();
> +    nmi_callback_t saved_nmi_callback;

This again looks very x86-ish.

> +    s_time_t timeout;
> +    unsigned long flags;
> +
> +    /* Fast path: no work to do. */
> +    if ( !per_cpu(work_to_do, cpu ) )
> +        return;
> +
> +    /* In case we aborted, other CPUs can skip right away. */
> +    if ( (!xsplice_work.do_work) )

Stray parentheses.

> +    {
> +        per_cpu(work_to_do, cpu) = 0;
> +        return;
> +    }
> +
> +    ASSERT(local_irq_is_enabled());
> +
> +    /* Set at -1, so will go up to num_online_cpus - 1. */
> +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
> +    {
> +        struct payload *p;
> +        unsigned int total_cpus;
> +
> +        p = xsplice_work.data;
> +        if ( !get_cpu_maps() )
> +        {
> +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
> +                   XSPLICE, p->name, cpu);
> +            per_cpu(work_to_do, cpu) = 0;
> +            xsplice_work.data->rc = -EBUSY;
> +            xsplice_work.do_work = 0;

On x86 such possibly simultaneous accesses may be okay, but is
that universally the case? Wouldn't it better be only the monarch
which updates shared data?

> +            /*
> +             * Do NOT decrement semaphore down - as that may cause the other
> +             * CPU (which may be at this exact moment checking the ASSERT)

Which ASSERT()? (I guess the one a few lines up - but does that
matter?)

> +             * to assume the role of master and then needlessly time out
> +             * out (as do_work is zero).
> +             */
> +            return;
> +        }
> +
> +        barrier(); /* MUST do it after get_cpu_maps. */
> +        total_cpus = num_online_cpus() - 1;

Considering this the variable (and earlier function parameter) is
misnamed.

> +        if ( total_cpus )
> +        {
> +            dprintk(XENLOG_DEBUG, "%s%s: CPU%u - IPIing the %u CPUs\n",

Not being a native speaker, I've previously observed too many
articles - the "the" here again looks quite odd, or there is an
"other" missing.

> +                    XSPLICE, p->name, cpu, total_cpus);
> +            smp_call_function(reschedule_fn, NULL, 0);
> +        }
> +
> +        timeout = xsplice_work.timeout + NOW();
> +        if ( xsplice_do_wait(&xsplice_work.semaphore, timeout, total_cpus,
> +                             "Timed out on CPU semaphore") )
> +            goto abort;
> +
> +        /* "Mask" NMIs. */
> +        saved_nmi_callback = set_nmi_callback(mask_nmi_callback);

So what if an NMI got raised on a remote CPU right before you set
this? I.e. doesn't this need to move earlier?

> +        /* All CPUs are waiting, now signal to disable IRQs. */
> +        xsplice_work.ready = 1;
> +        smp_wmb();
> +
> +        atomic_inc(&xsplice_work.irq_semaphore);
> +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
> +                              "Timed out on IRQ semaphore") )

I'd prefer if the common parts of that message moved into
xsplice_do_wait() - no reason to have more string literal space
occupied than really needed. Also, looking back, the respective
function parameter could do with a more descriptive name.

And then - does it make sense to wait the perhaps full 30ms
on this second step? Rendezvoused CPUs should react rather
quickly to the signal to disable interrupts.

> +        {
> +            local_irq_save(flags);
> +            /* Do the patching. */
> +            xsplice_do_action();
> +            /* To flush out pipeline. */
> +            arch_xsplice_post_action();

The comment needs to become more generic, and maybe the
function name more specific.

> +            local_irq_restore(flags);
> +        }
> +        set_nmi_callback(saved_nmi_callback);
> +
> + abort:
> +        per_cpu(work_to_do, cpu) = 0;
> +        xsplice_work.do_work = 0;
> +
> +        smp_wmb(); /* Synchronize with waiting CPUs. */

What "synchronization" does this barrier do?

> +        ASSERT(local_irq_is_enabled());

Is there really anything between the earlier identical ASSERT() and
this one which could leave interrupts off?

> +        put_cpu_maps();
> +
> +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
> +               p->name, p->rc);

And no record left of what was done with that payload?

> +    }
> +    else
> +    {
> +        /* Wait for all CPUs to rendezvous. */
> +        while ( xsplice_work.do_work && !xsplice_work.ready )
> +        {
> +            cpu_relax();
> +            smp_rmb();

What is the barrier doing inside this and the other loop below?

> @@ -635,15 +1063,30 @@ static void xsplice_printall(unsigned char key)
>      }
>  
>      list_for_each_entry ( data, &payload_list, list )
> +    {
>          printk(" name=%s state=%s(%d) %p (.data=%p, .rodata=%p) using %zu pages.\n",
>                 data->name, state2str(data->state), data->state, data->text_addr,
>                 data->rw_addr, data->ro_addr, data->payload_pages);
>  
> +        for ( i = 0; i < data->nfuncs; i++ )
> +        {
> +            struct xsplice_patch_func *f = &(data->funcs[i]);
> +            printk("    %s patch 0x%"PRIx64"(%u) with 0x%"PRIx64"(%u)\n",

%# again please.

> +                   f->name, f->old_addr, f->old_size, f->new_addr, f->new_size);
> +            if ( !(i % 100) )

Using a power of 2 in situations like this is going to produce both
smaller and faster code (the faster aspect may not matter here,
but anyway). Also I don't think you want to do this on the first
iteration.

> +                process_pending_softirqs();

With a lock held?

>  static int __init xsplice_init(void)
>  {
> +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
> +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
> +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );

What assumptions get broken if these sizes change?

> --- a/xen/include/xen/xsplice.h
> +++ b/xen/include/xen/xsplice.h
> @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
>  struct xsplice_elf_sym;
>  struct xen_sysctl_xsplice_op;
>  
> +#include <xen/elfstructs.h>
> +/*
> + * The structure which defines the patching. This is what the hypervisor
> + * expects in the '.xsplice.func' section of the ELF file.
> + *
> + * This MUST be in sync with what the tools generate.

In which case this need to be part of the public interface, I would
say.

> + */
> +struct xsplice_patch_func {
> +    const char *name;
> +    uint64_t new_addr;
> +    uint64_t old_addr;
> +    uint32_t new_size;
> +    uint32_t old_size;
> +    uint8_t undo[8];

Shouldn't the size of this array be a per-arch constant?

> +    uint8_t pad[24];

What is this padding field good for?

Jan
Konrad Rzeszutek Wilk April 1, 2016, 9:04 p.m. UTC | #2
Hey Jan,

Thank you for your review!
Please see below my answers.

Ross,

Could you answer one question below please.

On Fri, Apr 01, 2016 at 07:28:00AM -0600, Jan Beulich wrote:
> >>> On 24.03.16 at 21:00, <konrad.wilk@oracle.com> wrote:
> > From: Ross Lagerwall <ross.lagerwall@citrix.com>
> > 
> > Implement support for the apply, revert and replace actions.
> > 
> > To perform and action on a payload, the hypercall sets up a data
> > structure to schedule the work.  A hook is added in all the
> > return-to-guest paths to check for work to do and execute it if needed.
> 
> Looking at the diffstat alone I cannot see this being the case.
> Perhaps it's just the description here which needs to become
> more precise.

s/all the/common/ ?

> 
> > In this way, patches can be applied with all CPUs idle and without
> > stacks.  The first CPU to do_xsplice() becomes the master and triggers a
> > reschedule softirq to trigger all the other CPUs to enter do_xsplice()
> > with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
> > and NMIs are ignored. The system is then quiscient and the master
> > performs the action.  After this, all CPUs enable IRQs and NMIs are
> > re-enabled.
> > 
> > Note that it is unsafe to patch do_nmi and the xSplice internal functions.
> > Patching functions on NMI/MCE path is liable to end in disaster.
> 
> So what measures are (planned to be) taken to make sure this
> doesn't happen by accident?

None in this patch.

Please keep in mind that this issue is not only a problem with
xSplice but also with alternative assembler patching.

I haven't figured out a nice way to take care of this and was hoping
to brainstorm that. For right now this part is left out as 'TODO'.

> 
> > The action to perform is one of:
> > - APPLY: For each function in the module, store the first 5 bytes of the
> >   old function and replace it with a jump to the new function.
> > - REVERT: Copy the previously stored bytes into the first 5 bytes of the
> >   old function.
> > - REPLACE: Revert each applied module and then apply the new module.
> > 
> > To prevent a deadlock with any other barrier in the system, the master
> > will wait for up to 30ms before timing out.
> > Measurements found that the patch application to take about 100 µs on a
> > 72 CPU system, whether idle or fully loaded.
> 
> That's for an individual patch I suppose? What if REPLACE has to
> revert dozens or hundreds of patches?

User-space can change the timeout to a larger value.
> 
> > We also add an BUILD_ON to make sure that the size of the structure
> > of the payload is not inadvertly changed.
> > 
> > Lastly we unroll the 'vmap_to_page' on x86 as inside the macro there
> > is a posibility of a NULL pointer. Hence we unroll it with extra
> > ASSERTS. Note that asserts on non-debug builds are compiled out hence
> > the extra checks that will just return (and leak memory).
> 
> I'm afraid I can't really make sense of this: What does "unroll" mean
> here? There's no loop involved. And where's that potential for NULL?

That is a very stale comment. That should have been deleted!

> 
> > --- a/docs/misc/xsplice.markdown
> > +++ b/docs/misc/xsplice.markdown
> > @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
> >   * Be able to lookup in the Xen hypervisor the symbol names of functions from the ELF payload.
> >   * Be able to patch .rodata, .bss, and .data sections.
> >   * Further safety checks (blacklist of which functions cannot be patched, check
> > -   the stack, make sure the payload is built with same compiler as hypervisor).
> > +   the stack, make sure the payload is built with same compiler as hypervisor,
> > +   and NMI/MCE handlers and do_nmi for right now - until an safe solution is found).
> 
> The whole thing doesn't parse anymore for me with the addition,
> so I can't really conclude what you mean to say here (and hence
> whether that addresses the earlier question).

It is adding handling of NMI/MCE on the 'TODO' list so that I don't forget
about it.

> 
> > @@ -120,6 +121,7 @@ static void idle_loop(void)
> >          (*pm_idle)();
> >          do_tasklet();
> >          do_softirq();
> > +        check_for_xsplice_work(); /* Must be last. */
> >      }
> >  }
> >  
> > @@ -136,6 +138,7 @@ void startup_cpu_idle_loop(void)
> >  
> >  static void noreturn continue_idle_domain(struct vcpu *v)
> >  {
> > +    check_for_xsplice_work();
> >      reset_stack_and_jump(idle_loop);
> >  }
> 
> The placement here kind of contradicts the comment right above.
> And anyway - why in both places? And why here and not in
> common code, e.g. in do_softirq(), or - considering the further
> addition below - inside reset_stack_and_jump() _after_ having
> reset the stack (after all by doing it up front you do not really
> meet your goal of doing the action "without stacks")?

Ross?

> 
> > +void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
> > +{
> > +    uint32_t val;
> 
> The way it's being used below, it clearly should be int32_t.
> 
> > +    uint8_t *old_ptr;
> > +
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > +
> > +    old_ptr = (uint8_t *)func->old_addr;
> 
> (Considering this cast, the "old_addr" member should be
> unsigned long (or void *), not uint64_t: The latest on ARM32
> such would otherwise cause problems.)
> 
> Also - where is the verification that
> func->old_size >= PATCH_INSN_SIZE?

Missing..
..snip..
> > +/* There can be only one outstanding patching action. */
> > +static struct xsplice_work xsplice_work;
> > +
> > +/* Indicate whether the CPU needs to consult xsplice_work structure. */
> > +static DEFINE_PER_CPU(bool_t, work_to_do);
> 
> Peeking ahead to the uses of this, I can't see why this is needed
> alongside xsplice_work.do_work.

Andrew asked for this - he noticed that we pound on the xsplice_work
structure quite often across a lot of CPUs. Instead of pounding on the same
cache line - we could just read from a per-cpu cache line. Hence this addition.

> 
> > @@ -296,6 +331,77 @@ static int secure_payload(struct payload *payload, struct xsplice_elf *elf)
> >      return rc;
> >  }
> >  
> > +static int check_special_sections(struct payload *payload,
> > +                                  struct xsplice_elf *elf)
> 
> const (twice, but the first parameter appears to be unused anyway)
> 
> > +{
> > +    unsigned int i;
> > +    static const char *const names[] = { ".xsplice.funcs" };
> > +
> > +    for ( i = 0; i < ARRAY_SIZE(names); i++ )
> > +    {
> > +        struct xsplice_elf_sec *sec;
> 
> const
> 
> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> > +        if ( !sec )
> > +        {
> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
> > +                   XSPLICE, elf->name, names[i]);
> > +            return -EINVAL;
> > +        }
> > +
> > +        if ( !sec->sec->sh_size )
> > +            return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> So you check for there being one such section. Is having multiple
> of them okay / meaningful?

/me blinks. You can have multiple ELF sections with the same name?
I will double-check the spec over the weekend to see.

.. snip..
> > +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
> > +    if ( sec )
> 
> Assuming check_special_sections() got invoked before you get here,
> why the conditional?

Andrew asked for it. Albeit it is pointless, as 'check_special_sections'
will halt the process if this section is not found.

> 
> > +    {
> > +        if ( sec->sec->sh_size % sizeof *payload->funcs )
> > +        {
> > +            dprintk(XENLOG_DEBUG, "%s%s: Wrong size of .xsplice.funcs!\n",
> > +                    XSPLICE, elf->name);
> > +            return -EINVAL;
> > +        }
> > +
> > +        payload->funcs = (struct xsplice_patch_func *)sec->load_addr;
> > +        payload->nfuncs = sec->sec->sh_size / (sizeof *payload->funcs);
> 
> Since this repeats (so far I thought this was more like a mistake),
> and despite being a matter of taste to some degree - may I ask
> that the canonical sizeof() be used, instead of (sizeof ...) or
> whatever else variants?

Sure. I will convert Ross's use of it.
> 
> > +    }
> > +
> > +    for ( i = 0; i < payload->nfuncs; i++ )
> > +    {
> > +        unsigned int j;
> > +
> > +        f = &(payload->funcs[i]);
> > +
> > +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )
> 
> Isn't new_size == 0 a particularly easy to deal with case?

To NOP out the function? No. I would have to include the headers for the nop[]
arrays in the arch-specific xSplice code and expand its patching (or expose the text_poke code).

That will grow this patch which is big enough. I don't mind doing it in further patches
(and I believe it is mentioned as a TODO in the design doc so that I won't forget).

.. snip..
> > +    case XSPLICE_ACTION_REPLACE:
> > +        rc = 0;
> > +        /* N.B: Use 'applied_list' member, not 'list'. */
> > +        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, applied_list )
> > +        {
> > +            other->rc = revert_payload(other);
> 
> Why does this set ->rc, but the two earlier ones only set the local
> variable?
> 
> > +            if ( other->rc == 0 )
> > +                other->state = XSPLICE_STATE_CHECKED;
> > +            else
> > +            {
> > +                rc = -EINVAL;
> > +                break;
> > +            }
> > +        }
> > +
> > +        if ( rc != -EINVAL )
> 
> Perhaps better "rc == 0"?
> 
> > +        {
> > +            rc = apply_payload(data);
> > +            if ( rc == 0 )
> > +                data->state = XSPLICE_STATE_APPLIED;
> > +        }
> > +        break;
> > +
> > +    default:
> > +        rc = -EINVAL;
> > +        break;
> > +    }
> > +
> > +    data->rc = rc;
> 
> Oh, here it is. But why would an xsplice_work.cmd (which probably
> shouldn't make it here anyway) mark the payload having an error?

In xsplice_action() we set data->rc to -EAGAIN right before we kick
off the schedule_work(). That way the user can check the 'status'
of the patching as we attempt to do it.

But once the patching has been complete we MUST set it to zero
(or to an error if the patching failed).
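A minimal sketch of that rc lifecycle as described above (this is an illustration, not the actual Xen code - the struct and function names here are hypothetical):

```c
#include <assert.h>
#include <errno.h>

/* Sketch: the payload's rc field doubles as a progress indicator.
 * Userspace polling the status sees -EAGAIN while the work is queued,
 * and the final result (0 or a negative errno) once patching is done. */
struct payload_sketch {
    int rc;
};

static void schedule_patch_work(struct payload_sketch *p)
{
    p->rc = -EAGAIN;   /* "in progress", visible to the toolstack */
    /* ... raise the scheduling softirq here ... */
}

static void complete_patch_work(struct payload_sketch *p, int result)
{
    p->rc = result;    /* MUST be set on completion: 0 or an error */
}
```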

> It didn't change state or anything after all.
.. snip..
> > +void check_for_xsplice_work(void)
> > +{
> > +    /* Set at -1, so will go up to num_online_cpus - 1. */
> > +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
> > +    {
> > +        struct payload *p;
> > +        unsigned int total_cpus;
> > +
> > +        p = xsplice_work.data;
> > +        if ( !get_cpu_maps() )
> > +        {
> > +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
> > +                   XSPLICE, p->name, cpu);
> > +            per_cpu(work_to_do, cpu) = 0;
> > +            xsplice_work.data->rc = -EBUSY;
> > +            xsplice_work.do_work = 0;
> 
> On x86 such possibly simultaneous accesses may be okay, but is
> that universally the case? Wouldn't it better be only the monarch
> which updates shared data?

Monarch? Oh you mean arch specific code path?

> 
> > +            /*
> > +             * Do NOT decrement semaphore down - as that may cause the other
> > +             * CPU (which may be at this exact moment checking the ASSERT)
> 
> Which ASSERT()? (I guess the one a few lines up - but does that
> matter?)

It messed me up when I was working on it, so I thought I would leave that in here
in case anybody wants to modify the code. Or more likely - so that if I rework
this code I will recall this.

> > +        /* "Mask" NMIs. */
> > +        saved_nmi_callback = set_nmi_callback(mask_nmi_callback);
> 
> So what if an NMI got raised on a remote CPU right before you set
> this? I.e. doesn't this need to move earlier?

Hmm. Yes we should. Thanks!
> 
> > +        /* All CPUs are waiting, now signal to disable IRQs. */
> > +        xsplice_work.ready = 1;
> > +        smp_wmb();
> > +
> > +        atomic_inc(&xsplice_work.irq_semaphore);
> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
> > +                              "Timed out on IRQ semaphore") )
> 
> I'd prefer if the common parts of that message moved into
> xsplice_do_wait() - no reason to have more string literal space
> occupied than really needed. Also, looking back, the respective
> function parameter could do with a more descriptive name.
> 
> And then - does it make sense to wait the perhaps full 30ms
> on this second step? Rendezvoused CPUs should react rather
> quickly to the signal to disable interrupts.

We don't reset the timeout - the timeout is for both calls
to xsplice_do_wait.
> 
> > +        {
> > +            local_irq_save(flags);
> > +            /* Do the patching. */
> > +            xsplice_do_action();
> > +            /* To flush out pipeline. */
> > +            arch_xsplice_post_action();
> 
> The comment needs to become more generic, and maybe the
> function name more specific.

Serialize CPUs? Flush CPU pipeline? Flush CPU i-cache?
> 
> > +            local_irq_restore(flags);
> > +        }
> > +        set_nmi_callback(saved_nmi_callback);
> > +
> > + abort:
> > +        per_cpu(work_to_do, cpu) = 0;
> > +        xsplice_work.do_work = 0;
> > +
> > +        smp_wmb(); /* Synchronize with waiting CPUs. */
> 
> What "synchronization" does this barrier do?

For the ->do_work being set to zero - as the other CPUs are repeatedly reading it.

> 
> > +        ASSERT(local_irq_is_enabled());
> 
> Is there really anything between the earlier identical ASSERT() and
> this one which could leave interrupts off?

The patching calls (in later patches) - the load and unload hooks. Those
could inadvertently enable interrupts.

> 
> > +        put_cpu_maps();
> > +
> > +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
> > +               p->name, p->rc);
> 
> And no record left of what was done with that payload?

? But it does. The rc is mentioned.. Or are you saying that it should
also say whether it was applied/reverted/replaced, etc?

> 
> > +    }
> > +    else
> > +    {
> > +        /* Wait for all CPUs to rendezvous. */
> > +        while ( xsplice_work.do_work && !xsplice_work.ready )
> > +        {
> > +            cpu_relax();
> > +            smp_rmb();
> 
> What is the barrier doing inside this and the other loop below?

The goal is to get the updated value from ->do_work and ->ready.

That is, to do the same thing we do with frontends and backends - sync
out the rp/rc so that the other end can read the updated value.

.. snip..
> >  static int __init xsplice_init(void)
> >  {
> > +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
> 
> What assumptions get broken if these sizes change?

I think you mean the offsets? They can change. I added them to make sure
that on 32-bit hypervisor (ARM) the offsets would be the same as on 64-bit
hypervisor (x86).

The size however MUST remain the same - otherwise the toolstack won't produce
proper payloads anymore.

> 
> > --- a/xen/include/xen/xsplice.h
> > +++ b/xen/include/xen/xsplice.h
> > @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
> >  struct xsplice_elf_sym;
> >  struct xen_sysctl_xsplice_op;
> >  
> > +#include <xen/elfstructs.h>
> > +/*
> > + * The structure which defines the patching. This is what the hypervisor
> > + * expects in the '.xsplice.func' section of the ELF file.
> > + *
> > + * This MUST be in sync with what the tools generate.
> 
> In which case this need to be part of the public interface, I would
> say.

Ah yes. Where should it go? sysctl.h is where the XSPLICE ops are - or
should this have its own public/xsplice.h header file?
> 
> > + */
> > +struct xsplice_patch_func {
> > +    const char *name;
> > +    uint64_t new_addr;
> > +    uint64_t old_addr;
> > +    uint32_t new_size;
> > +    uint32_t old_size;
> > +    uint8_t undo[8];
> 
> Shouldn't the size of this array be a per-arch constant?

Yes.
> 
> > +    uint8_t pad[24];
> 
> What is this padding field good for?

I want the size of the structure to be exactly 64 bytes across
all architectures. That way I don't have to mess with the compat layer
or have the design doc differ for ARM vs x86.

It gives us enough space to stash other 'private' fields we may want,
or even to expand this structure in the future and fill it out
with extra fields. Argh, but for that we need a version field
somewhere... Maybe another section .xsplice_version which just
has a value of 1?

However the one thing I am not sure about is the padding.

The 'const char *name' on 32-bit ARM will be 4 bytes, hence
the compiler will insert 4 bytes of padding (at least that is what pahole tells me,
and the BUILD_BUG_ON confirms). But if I build with clang or another
compiler it may well lay the structure out differently.

Any suggestions? I could do:

struct xsplice_patch_func {
    const char *name;
#ifdef CONFIG_32
    uint32_t _name_pad;
#endif
    uint64_t new_addr;
    uint64_t old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t undo[8];
    uint8_t pad[24];
}

But that is not very nice.

Also if we make this a public header I can't very well expose the
'undo' internals - it ought to be opaque. Ideas?

Thanks!
Jan Beulich April 4, 2016, 7:07 a.m. UTC | #3
>>> On 01.04.16 at 23:04, <konrad.wilk@oracle.com> wrote:
> On Fri, Apr 01, 2016 at 07:28:00AM -0600, Jan Beulich wrote:
>> >>> On 24.03.16 at 21:00, <konrad.wilk@oracle.com> wrote:
>> > From: Ross Lagerwall <ross.lagerwall@citrix.com>
>> > 
>> > Implement support for the apply, revert and replace actions.
>> > 
>> > To perform and action on a payload, the hypercall sets up a data
>> > structure to schedule the work.  A hook is added in all the
>> > return-to-guest paths to check for work to do and execute it if needed.
>> 
>> Looking at the diffstat alone I cannot see this being the case.
>> Perhaps it's just the description here which needs to become
>> more precise.
> 
> s/all the/common/ ?

Considering that none of the entry.S files get touched, I think it's
the "return-to-guest paths" part which is wrong.

>> > In this way, patches can be applied with all CPUs idle and without
>> > stacks.  The first CPU to do_xsplice() becomes the master and triggers a
>> > reschedule softirq to trigger all the other CPUs to enter do_xsplice()
>> > with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
>> > and NMIs are ignored. The system is then quiscient and the master
>> > performs the action.  After this, all CPUs enable IRQs and NMIs are
>> > re-enabled.
>> > 
>> > Note that it is unsafe to patch do_nmi and the xSplice internal functions.
>> > Patching functions on NMI/MCE path is liable to end in disaster.
>> 
>> So what measures are (planned to be) taken to make sure this
>> doesn't happen by accident?
> 
> None in this patch.
> 
> Please keep in mind that this issue is not only a problem with
> xSplice but also with alternative assembler patching.

Except that an issue there would cause only a boot failure, which
would be unlikely to re-occur on the next boot attempt. Whereas
things going wrong at runtime would likely mean worse
consequences.

> I haven't figured out a nice way to take care of this and was hoping
> to brainstorm that. For right now this part is left out as 'TODO'.

That's what I've assumed, which is fine for the moment. Just maybe
make this more explicit by adding (to the above) something like "For
now this needs to be ensured by the people creating patches."

>> > The action to perform is one of:
>> > - APPLY: For each function in the module, store the first 5 bytes of the
>> >   old function and replace it with a jump to the new function.
>> > - REVERT: Copy the previously stored bytes into the first 5 bytes of the
>> >   old function.
>> > - REPLACE: Revert each applied module and then apply the new module.
>> > 
>> > To prevent a deadlock with any other barrier in the system, the master
>> > will wait for up to 30ms before timing out.
>> > Measurements found that the patch application to take about 100 ?s on a
>> > 72 CPU system, whether idle or fully loaded.
>> 
>> That's for an individual patch I suppose? What if REPLACE has to
>> revert dozens or hundreds of patches?
> 
> The user-space can change the timeout to a larger value. 

The question wasn't about the timeout expiring, but about the latency
of the operation (and the resulting effect on the system).

>> > --- a/docs/misc/xsplice.markdown
>> > +++ b/docs/misc/xsplice.markdown
>> > @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
>> >   * Be able to lookup in the Xen hypervisor the symbol names of functions from the ELF payload.
>> >   * Be able to patch .rodata, .bss, and .data sections.
>> >   * Further safety checks (blacklist of which functions cannot be patched, check
>> > -   the stack, make sure the payload is built with same compiler as hypervisor).
>> > +   the stack, make sure the payload is built with same compiler as hypervisor,
>> > +   and NMI/MCE handlers and do_nmi for right now - until an safe solution is found).
>> 
>> The whole thing doesn't parse anymore for me with the addition,
>> so I can't really conclude what you mean to say here (and hence
>> whether that addresses the earlier question).
> 
> It is adding handling of NMI/MCE on the 'TODO' list so that I don't forget
> about it.

Please try to re-word the addition then to make the result readable
again.

>> > +/* There can be only one outstanding patching action. */
>> > +static struct xsplice_work xsplice_work;
>> > +
>> > +/* Indicate whether the CPU needs to consult xsplice_work structure. */
>> > +static DEFINE_PER_CPU(bool_t, work_to_do);
>> 
>> Peeking ahead to the uses of this, I can't see why this is needed
>> alongside xsplice_work.do_work.
> 
> Andrew asked for this - he noticed that we pound on the xsplice_work
> structure quite often across a lot of CPUs. Instead of pounding on the same
> cache line - we could just read from a per-cpu cache line. Hence this 
> addition.

Then please make the comment say so in a more explicit way.

>> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
>> > +        if ( !sec )
>> > +        {
>> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
>> > +                   XSPLICE, elf->name, names[i]);
>> > +            return -EINVAL;
>> > +        }
>> > +
>> > +        if ( !sec->sec->sh_size )
>> > +            return -EINVAL;
>> > +    }
>> > +
>> > +    return 0;
>> > +}
>> 
>> So you check for there being one such section. Is having multiple
>> of them okay / meaningful?
> 
> /me blinks. You can have multiple ELF sections with the same name?
> I will double-check the spec over the weekend to see.

Of course you can. Remember me telling you that using section
names for identification is a bad idea (even if widely done that
way)? That's precisely because section names in the spec serve
no meaning other than making sections identifiable to the human
eye. For certain sections there's a more or less strict convention
on what their names should be, but that's all there is to names.
Section types and attributes really are what describe their
purposes.

And even if that really was forbidden by the spec, you'd still need
to make sure the binary meets the spec.
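To make the point concrete, a loader must not assume a section name is unique. A minimal sketch of counting matches rather than taking the first hit (tiny stand-in types; a real implementation would use the elfstructs.h definitions and check sh_type/sh_flags as well):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature section descriptor, for illustration only. */
struct sec_sketch {
    const char *name;
    uint64_t size;
};

/* ELF permits several sections with the same name, so count them all
 * instead of stopping at the first match. */
static unsigned int count_sections(const struct sec_sketch *s, unsigned int n,
                                   const char *name)
{
    unsigned int hits = 0;

    for (unsigned int i = 0; i < n; i++)
        if (!strcmp(s[i].name, name))
            hits++;
    return hits;
}

/* Sample table with a duplicated name, as the spec allows. */
static const struct sec_sketch sample[] = {
    { ".xsplice.funcs", 64 },
    { ".text", 128 },
    { ".xsplice.funcs", 64 },
};
```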

>> > +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
>> > +    if ( sec )
>> 
>> Assuming check_special_sections() got invoked before you get here,
>> why the conditional?
> 
> Andrew asked for it. Albeit it is pointless as 'check_special_sections'
> will halt the process if this section is not found.

It's not clear why he would have wanted the check here. Imo, if
anything this should be an ASSERT() then.

>> > +    }
>> > +
>> > +    for ( i = 0; i < payload->nfuncs; i++ )
>> > +    {
>> > +        unsigned int j;
>> > +
>> > +        f = &(payload->funcs[i]);
>> > +
>> > +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )
>> 
>> Isn't new_size == 0 a particularly easy to deal with case?
> 
> To NOP func? No. I would have to include the headers for the nop[] in the 
> arch xSplice
> and expand on its patching (or expose the text_poke code).
> 
> That will grow this patch which is big enough. I don't mind doing it in 
> further patches
> (and I believe it is mentioned as a TODO in the design doc so that I won't 
> forget).

That's fine, except for one aspect: The error value resulting from
this: EINVAL is appropriate only for invalid input. When valid input
just doesn't get handled suitably, EOPNOTSUPP or some such should
be used instead.

>> > +    default:
>> > +        rc = -EINVAL;
>> > +        break;
>> > +    }
>> > +
>> > +    data->rc = rc;
>> 
>> Oh, here it is. But why would an xsplice_work.cmd (which probably
>> shouldn't make it here anyway) mark the payload having an error?
> 
> In xsplice_action() we set data->rc to -EAGAIN right before we kick
> off the schedule_work(). That way the user can check the 'status'
> of the patching as we attempt to do it.
> 
> But once the patching has been complete we MUST set it to zero
> (or to an error if the patching failed).

Okay, makes sense. But I realize a word was missing from my reply.
I meant "But why would an invalid xsplice_work.cmd (which probably
shouldn't make it here anyway) mark the payload having an error?"

I.e. the question particularly is about the default case (with the
suggestion to just ditch it, or make in an ASSERT_UNREACHABLE()).

>> It didn't change state or anything after all.
> .. snip..
>> > +void check_for_xsplice_work(void)
>> > +{
>> > +    /* Set at -1, so will go up to num_online_cpus - 1. */
>> > +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
>> > +    {
>> > +        struct payload *p;
>> > +        unsigned int total_cpus;
>> > +
>> > +        p = xsplice_work.data;
>> > +        if ( !get_cpu_maps() )
>> > +        {
>> > +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
>> > +                   XSPLICE, p->name, cpu);
>> > +            per_cpu(work_to_do, cpu) = 0;
>> > +            xsplice_work.data->rc = -EBUSY;
>> > +            xsplice_work.do_work = 0;
>> 
>> On x86 such possibly simultaneous accesses may be okay, but is
>> that universally the case? Wouldn't it better be only the monarch
>> which updates shared data?
> 
> Monarch? Oh you mean arch specific code path?

Oh, I think you call it "master" elsewhere. IA64 terminology which
I came to like.

>> > +        /* All CPUs are waiting, now signal to disable IRQs. */
>> > +        xsplice_work.ready = 1;
>> > +        smp_wmb();
>> > +
>> > +        atomic_inc(&xsplice_work.irq_semaphore);
>> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
>> > +                              "Timed out on IRQ semaphore") )
>> 
>> I'd prefer if the common parts of that message moved into
>> xsplice_do_wait() - no reason to have more string literal space
>> occupied than really needed. Also, looking back, the respective
>> function parameter could do with a more descriptive name.
>> 
>> And then - does it make sense to wait the perhaps full 30ms
>> on this second step? Rendezvoused CPUs should react rather
>> quickly to the signal to disable interrupts.
> 
> We don't reset the timeout - the timeout is for both calls
> to xsplice_do_wait.

I understand that's the way it's right now, but that's what I'm putting
under question. Rendezvousing CPUs is quite a bit more at risk of
taking some time compared to having already rendezvoused CPUs
disable their IRQs.

>> 
>> > +        {
>> > +            local_irq_save(flags);
>> > +            /* Do the patching. */
>> > +            xsplice_do_action();
>> > +            /* To flush out pipeline. */
>> > +            arch_xsplice_post_action();
>> 
>> The comment needs to become more generic, and maybe the
>> function name more specific.
> 
> Serialize CPUs? Flush CPU pipeline? Flush CPU i-cache?

Perhaps mention all of these as examples of what may need doing
in that hook.

>> > +            local_irq_restore(flags);
>> > +        }
>> > +        set_nmi_callback(saved_nmi_callback);
>> > +
>> > + abort:
>> > +        per_cpu(work_to_do, cpu) = 0;
>> > +        xsplice_work.do_work = 0;
>> > +
>> > +        smp_wmb(); /* Synchronize with waiting CPUs. */
>> 
>> What "synchronization" does this barrier do?
> 
> For the ->do_work being set to zero - as the other CPUs are repeatedly reading it.

But barriers don't do any synchronization, they only guarantee
certain ordering requirements. I.e. a comment on a barrier would
normally indicate which operations it is that it enforces ordering
for.

>> > +        ASSERT(local_irq_is_enabled());
>> 
>> Is there really anything between the earlier identical ASSERT() and
>> this one which could leave interrupts off?
> 
> The patching calls (in later patches) - the load and unload hooks. Those
> could inadvertently enable interrupts.

In which case the ASSERT() would better be placed right after those
hooks (and get added alongside adding the hooks, which is pending
a decision on whether those should stay anyway).

>> > +        put_cpu_maps();
>> > +
>> > +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
>> > +               p->name, p->rc);
>> 
>> And no record left of what was done with that payload?
> 
> ? But it does. The rc is mentioned.. Or are you saying that it should
> also say whether it was applied/reverted/replaced, etc?

Yes, exactly.

>> > +    }
>> > +    else
>> > +    {
>> > +        /* Wait for all CPUs to rendezvous. */
>> > +        while ( xsplice_work.do_work && !xsplice_work.ready )
>> > +        {
>> > +            cpu_relax();
>> > +            smp_rmb();
>> 
>> What is the barrier doing inside this and the other loop below?
> 
> The goal is to get the updated value from ->do_work and ->ready.
> 
> That is to do the same thing we do with frontend and backends - sync
> out the rp/rc so that the other end can read the updated value.

But that's guaranteed already by cpu_relax() being a barrier (see
other loops in the code base invoking cpu_relax()). You have no
ordering requirement here, all you want is that the compiler not
move the two reads out of the loop.
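A standalone sketch of the pattern Jan describes (a hypothetical stand-in for Xen's cpu_relax(); on x86 the real one also issues a PAUSE):

```c
#include <assert.h>
#include <stdbool.h>

/* cpu_relax() is already a compiler barrier: it prevents the compiler
 * from hoisting the flag reads out of the loop, which is all a poll
 * loop like this needs - no smp_rmb() ordering is required. */
static inline void cpu_relax(void)
{
    __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
}

static volatile bool do_work, ready;

static void wait_for_rendezvous(void)
{
    while (do_work && !ready)
        cpu_relax();   /* forces fresh reads of both flags each pass */
}
```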

>> >  static int __init xsplice_init(void)
>> >  {
>> > +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
>> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
>> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
>> 
>> What assumptions get broken if these sizes change?
> 
> I think you mean the offsets? They can change. I added them to make sure
> that on 32-bit hypervisor (ARM) the offsets would be the same as on 64-bit
> hypervisor (x86).
> 
> The size however MUST remain the same - otherwise the toolstack won't 
> produce proper payloads anymore.

Well, that's a very bad thing then imo. You'd better version the
structure instead.

>> > --- a/xen/include/xen/xsplice.h
>> > +++ b/xen/include/xen/xsplice.h
>> > @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
>> >  struct xsplice_elf_sym;
>> >  struct xen_sysctl_xsplice_op;
>> >  
>> > +#include <xen/elfstructs.h>
>> > +/*
>> > + * The structure which defines the patching. This is what the hypervisor
>> > + * expects in the '.xsplice.func' section of the ELF file.
>> > + *
>> > + * This MUST be in sync with what the tools generate.
>> 
>> In which case this need to be part of the public interface, I would
>> say.
> 
> Ah yes. Where should it go? sysctl.h is where the XSPLICE ops are - or
> should this have its own public/xsplice.h header file?

If it's just this, then I don't think a separate header is warranted,
and hence putting it in sysctl.h would seem fine to me.

>> > +    uint8_t pad[24];
>> 
>> What is this padding field good for?
> 
> I want the size of the structure to be exactly 64-bytes across
> all architectures. That way I don't have to mess with compat layer
> or have the design doc be different for ARM vs x86.

For avoiding the compat layer, guaranteeing same size is not
sufficient anyway. And the design doc doesn't need to mention
structure sizes at all.

> It gives us enough space to stash other 'private' fields we may want
> or even expand this structure in the future and fill it out
> with extra fields. Argh, but for that we need a version field
> somewhere... Maybe another section .xsplice_version which just
> has a value of 1?
> 
> However the one thing I am not sure about is the padding.
> 
> The 'const char *name' on 32-bit ARM will be 4 bytes hence
> it will insert 4 byte padding (at least that is what pahole tells me,
> and the BUILD_BUG_ON confirms). But if I built with clang or other
> compiler it may very well ignore me.
> 
> Any suggestions? I could do:
> 
> struct xsplice_patch_func {
>     const char *name;
> #ifdef CONFIG_32
>     uint32_t _name_pad;
> #endif
>     uint64_t new_addr;
>     uint64_t old_addr;
>     uint32_t new_size;
>     uint32_t old_size;
>     uint8_t undo[8];
>     uint8_t pad[24];
> }
> 
> But that is not very nice.

Indeed. Perhaps a union, but I'm questioning the need of uniformity
across architectures in the first place.
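One possible union-based layout, sketched here only to illustrate the suggestion (this is not the agreed design): the name pointer shares an 8-byte slot on both 32- and 64-bit builds, so the field offsets match without any #ifdef padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout sketch; field names follow the patch under review. */
struct xsplice_patch_func_sketch {
    union {
        const char *name;
        uint64_t _name_slot;   /* forces an 8-byte slot on 32-bit too */
    };
    uint64_t new_addr;
    uint64_t old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t undo[8];
    uint8_t pad[24];
};
```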

> Also if we make this a public header I can't very well expose the
> 'undo' internals - it ought to be opaque. Ideas?

Split it, and link the image provided structure from the internally
used one. Or have the internally used one be a superset of the
one coming from the image, and simply copy the data (at once
allowing you to discard the respective section after loading the
image).

Jan
Konrad Rzeszutek Wilk April 7, 2016, 3:05 a.m. UTC | #4
> >> So you check for there being one such section. Is having multiple
> >> of them okay / meaningful?
> > 
> > /me blinks. You can have multiple ELF sections with the same name?
> > I will double-check the spec over the weekend to see.
> 
> Of course you can. Remember me telling you that using section
> names for identification is a bad idea (even if widely done that
> way)? That's precisely because section names in the spec serve
> no meaning other than making sections identifiable to the human
> eye. For certain sections there's a more or less strict convention
> on what their names should be, but that's all there is to names.
> Section types and attributes really are what describe their
> purposes.

This is going to take a bit of time to get right, I am afraid.
(The checks are easy - but making sure the generated payload files
do the right thing is not.)
> 
> And even if that really was forbidden by the spec, you'd still need
> to make sure the binary meets the spec.

Yes.
> >> > +void check_for_xsplice_work(void)
> >> > +{
> >> > +    /* Set at -1, so will go up to num_online_cpus - 1. */
> >> > +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
> >> > +    {
> >> > +        struct payload *p;
> >> > +        unsigned int total_cpus;
> >> > +
> >> > +        p = xsplice_work.data;
> >> > +        if ( !get_cpu_maps() )
> >> > +        {
> >> > +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
> >> > +                   XSPLICE, p->name, cpu);
> >> > +            per_cpu(work_to_do, cpu) = 0;
> >> > +            xsplice_work.data->rc = -EBUSY;
> >> > +            xsplice_work.do_work = 0;
> >> 
> >> On x86 such possibly simultaneous accesses may be okay, but is
> >> that universally the case? Wouldn't it better be only the monarch
> >> which updates shared data?
> > 
> > Monarch? Oh you mean arch specific code path?
> 
> Oh, I think you call it "master" elsewhere. IA64 terminology which
> I came to like.
> 

The master CPU is the one updating it.
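A standalone sketch of the master election used in check_for_xsplice_work() (C11 atomics stand in for Xen's atomic_inc_and_test(); names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The semaphore starts at -1, so exactly one CPU's increment takes it
 * to zero - that CPU becomes the master; all later CPUs see a
 * non-negative old value and take the slave path. */
static atomic_int sem = -1;

static bool try_become_master(void)
{
    /* fetch_add returns the previous value; old == -1 means the
     * increment reached zero, i.e. we won the race. */
    return atomic_fetch_add(&sem, 1) == -1;
}
```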
> >> > +        /* All CPUs are waiting, now signal to disable IRQs. */
> >> > +        xsplice_work.ready = 1;
> >> > +        smp_wmb();
> >> > +
> >> > +        atomic_inc(&xsplice_work.irq_semaphore);
> >> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
> >> > +                              "Timed out on IRQ semaphore") )
> >> 
> >> I'd prefer if the common parts of that message moved into
> >> xsplice_do_wait() - no reason to have more string literal space
> >> occupied than really needed. Also, looking back, the respective
> >> function parameter could do with a more descriptive name.
> >> 
> >> And then - does it make sense to wait the perhaps full 30ms
> >> on this second step? Rendezvoused CPUs should react rather
> >> quickly to the signal to disable interrupts.
> > 
> > We don't reset the timeout - the timeout is for both calls
> > to xsplice_do_wait.
> 
> I understand that's the way it's right now, but that's what I'm putting
> under question. Rendezvousing CPUs is quite a bit more at risk of
> taking some time compared to having already rendezvoused CPUs
> disable their IRQs.

Yes. I could expand the timeout, but maybe also reset it (add more
timeout) once CPUs come along?
> >> >  static int __init xsplice_init(void)
> >> >  {
> >> > +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
> >> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
> >> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
> >> 
> >> What assumptions get broken if these sizes change?
> > 
> > I think you mean the offsets? They can change. I added them to make sure
> > that on 32-bit hypervisor (ARM) the offsets would be the same as on 64-bit
> > hypervisor (x86).
> > 
> > The size however MUST remain the same - otherwise the toolstack won't 
> > produce proper payloads anymore.
> 
> Well, that's a very bad thing then imo. You'd better version the
> structure instead.

Andrew reminded me that we also will have build-ids. Those by nature imply
a version. That is, the payload won't ever apply against a different
hypervisor build.

But I still added a version field in the struct so that when we get to
'checking the old code' we have a mechanism to select the build-id or
the newer way.
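A sketch of that version gate (the constant and function are hypothetical, added here only to illustrate the idea): the payload carries a layout version and the hypervisor refuses anything it does not understand, rather than misparsing it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative: reject payloads built against an unknown struct layout.
 * EOPNOTSUPP (not EINVAL) since the input may be valid, just unhandled. */
#define XSPLICE_PAYLOAD_VERSION 1

static int check_payload_version(uint32_t v)
{
    return (v == XSPLICE_PAYLOAD_VERSION) ? 0 : -EOPNOTSUPP;
}
```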
Konrad Rzeszutek Wilk April 7, 2016, 3:09 a.m. UTC | #5
> That's for an individual patch I suppose? What if REPLACE has to
> revert dozens or hundreds of patches?

I don't know. That is something I need to figure out.
I also need to test this out on the 8 socket machine..

> > +void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
> > +{
> > +    uint32_t val;
> 
> The way it's being used below, it clearly should be int32_t.
> 
> > +    uint8_t *old_ptr;
> > +
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > +
> > +    old_ptr = (uint8_t *)func->old_addr;
> 
> (Considering this cast, the "old_addr" member should be
> unsigned long (or void *), not uint64_t: The latest on ARM32
> such would otherwise cause problems.)

It has to be uint8_t to make the single-byte modifications. Keep
in mind also that this file is only for x86.

> 
> Also - where is the verification that
> func->old_size >= PATCH_INSN_SIZE?

I added an 'arch_xsplice_verify_func' to have a per-architecture check
for that.
..
> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> > +        if ( !sec )
> > +        {
> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
> > +                   XSPLICE, elf->name, names[i]);
> > +            return -EINVAL;
> > +        }
> > +
> > +        if ( !sec->sec->sh_size )
> > +            return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> So you check for there being one such section. Is having multiple
> of them okay / meaningful?

I've been going back and forth on this. I think at this point I will say no.
I am going to look at what is involved in later versions.
Jan Beulich April 7, 2016, 3:38 p.m. UTC | #6
>>> On 07.04.16 at 05:05, <konrad.wilk@oracle.com> wrote:
>> >> > +        /* All CPUs are waiting, now signal to disable IRQs. */
>> >> > +        xsplice_work.ready = 1;
>> >> > +        smp_wmb();
>> >> > +
>> >> > +        atomic_inc(&xsplice_work.irq_semaphore);
>> >> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
>> >> > +                              "Timed out on IRQ semaphore") )
>> >> 
>> >> I'd prefer if the common parts of that message moved into
>> >> xsplice_do_wait() - no reason to have more string literal space
>> >> occupied than really needed. Also, looking back, the respective
>> >> function parameter could do with a more descriptive name.
>> >> 
>> >> And then - does it make sense to wait the perhaps full 30ms
>> >> on this second step? Rendezvoused CPUs should react rather
>> >> quickly to the signal to disable interrupts.
>> > 
>> > We don't reset the timeout - the timeout is for both calls
>> > to xsplice_do_wait.
>> 
>> I understand that's the way it's right now, but that's what I'm putting
>> under question. Rendezvousing CPUs is quite a bit more at risk of
>> taking some time compared to having already rendezvoused CPUs
>> disable their IRQs.
> 
> Yes. I could expand the timeout, but maybe have some reset (add more
> timeout) once CPUs  come along?

Expand the timeout? Add more timeout? I don't understand. My
point was about shortening the timeout on the second step.

Jan
Jan Beulich April 7, 2016, 3:43 p.m. UTC | #7
>>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
>> > +    uint8_t *old_ptr;
>> > +
>> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
>> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
>> > +
>> > +    old_ptr = (uint8_t *)func->old_addr;
>> 
>> (Considering this cast, the "old_addr" member should be
>> unsigned long (or void *), not uint64_t: The latest on ARM32
>> such would otherwise cause problems.)
> 
> I has to be uint8_t to make the single byte modifications. Keep
> also in mind that this file is only for x86.

old_addr can't reasonably be uint8_t, so I can only assume you're
mixing up things here. (And yes, I do realize this is x86 code, but
my reference to ARM32 was only meant to say that there you'll
need to do something similar, and casting uint64_t to whatever
kind of pointer type is not going to work without compiler warning.)

>> Also - where is the verification that
>> func->old_size >= PATCH_INSN_SIZE?
> 
> I made a 'arch_xsplice_verify_func' to have per-architecture check for
> that.
> ..
>> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
>> > +        if ( !sec )
>> > +        {
>> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
>> > +                   XSPLICE, elf->name, names[i]);
>> > +            return -EINVAL;
>> > +        }
>> > +
>> > +        if ( !sec->sec->sh_size )
>> > +            return -EINVAL;
>> > +    }
>> > +
>> > +    return 0;
>> > +}
>> 
>> So you check for there being one such section. Is having multiple
>> of them okay / meaningful?
> 
> I've been going back on this. I think at this point I will say no.
> I am going to look at what is involved in this in later versions.

And check for there being exactly one for now, I then assume?

Jan
Konrad Rzeszutek Wilk April 9, 2016, 2:42 p.m. UTC | #8
On Thu, Apr 07, 2016 at 09:38:53AM -0600, Jan Beulich wrote:
> >>> On 07.04.16 at 05:05, <konrad.wilk@oracle.com> wrote:
> >> >> > +        /* All CPUs are waiting, now signal to disable IRQs. */
> >> >> > +        xsplice_work.ready = 1;
> >> >> > +        smp_wmb();
> >> >> > +
> >> >> > +        atomic_inc(&xsplice_work.irq_semaphore);
> >> >> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
> >> >> > +                              "Timed out on IRQ semaphore") )
> >> >> 
> >> >> I'd prefer if the common parts of that message moved into
> >> >> xsplice_do_wait() - no reason to have more string literal space
> >> >> occupied than really needed. Also, looking back, the respective
> >> >> function parameter could do with a more descriptive name.
> >> >> 
> >> >> And then - does it make sense to wait the perhaps full 30ms
> >> >> on this second step? Rendezvoused CPUs should react rather
> >> >> quickly to the signal to disable interrupts.
> >> > 
> >> > We don't reset the timeout - the timeout is for both calls
> >> > to xsplice_do_wait.
> >> 
> >> I understand that's the way it's right now, but that's what I'm putting
> >> under question. Rendezvousing CPUs is quite a bit more at risk of
> >> taking some time compared to having already rendezvoused CPUs
> >> disable their IRQs.
> > 
> > Yes. I could expand the timeout, but maybe have some reset (add more
> > timeout) once CPUs  come along?
> 
> Expand the timeout? Add more timeout? I don't understand. My
> point was about shortening the timeout on the second step.

By how much?

The clock (timeout) starts ticking the moment schedule_work
is called - to quiesce all the CPUs. Adding an acceleration once
they have passed the #1 stage modifies the semantics of the
timeout ("well, it was 30ms, but once it goes past phase #1 it
is shortened by half (or is it a quarter?)").

Should we expose both timeouts to the sysctl so that the user can
customize the acceleration ratio?

Perhaps we can fiddle with this later?
> 
> Jan
>
Konrad Rzeszutek Wilk April 10, 2016, 2:36 a.m. UTC | #9
On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
> >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
> >> > +    uint8_t *old_ptr;
> >> > +
> >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> >> > +
> >> > +    old_ptr = (uint8_t *)func->old_addr;
> >> 
> >> (Considering this cast, the "old_addr" member should be
> >> unsigned long (or void *), not uint64_t: The latest on ARM32
> >> such would otherwise cause problems.)
> > 
> > I has to be uint8_t to make the single byte modifications. Keep
> > also in mind that this file is only for x86.
> 
> old_addr can't reasonably be uint8_t, so I can only assume you're
> mixing up things here. (And yes, I do realize this is x86 code, but
> my reference to ARM32 was only mean to say that there you'll
> need to do something similar, and casting uint64_t to whatever
> kind of pointer type is not going to work without compiler warning.)

Way back, when we spoke about the .xsplice.funcs structure,
you recommended making the types either uintXX-specific
or Elf types. I chose Elf types, but then we realized that
the ARM32 hypervisor would be involved, which of course would
have a different structure size. So I went with uintXX
to save a bit of headache (specifically those BUILD_BUG_ON
checks).

I can't see making old_addr unsigned long or void *,
which would mean going back to Elf types. And for ARM32 I would
have to deal with a different structure size.
Konrad Rzeszutek Wilk April 10, 2016, 2:45 a.m. UTC | #10
On Sat, Apr 09, 2016 at 10:36:00PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
> > >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
> > >> > +    uint8_t *old_ptr;
> > >> > +
> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > >> > +
> > >> > +    old_ptr = (uint8_t *)func->old_addr;
> > >> 
> > >> (Considering this cast, the "old_addr" member should be
> > >> unsigned long (or void *), not uint64_t: The latest on ARM32
> > >> such would otherwise cause problems.)
> > > 
> > > I has to be uint8_t to make the single byte modifications. Keep
> > > also in mind that this file is only for x86.
> > 
> > old_addr can't reasonably be uint8_t, so I can only assume you're
> > mixing up things here. (And yes, I do realize this is x86 code, but
> > my reference to ARM32 was only mean to say that there you'll
> > need to do something similar, and casting uint64_t to whatever
> > kind of pointer type is not going to work without compiler warning.)
> 
> Way back .. when we spoke about the .xsplice.funcs structure
> you recommended to make the types be either uintXX specific
> or Elf types. I choose Elf types but then we realized that
> ARM32 hypervisor would be involved which of course would have
> a different size of the structure. So I went with uintXXX
> to save a bit of headache (specifically those BUILD_BUG_ON
> checks).
> 
> I can't see making the old_addr be unsigned long or void *,
> which means going back to Elf types. And for ARM32 I will
> have to deal with a different structure size. 

Oh gosh, that is going to be a problem with our headers, as
I would now be exposing the 'xsplice_patch_func' structure
in a public header, which would depend on Elf_X types.

And we do not expose those. Moving the whole of elfstructs.h
to a public directory is a bit harsh.
Konrad Rzeszutek Wilk April 10, 2016, 7:47 p.m. UTC | #11
On Sat, Apr 09, 2016 at 10:36:02PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
> > >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
> > >> > +    uint8_t *old_ptr;
> > >> > +
> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > >> > +
> > >> > +    old_ptr = (uint8_t *)func->old_addr;
> > >> 
> > >> (Considering this cast, the "old_addr" member should be
> > >> unsigned long (or void *), not uint64_t: The latest on ARM32
> > >> such would otherwise cause problems.)
> > > 
> > > I has to be uint8_t to make the single byte modifications. Keep
> > > also in mind that this file is only for x86.
> > 
> > old_addr can't reasonably be uint8_t, so I can only assume you're
> > mixing up things here. (And yes, I do realize this is x86 code, but
> > my reference to ARM32 was only mean to say that there you'll
> > need to do something similar, and casting uint64_t to whatever
> > kind of pointer type is not going to work without compiler warning.)
> 
> Way back .. when we spoke about the .xsplice.funcs structure
> you recommended to make the types be either uintXX specific
> or Elf types. I choose Elf types but then we realized that
> ARM32 hypervisor would be involved which of course would have
> a different size of the structure. So I went with uintXXX
> to save a bit of headache (specifically those BUILD_BUG_ON
> checks).
> 
> I can't see making the old_addr be unsigned long or void *,
> which means going back to Elf types. And for ARM32 I will
> have to deal with a different structure size. 

CC-ing Stefano and Julien here to advise. I ended up
exposing the ABI part in sysctl.h (and the design document) as:


#define XSPLICE_PAYLOAD_VERSION 1
/*
 * .xsplice.funcs structure layout defined in the `Payload format`
 * section in the xSplice design document.
 *
 * The size should be exactly 64 bytes.
 */
struct xsplice_patch_func {
    const char *name;       /* Name of function to be patched. */
    uint64_t new_addr;
    uint64_t old_addr;      /* Can be zero and name will be looked up. */
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;        /* MUST be XSPLICE_PAYLOAD_VERSION. */
    uint8_t pad[31];        /* MUST be zero filled. */
};
typedef struct xsplice_patch_func xsplice_patch_func_t;

Which looks nice.

When the ELF file is loaded we load it as this structure:

[x86]
#ifndef CONFIG_ARM
struct xsplice_patch_func_internal {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    union {
        uint8_t undo[8];
        uint8_t pad[31];
    } u;
};
#else
[ARM]
struct xsplice_patch_func_internal {
    const char *name;
    uint32_t _pad0;
    void *new_addr;
    uint32_t _pad1;
    void *old_addr;
    uint32_t _pad2;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    union {
        uint8_t pad[31];
    } u;
};
#endif

That allows the size and offsets to be the same. I checked under ARM32
builds:


struct xsplice_patch_func_internal {
    const char  *              name;                 /*     0     4 */
    uint32_t                   _pad0;                /*     4     4 */
    void *                     new_addr;             /*     8     4 */
    uint32_t                   _pad1;                /*    12     4 */
    void *                     old_addr;             /*    16     4 */
    uint32_t                   _pad2;                /*    20     4 */
    uint32_t                   new_size;             /*    24     4 */
    uint32_t                   old_size;             /*    28     4 */
    uint8_t                    version;              /*    32     1 */
    union {
        uint8_t            pad[31];              /*          31 */
    } u;                                             /*    33    31 */
    /* --- cacheline 1 boundary (64 bytes) --- */

    /* size: 64, cachelines: 1, members: 10 */
};

So it all looks correct (and I can cast old_addr to uint8_t *).
Naturally we can expand the pad[] to hold whatever is needed
when implementing xSplice under ARM.

However I would appreciate advice from the ARM folks in case I made
some wrong assumptions.
Stefano Stabellini April 10, 2016, 8:58 p.m. UTC | #12
On Sun, 10 Apr 2016, Konrad Rzeszutek Wilk wrote:
> On Sat, Apr 09, 2016 at 10:36:02PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
> > > >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
> > > >> > +    uint8_t *old_ptr;
> > > >> > +
> > > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> > > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> > > >> > +
> > > >> > +    old_ptr = (uint8_t *)func->old_addr;
> > > >> 
> > > >> (Considering this cast, the "old_addr" member should be
> > > >> unsigned long (or void *), not uint64_t: The latest on ARM32
> > > >> such would otherwise cause problems.)
> > > > 
> > > > I has to be uint8_t to make the single byte modifications. Keep
> > > > also in mind that this file is only for x86.
> > > 
> > > old_addr can't reasonably be uint8_t, so I can only assume you're
> > > mixing up things here. (And yes, I do realize this is x86 code, but
> > > my reference to ARM32 was only mean to say that there you'll
> > > need to do something similar, and casting uint64_t to whatever
> > > kind of pointer type is not going to work without compiler warning.)
> > 
> > Way back .. when we spoke about the .xsplice.funcs structure
> > you recommended to make the types be either uintXX specific
> > or Elf types. I choose Elf types but then we realized that
> > ARM32 hypervisor would be involved which of course would have
> > a different size of the structure. So I went with uintXXX
> > to save a bit of headache (specifically those BUILD_BUG_ON
> > checks).
> > 
> > I can't see making the old_addr be unsigned long or void *,
> > which means going back to Elf types. And for ARM32 I will
> > have to deal with a different structure size. 
>
> CC-ing Stefano and Julien here to advise. I ended up 
> exposing the ABI part in sysctl.h (and design document as):
> 
> 
> #define XSPLICE_PAYLOAD_VERSION 1
> /*
>  * .xsplice.funcs structure layout defined in the `Payload format`
>  * section in the xSplice design document.
>  *
>  * The size should be exactly 64 bytes.
>  */
> struct xsplice_patch_func {
>     const char *name;       /* Name of function to be patched. */
>     uint64_t new_addr;
>     uint64_t old_addr;      /* Can be zero and name will be looked up. */
>     uint32_t new_size;
>     uint32_t old_size;
>     uint8_t version;        /* MUST be XSPLICE_PAYLOAD_VERSION. */
>     uint8_t pad[31];        /* MUST be zero filled. */
> };
> typedef struct xsplice_patch_func xsplice_patch_func_t;
> 
> Which looks nice.
> 
> When the ELF file is loaded we load it as this structure:
> 
> [x86]
> #ifndef CONFIG_ARM
> struct xsplice_patch_func_internal {
>     const char *name;
>     void *new_addr;
>     void *old_addr;
>     uint32_t new_size;
>     uint32_t old_size;
>     uint8_t version;
>     union {
>         uint8_t undo[8];
>         uint8_t pad[31];
>     } u;
> };
> #else
> [ARM]
> struct xsplice_patch_func_internal {
>     const char *name;
>     uint32_t _pad0;
>     void *new_addr;
>     uint32_t _pad1;
>     void *old_addr;
>     uint32_t _pad2;
>     uint32_t new_size;
>     uint32_t old_size;
>     uint8_t version;
>     union {
>         uint8_t pad[31];
>     } u;
> };
> #endif
> 
> That allows the size and offsets to be the same. I checked under ARM32
> builds:
> 
> 
> struct xsplice_patch_func_internal {
>     const char  *              name;                 /*     0     4 */
>     uint32_t                   _pad0;                /*     4     4 */
>     void *                     new_addr;             /*     8     4 */
>     uint32_t                   _pad1;                /*    12     4 */
>     void *                     old_addr;             /*    16     4 */
>     uint32_t                   _pad2;                /*    20     4 */
>     uint32_t                   new_size;             /*    24     4 */
>     uint32_t                   old_size;             /*    28     4 */
>     uint8_t                    version;              /*    32     1 */
>     union {
>         uint8_t            pad[31];              /*          31 */
>     } u;                                             /*    33    31 */
>     /* --- cacheline 1 boundary (64 bytes) --- */
> 
>     /* size: 64, cachelines: 1, members: 10 */
> };
> 
> So it all looks correct. (and I can cast the old_addr to uint8_t).
> Naturally we can expand the pad[] to hold whatever is needed
> when implementing xSplice under ARM
> 
> However I would appreciate advise from ARM folks if I made some
> wrong assumptions..

It looks good. I take it that ARM64 will be like x86. In that case, instead
of #ifdef'ing x86 or ARM, I would #ifdef BITS_PER_LONG or POINTER_ALIGN.
Jan Beulich April 11, 2016, 3:38 p.m. UTC | #13
>>> On 09.04.16 at 16:42, <konrad@kernel.org> wrote:
> On Thu, Apr 07, 2016 at 09:38:53AM -0600, Jan Beulich wrote:
>> >>> On 07.04.16 at 05:05, <konrad.wilk@oracle.com> wrote:
>> >> >> > +        /* All CPUs are waiting, now signal to disable IRQs. */
>> >> >> > +        xsplice_work.ready = 1;
>> >> >> > +        smp_wmb();
>> >> >> > +
>> >> >> > +        atomic_inc(&xsplice_work.irq_semaphore);
>> >> >> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, 
> total_cpus,
>> >> >> > +                              "Timed out on IRQ semaphore") )
>> >> >> 
>> >> >> I'd prefer if the common parts of that message moved into
>> >> >> xsplice_do_wait() - no reason to have more string literal space
>> >> >> occupied than really needed. Also, looking back, the respective
>> >> >> function parameter could do with a more descriptive name.
>> >> >> 
>> >> >> And then - does it make sense to wait the perhaps full 30ms
>> >> >> on this second step? Rendezvoused CPUs should react rather
>> >> >> quickly to the signal to disable interrupts.
>> >> > 
>> >> > We don't reset the timeout - the timeout is for both calls
>> >> > to xsplice_do_wait.
>> >> 
>> >> I understand that's the way it's right now, but that's what I'm putting
>> >> under question. Rendezvousing CPUs is quite a bit more at risk of
>> >> taking some time compared to having already rendezvoused CPUs
>> >> disable their IRQs.
>> > 
>> > Yes. I could expand the timeout, but maybe have some reset (add more
>> > timeout) once CPUs  come along?
>> 
>> Expand the timeout? Add more timeout? I don't understand. My
>> point was about shortening the timeout on the second step.
> 
> By how much?

1ms would seem more than enough to me at a first glance.

> The clock (timeout) starts ticking the moment the schedule_work
> is called - to quiten all the CPUs. Adding an acceleration once
> they have passed the #1 stage is modifying the semantics of the
> timeout ("well, it was 30ms, but once it goes over phase #1 it
> is shorten by half (or is it a quarter?").
> 
> Should we expose both timeouts to the sysctl so that the user can
> customize the acceleration ratio?
> 
> Perhaps we can fiddle with this later?

Sure. It was just a thought, not an immediate requirement.

Jan
Jan Beulich April 11, 2016, 3:41 p.m. UTC | #14
>>> On 10.04.16 at 04:45, <konrad@kernel.org> wrote:
> On Sat, Apr 09, 2016 at 10:36:00PM -0400, Konrad Rzeszutek Wilk wrote:
>> On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
>> > >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
>> > >> > +    uint8_t *old_ptr;
>> > >> > +
>> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
>> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
>> > >> > +
>> > >> > +    old_ptr = (uint8_t *)func->old_addr;
>> > >> 
>> > >> (Considering this cast, the "old_addr" member should be
>> > >> unsigned long (or void *), not uint64_t: The latest on ARM32
>> > >> such would otherwise cause problems.)
>> > > 
>> > > I has to be uint8_t to make the single byte modifications. Keep
>> > > also in mind that this file is only for x86.
>> > 
>> > old_addr can't reasonably be uint8_t, so I can only assume you're
>> > mixing up things here. (And yes, I do realize this is x86 code, but
>> > my reference to ARM32 was only mean to say that there you'll
>> > need to do something similar, and casting uint64_t to whatever
>> > kind of pointer type is not going to work without compiler warning.)
>> 
>> Way back .. when we spoke about the .xsplice.funcs structure
>> you recommended to make the types be either uintXX specific
>> or Elf types. I choose Elf types but then we realized that
>> ARM32 hypervisor would be involved which of course would have
>> a different size of the structure. So I went with uintXXX
>> to save a bit of headache (specifically those BUILD_BUG_ON
>> checks).
>> 
>> I can't see making the old_addr be unsigned long or void *,
>> which means going back to Elf types. And for ARM32 I will
>> have to deal with a different structure size. 
> 
> Oh gosh, that is going to be problem with our headers as
> I would be now exposing the 'xsplice_patch_func' structure
> in a public header which would depend on Elf_X types.

So how about uintptr_t? Not exactly the thing we do in public
headers already, but at least a possibility. Or else maybe
xen_ulong_t, albeit on ARM32 that again won't cast well to
pointer types.

Jan
Jan Beulich April 11, 2016, 3:44 p.m. UTC | #15
>>> On 10.04.16 at 21:47, <konrad.wilk@oracle.com> wrote:
> That allows the size and offsets to be the same. I checked under ARM32
> builds:
> 
> 
> struct xsplice_patch_func_internal {
>     const char  *              name;                 /*     0     4 */
>     uint32_t                   _pad0;                /*     4     4 */
>     void *                     new_addr;             /*     8     4 */
>     uint32_t                   _pad1;                /*    12     4 */
>     void *                     old_addr;             /*    16     4 */
>     uint32_t                   _pad2;                /*    20     4 */
>     uint32_t                   new_size;             /*    24     4 */
>     uint32_t                   old_size;             /*    28     4 */
>     uint8_t                    version;              /*    32     1 */
>     union {
>         uint8_t            pad[31];              /*          31 */
>     } u;                                             /*    33    31 */
>     /* --- cacheline 1 boundary (64 bytes) --- */
> 
>     /* size: 64, cachelines: 1, members: 10 */
> };
> 
> So it all looks correct. (and I can cast the old_addr to uint8_t).
> Naturally we can expand the pad[] to hold whatever is needed
> when implementing xSplice under ARM

I still don't get: Why is it so important for this structure to have
the same size and layout across architectures?

Jan
Konrad Rzeszutek Wilk April 11, 2016, 3:50 p.m. UTC | #16
On Mon, Apr 11, 2016 at 09:44:55AM -0600, Jan Beulich wrote:
> >>> On 10.04.16 at 21:47, <konrad.wilk@oracle.com> wrote:
> > That allows the size and offsets to be the same. I checked under ARM32
> > builds:
> > 
> > 
> > struct xsplice_patch_func_internal {
> >     const char  *              name;                 /*     0     4 */
> >     uint32_t                   _pad0;                /*     4     4 */
> >     void *                     new_addr;             /*     8     4 */
> >     uint32_t                   _pad1;                /*    12     4 */
> >     void *                     old_addr;             /*    16     4 */
> >     uint32_t                   _pad2;                /*    20     4 */
> >     uint32_t                   new_size;             /*    24     4 */
> >     uint32_t                   old_size;             /*    28     4 */
> >     uint8_t                    version;              /*    32     1 */
> >     union {
> >         uint8_t            pad[31];              /*          31 */
> >     } u;                                             /*    33    31 */
> >     /* --- cacheline 1 boundary (64 bytes) --- */
> > 
> >     /* size: 64, cachelines: 1, members: 10 */
> > };
> > 
> > So it all looks correct. (and I can cast the old_addr to uint8_t).
> > Naturally we can expand the pad[] to hold whatever is needed
> > when implementing xSplice under ARM
> 
> I still don't get: Why is it so important for this structure to have
> the same size and layout across architectures?

I have a very bad taste left over from the blkif.h request structure,
where the structure size is 112 or 108 bytes depending on 32- vs 64-bit.

Having the same offsets for the members makes it much simpler for the
tools to generate the payload, without having to worry about sizes or
offsets. It also means that the common code's BUILD_BUG_ON checks can
be generic across ARM and x86.
Jan Beulich April 11, 2016, 4:05 p.m. UTC | #17
>>> On 11.04.16 at 17:50, <konrad.wilk@oracle.com> wrote:
> On Mon, Apr 11, 2016 at 09:44:55AM -0600, Jan Beulich wrote:
>> >>> On 10.04.16 at 21:47, <konrad.wilk@oracle.com> wrote:
>> > That allows the size and offsets to be the same. I checked under ARM32
>> > builds:
>> > 
>> > 
>> > struct xsplice_patch_func_internal {
>> >     const char  *              name;                 /*     0     4 */
>> >     uint32_t                   _pad0;                /*     4     4 */
>> >     void *                     new_addr;             /*     8     4 */
>> >     uint32_t                   _pad1;                /*    12     4 */
>> >     void *                     old_addr;             /*    16     4 */
>> >     uint32_t                   _pad2;                /*    20     4 */
>> >     uint32_t                   new_size;             /*    24     4 */
>> >     uint32_t                   old_size;             /*    28     4 */
>> >     uint8_t                    version;              /*    32     1 */
>> >     union {
>> >         uint8_t            pad[31];              /*          31 */
>> >     } u;                                             /*    33    31 */
>> >     /* --- cacheline 1 boundary (64 bytes) --- */
>> > 
>> >     /* size: 64, cachelines: 1, members: 10 */
>> > };
>> > 
>> > So it all looks correct. (and I can cast the old_addr to uint8_t).
>> > Naturally we can expand the pad[] to hold whatever is needed
>> > when implementing xSplice under ARM
>> 
>> I still don't get: Why is it so important for this structure to have
>> the same size and layout across architectures?
> 
> I have a very bad taste from the blkif.h struct request where the structure
> size is 112 or 108 depending on the 32 vs 64.

The two have very little - if anything - in common imo.

> Having the same offsets for variables should make it easier for the
> tools to generate the payload much simpler without having to worry
> about the sizes or offset. Also it means that the common code BUILT_IN_BUG
> can be generic across ARM and x86.

None of this seems overly hard to retain even if the structure
size weren't consistent - just that you couldn't use a literal 64
anymore, for example.

Jan
Konrad Rzeszutek Wilk April 11, 2016, 11:29 p.m. UTC | #18
On Apr 11, 2016 11:41 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>
> >>> On 10.04.16 at 04:45, <konrad@kernel.org> wrote:
> > On Sat, Apr 09, 2016 at 10:36:00PM -0400, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Apr 07, 2016 at 09:43:38AM -0600, Jan Beulich wrote:
> >> > >>> On 07.04.16 at 05:09, <konrad.wilk@oracle.com> wrote:
> >> > >> > +    uint8_t *old_ptr;
> >> > >> > +
> >> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> >> > >> > +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> >> > >> > +
> >> > >> > +    old_ptr = (uint8_t *)func->old_addr;
> >> > >>
> >> > >> (Considering this cast, the "old_addr" member should be
> >> > >> unsigned long (or void *), not uint64_t: The latest on ARM32
> >> > >> such would otherwise cause problems.)
> >> > >
> >> > > I has to be uint8_t to make the single byte modifications. Keep
> >> > > also in mind that this file is only for x86.
> >> >
> >> > old_addr can't reasonably be uint8_t, so I can only assume you're
> >> > mixing up things here. (And yes, I do realize this is x86 code, but
> >> > my reference to ARM32 was only mean to say that there you'll
> >> > need to do something similar, and casting uint64_t to whatever
> >> > kind of pointer type is not going to work without compiler warning.)
> >>
> >> Way back .. when we spoke about the .xsplice.funcs structure
> >> you recommended to make the types be either uintXX specific
> >> or Elf types. I choose Elf types but then we realized that
> >> ARM32 hypervisor would be involved which of course would have
> >> a different size of the structure. So I went with uintXXX
> >> to save a bit of headache (specifically those BUILD_BUG_ON
> >> checks).
> >>
> >> I can't see making the old_addr be unsigned long or void *,
> >> which means going back to Elf types. And for ARM32 I will
> >> have to deal with a different structure size.
> >
> > Oh gosh, that is going to be problem with our headers as
> > I would be now exposing the 'xsplice_patch_func' structure
> > in a public header which would depend on Elf_X types.
>
> So how about uintptr_t? Not exactly the thing we do in public
> headers already, but at least a possibility. Or else maybe
> xen_ulong_t, albeit on ARM32 that again won't cast well to
> pointer types.

I ended up (see the v7 patches) making it uint64_t in the public header
and void * in the private one.

>
> Jan
>
diff mbox

Patch

diff --git a/docs/misc/xsplice.markdown b/docs/misc/xsplice.markdown
index a5303f0..28140dd 100644
--- a/docs/misc/xsplice.markdown
+++ b/docs/misc/xsplice.markdown
@@ -841,7 +841,8 @@  The implementation must also have a mechanism for:
  * Be able to lookup in the Xen hypervisor the symbol names of functions from the ELF payload.
  * Be able to patch .rodata, .bss, and .data sections.
  * Further safety checks (blacklist of which functions cannot be patched, check
-   the stack, make sure the payload is built with same compiler as hypervisor).
+   the stack, make sure the payload is built with same compiler as hypervisor,
+   and, for right now, NMI/MCE handlers and do_nmi - until a safe solution is found).
  * NOP out the code sequence if `new_size` is zero.
  * Deal with other relocation types:  R_X86_64_[8,16,32,32S], R_X86_64_PC[8,16,64] in payload file.
 
diff --git a/xen/arch/arm/xsplice.c b/xen/arch/arm/xsplice.c
index e9c49ab..4fb19b3 100644
--- a/xen/arch/arm/xsplice.c
+++ b/xen/arch/arm/xsplice.c
@@ -6,6 +6,26 @@ 
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
+void arch_xsplice_patching_enter(void)
+{
+}
+
+void arch_xsplice_patching_leave(void)
+{
+}
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
+{
+}
+
+void arch_xsplice_revert_jmp(struct xsplice_patch_func *func)
+{
+}
+
+void arch_xsplice_post_action(void)
+{
+}
+
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf, void *data)
 {
     return -ENOSYS;
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 6ec7554..c63ee21 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -36,6 +36,7 @@ 
 #include <xen/cpu.h>
 #include <xen/wait.h>
 #include <xen/guest_access.h>
+#include <xen/xsplice.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
 #include <asm/regs.h>
@@ -120,6 +121,7 @@  static void idle_loop(void)
         (*pm_idle)();
         do_tasklet();
         do_softirq();
+        check_for_xsplice_work(); /* Must be last. */
     }
 }
 
@@ -136,6 +138,7 @@  void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    check_for_xsplice_work();
     reset_stack_and_jump(idle_loop);
 }
 
@@ -143,6 +146,7 @@  static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
     mark_regs_dirty(guest_cpu_user_regs());
+    check_for_xsplice_work();
     reset_stack_and_jump(ret_from_intr);
 }
 
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 7634c3f..bbb0a73 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -26,6 +26,7 @@ 
 #include <xen/hypercall.h>
 #include <xen/domain_page.h>
 #include <xen/xenoprof.h>
+#include <xen/xsplice.h>
 #include <asm/current.h>
 #include <asm/io.h>
 #include <asm/paging.h>
@@ -1096,6 +1097,7 @@  static void noreturn svm_do_resume(struct vcpu *v)
 
     hvm_do_resume(v);
 
+    check_for_xsplice_work();
     reset_stack_and_jump(svm_asm_do_resume);
 }
 
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 8284281..ec454dd 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -25,6 +25,7 @@ 
 #include <xen/kernel.h>
 #include <xen/keyhandler.h>
 #include <xen/vm_event.h>
+#include <xen/xsplice.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -1722,6 +1723,7 @@  void vmx_do_resume(struct vcpu *v)
     }
 
     hvm_do_resume(v);
+    check_for_xsplice_work();
     reset_stack_and_jump(vmx_asm_do_vmentry);
 }
 
diff --git a/xen/arch/x86/xsplice.c b/xen/arch/x86/xsplice.c
index 7149540..1f77bd8 100644
--- a/xen/arch/x86/xsplice.c
+++ b/xen/arch/x86/xsplice.c
@@ -10,6 +10,46 @@ 
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
+#define PATCH_INSN_SIZE 5
+
+void arch_xsplice_patching_enter(void)
+{
+    /* Disable WP to allow changes to read-only pages. */
+    write_cr0(read_cr0() & ~X86_CR0_WP);
+}
+
+void arch_xsplice_patching_leave(void)
+{
+    /* Reinstate WP. */
+    write_cr0(read_cr0() | X86_CR0_WP);
+}
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
+{
+    uint32_t val;
+    uint8_t *old_ptr;
+
+    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
+    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
+
+    old_ptr = (uint8_t *)func->old_addr;
+    memcpy(func->undo, old_ptr, PATCH_INSN_SIZE);
+
+    *old_ptr++ = 0xe9; /* Relative jump */
+    val = func->new_addr - func->old_addr - PATCH_INSN_SIZE;
+    memcpy(old_ptr, &val, sizeof val);
+}
+
+void arch_xsplice_revert_jmp(struct xsplice_patch_func *func)
+{
+    memcpy((void *)func->old_addr, func->undo, PATCH_INSN_SIZE);
+}
+
+void arch_xsplice_post_action(void)
+{
+    cpuid_eax(0);
+}
+
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf, void *data)
 {
 
diff --git a/xen/common/xsplice.c b/xen/common/xsplice.c
index 4d24898..a0d7fe0 100644
--- a/xen/common/xsplice.c
+++ b/xen/common/xsplice.c
@@ -3,6 +3,7 @@ 
  *
  */
 
+#include <xen/cpu.h>
 #include <xen/err.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
@@ -11,17 +12,29 @@ 
 #include <xen/mm.h>
 #include <xen/sched.h>
 #include <xen/smp.h>
+#include <xen/softirq.h>
 #include <xen/spinlock.h>
 #include <xen/vmap.h>
+#include <xen/wait.h>
 #include <xen/xsplice_elf.h>
 #include <xen/xsplice.h>
 
 #include <asm/event.h>
+#include <asm/nmi.h>
 #include <public/sysctl.h>
 
+/*
+ * Protects the payload_list operations and also allows only one
+ * caller into schedule_work.
+ */
 static DEFINE_SPINLOCK(payload_lock);
 static LIST_HEAD(payload_list);
 
+/*
+ * Patches which have been applied.
+ */
+static LIST_HEAD(applied_list);
+
 static unsigned int payload_cnt;
 static unsigned int payload_version = 1;
 
@@ -38,9 +51,31 @@  struct payload {
     size_t payload_pages;                /* Nr of the pages for the text_addr;
                                             rw_addr, and ro_addr (if any) */
     mfn_t *mfn;                          /* Array of MFNs of the pages. */
+    struct list_head applied_list;       /* Linked to 'applied_list'. */
+    struct xsplice_patch_func *funcs;    /* The array of functions to patch. */
+    unsigned int nfuncs;                 /* Nr of functions to patch. */
     char name[XEN_XSPLICE_NAME_SIZE];    /* Name of it. */
 };
 
+/* Defines an outstanding patching action. */
+struct xsplice_work
+{
+    atomic_t semaphore;          /* Used for rendezvous. First to grab it will
+                                    do the patching. */
+    atomic_t irq_semaphore;      /* Used to signal all IRQs disabled. */
+    s_time_t timeout;                    /* Timeout to do the operation. */
+    struct payload *data;        /* The payload on which to act. */
+    volatile bool_t do_work;     /* Signals work to do. */
+    volatile bool_t ready;       /* Signals all CPUs synchronized. */
+    uint32_t cmd;                /* Action request: XSPLICE_ACTION_* */
+};
+
+/* There can be only one outstanding patching action. */
+static struct xsplice_work xsplice_work;
+
+/* Indicate whether the CPU needs to consult xsplice_work structure. */
+static DEFINE_PER_CPU(bool_t, work_to_do);
+
 static int verify_name(const xen_xsplice_name_t *name)
 {
     if ( !name->size || name->size > XEN_XSPLICE_NAME_SIZE )
@@ -296,6 +331,77 @@  static int secure_payload(struct payload *payload, struct xsplice_elf *elf)
     return rc;
 }
 
+static int check_special_sections(struct payload *payload,
+                                  struct xsplice_elf *elf)
+{
+    unsigned int i;
+    static const char *const names[] = { ".xsplice.funcs" };
+
+    for ( i = 0; i < ARRAY_SIZE(names); i++ )
+    {
+        struct xsplice_elf_sec *sec;
+
+        sec = xsplice_elf_sec_by_name(elf, names[i]);
+        if ( !sec )
+        {
+            printk(XENLOG_ERR "%s%s: %s is missing!\n",
+                   XSPLICE, elf->name, names[i]);
+            return -EINVAL;
+        }
+
+        if ( !sec->sec->sh_size )
+            return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int prepare_payload(struct payload *payload,
+                           struct xsplice_elf *elf)
+{
+    struct xsplice_elf_sec *sec;
+    unsigned int i;
+    struct xsplice_patch_func *f;
+
+    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
+    if ( sec )
+    {
+        if ( sec->sec->sh_size % sizeof *payload->funcs )
+        {
+            dprintk(XENLOG_DEBUG, "%s%s: Wrong size of .xsplice.funcs!\n",
+                    XSPLICE, elf->name);
+            return -EINVAL;
+        }
+
+        payload->funcs = (struct xsplice_patch_func *)sec->load_addr;
+        payload->nfuncs = sec->sec->sh_size / (sizeof *payload->funcs);
+    }
+
+    for ( i = 0; i < payload->nfuncs; i++ )
+    {
+        unsigned int j;
+
+        f = &(payload->funcs[i]);
+
+        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )
+        {
+            dprintk(XENLOG_DEBUG, "%s%s: Address or size fields are zero!\n",
+                    XSPLICE, elf->name);
+            return -EINVAL;
+        }
+
+        for ( j = 0; j < ARRAY_SIZE(f->undo); j++ )
+            if ( f->undo[j] )
+                return -EINVAL;
+
+        for ( j = 0; j < ARRAY_SIZE(f->pad); j++ )
+            if ( f->pad[j] )
+                return -EINVAL;
+    }
+
+    return 0;
+}
+
 /* We MUST be holding the payload_lock spinlock. */
 static void free_payload(struct payload *data)
 {
@@ -336,6 +442,14 @@  static int load_payload_data(struct payload *payload, void *raw, ssize_t len)
     if ( rc )
         goto out;
 
+    rc = check_special_sections(payload, &elf);
+    if ( rc )
+        goto out;
+
+    rc = prepare_payload(payload, &elf);
+    if ( rc )
+        goto out;
+
     rc = secure_payload(payload, &elf);
 
  out:
@@ -392,6 +506,7 @@  static int xsplice_upload(xen_sysctl_xsplice_upload_t *upload)
 
     data->state = XSPLICE_STATE_CHECKED;
     INIT_LIST_HEAD(&data->list);
+    INIT_LIST_HEAD(&data->applied_list);
 
     spin_lock_recursive(&payload_lock);
     list_add_tail(&data->list, &payload_list);
@@ -499,6 +614,321 @@  static int xsplice_list(xen_sysctl_xsplice_list_t *list)
     return rc ? : idx;
 }
 
+/*
+ * The following functions get the CPUs into an appropriate state and
+ * apply (or revert) each of the payload's functions. This is needed
+ * for the XEN_SYSCTL_XSPLICE_ACTION operation (see xsplice_action).
+ */
+
+static int apply_payload(struct payload *data)
+{
+    unsigned int i;
+
+    dprintk(XENLOG_DEBUG, "%s%s: Applying %u functions.\n", XSPLICE,
+            data->name, data->nfuncs);
+
+    arch_xsplice_patching_enter();
+
+    for ( i = 0; i < data->nfuncs; i++ )
+        arch_xsplice_apply_jmp(&data->funcs[i]);
+
+    arch_xsplice_patching_leave();
+
+    list_add_tail(&data->applied_list, &applied_list);
+
+    return 0;
+}
+
+/*
+ * This function is executed while all other CPUs have no stack (they may
+ * be in cpu_idle) and IRQs are disabled.
+ */
+static int revert_payload(struct payload *data)
+{
+    unsigned int i;
+
+    dprintk(XENLOG_DEBUG, "%s%s: Reverting.\n", XSPLICE, data->name);
+
+    arch_xsplice_patching_enter();
+
+    for ( i = 0; i < data->nfuncs; i++ )
+        arch_xsplice_revert_jmp(&data->funcs[i]);
+
+    arch_xsplice_patching_leave();
+
+    list_del_init(&data->applied_list);
+
+    return 0;
+}
+
+/*
+ * This function is executed while all other CPUs have no stack (they may
+ * be in cpu_idle) and IRQs are disabled. We guard against NMIs by temporarily
+ * installing our NOP NMI handler.
+ */
+static void xsplice_do_action(void)
+{
+    int rc;
+    struct payload *data, *other, *tmp;
+
+    data = xsplice_work.data;
+    /* Now this function should be the only one on any stack.
+     * No need to lock the payload list or applied list. */
+    switch ( xsplice_work.cmd )
+    {
+    case XSPLICE_ACTION_APPLY:
+        rc = apply_payload(data);
+        if ( rc == 0 )
+            data->state = XSPLICE_STATE_APPLIED;
+        break;
+
+    case XSPLICE_ACTION_REVERT:
+        rc = revert_payload(data);
+        if ( rc == 0 )
+            data->state = XSPLICE_STATE_CHECKED;
+        break;
+
+    case XSPLICE_ACTION_REPLACE:
+        rc = 0;
+        /* N.B.: Use the 'applied_list' member, not 'list'. */
+        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, applied_list )
+        {
+            other->rc = revert_payload(other);
+            if ( other->rc == 0 )
+                other->state = XSPLICE_STATE_CHECKED;
+            else
+            {
+                rc = -EINVAL;
+                break;
+            }
+        }
+
+        if ( rc != -EINVAL )
+        {
+            rc = apply_payload(data);
+            if ( rc == 0 )
+                data->state = XSPLICE_STATE_APPLIED;
+        }
+        break;
+
+    default:
+        rc = -EINVAL;
+        break;
+    }
+
+    data->rc = rc;
+}
+
+/*
+ * MUST be holding the payload_lock.
+ */
+static int schedule_work(struct payload *data, uint32_t cmd, uint32_t timeout)
+{
+    unsigned int cpu;
+
+    ASSERT(spin_is_locked(&payload_lock));
+
+    /* Fail if an operation is already scheduled. */
+    if ( xsplice_work.do_work )
+        return -EBUSY;
+
+    if ( !get_cpu_maps() )
+    {
+        printk(XENLOG_ERR "%s%s: unable to get cpu_maps lock!\n",
+               XSPLICE, data->name);
+        return -EBUSY;
+    }
+
+    xsplice_work.cmd = cmd;
+    xsplice_work.data = data;
+    xsplice_work.timeout = timeout ?: MILLISECS(30);
+
+    dprintk(XENLOG_DEBUG, "%s%s: timeout is %"PRI_stime"ms\n",
+            XSPLICE, data->name, xsplice_work.timeout / MILLISECS(1));
+
+    /*
+     * Once the patching has been completed, the semaphore value will
+     * be num_online_cpus()-1.
+     */
+    atomic_set(&xsplice_work.semaphore, -1);
+    atomic_set(&xsplice_work.irq_semaphore, -1);
+
+    xsplice_work.ready = 0;
+    smp_wmb();
+    xsplice_work.do_work = 1;
+    smp_wmb();
+    /*
+     * Above smp_wmb() gives us a compiler barrier, as we MUST do this
+     * after setting the global structure.
+     */
+    for_each_online_cpu ( cpu )
+        per_cpu(work_to_do, cpu) = 1;
+
+    put_cpu_maps();
+
+    return 0;
+}
+
+/*
+ * Note that because of this NOP callback, do_nmi is not safely patchable.
+ * Also, any 'real' NMIs received while it is installed are lost. Ditto for MCE.
+ */
+static int mask_nmi_callback(const struct cpu_user_regs *regs, int cpu)
+{
+    /* TODO: Handle missing NMI/MCE.*/
+    return 1;
+}
+
+static void reschedule_fn(void *unused)
+{
+    smp_mb(); /* Synchronize with setting do_work */
+    raise_softirq(SCHEDULE_SOFTIRQ);
+}
+
+static int xsplice_do_wait(atomic_t *counter, s_time_t timeout,
+                           unsigned int total_cpus, const char *s)
+{
+    int rc = 0;
+
+    while ( atomic_read(counter) != total_cpus && NOW() < timeout )
+        cpu_relax();
+
+    /* Log & abort. */
+    if ( atomic_read(counter) != total_cpus )
+    {
+        printk(XENLOG_ERR "%s%s: %s %u/%u\n", XSPLICE,
+               xsplice_work.data->name, s, atomic_read(counter), total_cpus);
+        rc = -EBUSY;
+        xsplice_work.data->rc = rc;
+        xsplice_work.do_work = 0;
+        smp_wmb();
+    }
+
+    return rc;
+}
+
+/*
+ * The main function which manages the work of quiescing the system and
+ * patching code.
+ */
+void check_for_xsplice_work(void)
+{
+    unsigned int cpu = smp_processor_id();
+    nmi_callback_t saved_nmi_callback;
+    s_time_t timeout;
+    unsigned long flags;
+
+    /* Fast path: no work to do. */
+    if ( !per_cpu(work_to_do, cpu) )
+        return;
+
+    /* In case we aborted, other CPUs can skip right away. */
+    if ( !xsplice_work.do_work )
+    {
+        per_cpu(work_to_do, cpu) = 0;
+        return;
+    }
+
+    ASSERT(local_irq_is_enabled());
+
+    /* Starts at -1, so it will go up to num_online_cpus() - 1. */
+    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
+    {
+        struct payload *p;
+        unsigned int total_cpus;
+
+        p = xsplice_work.data;
+        if ( !get_cpu_maps() )
+        {
+            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
+                   XSPLICE, p->name, cpu);
+            per_cpu(work_to_do, cpu) = 0;
+            xsplice_work.data->rc = -EBUSY;
+            xsplice_work.do_work = 0;
+            /*
+             * Do NOT decrement semaphore down - as that may cause the other
+             * CPU (which may be at this exact moment checking the ASSERT)
+             * to assume the role of master and then needlessly time
+             * out (as do_work is zero).
+             */
+            return;
+        }
+
+        barrier(); /* MUST do it after get_cpu_maps. */
+        total_cpus = num_online_cpus() - 1;
+
+        if ( total_cpus )
+        {
+            dprintk(XENLOG_DEBUG, "%s%s: CPU%u - IPIing the %u CPUs\n",
+                    XSPLICE, p->name, cpu, total_cpus);
+            smp_call_function(reschedule_fn, NULL, 0);
+        }
+
+        timeout = xsplice_work.timeout + NOW();
+        if ( xsplice_do_wait(&xsplice_work.semaphore, timeout, total_cpus,
+                             "Timed out on CPU semaphore") )
+            goto abort;
+
+        /* "Mask" NMIs. */
+        saved_nmi_callback = set_nmi_callback(mask_nmi_callback);
+
+        /* All CPUs are waiting, now signal to disable IRQs. */
+        xsplice_work.ready = 1;
+        smp_wmb();
+
+        atomic_inc(&xsplice_work.irq_semaphore);
+        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
+                              "Timed out on IRQ semaphore") )
+        {
+            local_irq_save(flags);
+            /* Do the patching. */
+            xsplice_do_action();
+            /* To flush out pipeline. */
+            arch_xsplice_post_action();
+            local_irq_restore(flags);
+        }
+        set_nmi_callback(saved_nmi_callback);
+
+ abort:
+        per_cpu(work_to_do, cpu) = 0;
+        xsplice_work.do_work = 0;
+
+        smp_wmb(); /* Synchronize with waiting CPUs. */
+        ASSERT(local_irq_is_enabled());
+
+        put_cpu_maps();
+
+        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
+               p->name, p->rc);
+    }
+    else
+    {
+        /* Wait for all CPUs to rendezvous. */
+        while ( xsplice_work.do_work && !xsplice_work.ready )
+        {
+            cpu_relax();
+            smp_rmb();
+        }
+
+        /* Disable IRQs and signal. */
+        local_irq_save(flags);
+        atomic_inc(&xsplice_work.irq_semaphore);
+
+        /* Wait for patching to complete. */
+        while ( xsplice_work.do_work )
+        {
+            cpu_relax();
+            smp_rmb();
+        }
+
+        /* To flush out pipeline. */
+        arch_xsplice_post_action();
+        local_irq_restore(flags);
+
+        per_cpu(work_to_do, cpu) = 0;
+    }
+}
+
 static int xsplice_action(xen_sysctl_xsplice_action_t *action)
 {
     struct payload *data;
@@ -542,27 +972,24 @@  static int xsplice_action(xen_sysctl_xsplice_action_t *action)
     case XSPLICE_ACTION_REVERT:
         if ( data->state == XSPLICE_STATE_APPLIED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_CHECKED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
     case XSPLICE_ACTION_APPLY:
         if ( data->state == XSPLICE_STATE_CHECKED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_APPLIED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
     case XSPLICE_ACTION_REPLACE:
         if ( data->state == XSPLICE_STATE_CHECKED )
         {
-            /* No implementation yet. */
-            data->state = XSPLICE_STATE_CHECKED;
-            data->rc = 0;
+            data->rc = -EAGAIN;
+            rc = schedule_work(data, action->cmd, action->timeout);
         }
         break;
 
@@ -627,6 +1054,7 @@  static const char *state2str(uint32_t state)
 static void xsplice_printall(unsigned char key)
 {
     struct payload *data;
+    unsigned int i;
 
     if ( !spin_trylock_recursive(&payload_lock) )
     {
@@ -635,15 +1063,30 @@  static void xsplice_printall(unsigned char key)
     }
 
     list_for_each_entry ( data, &payload_list, list )
+    {
         printk(" name=%s state=%s(%d) %p (.data=%p, .rodata=%p) using %zu pages.\n",
                data->name, state2str(data->state), data->state, data->text_addr,
                data->rw_addr, data->ro_addr, data->payload_pages);
 
+        for ( i = 0; i < data->nfuncs; i++ )
+        {
+            struct xsplice_patch_func *f = &(data->funcs[i]);
+            printk("    %s patch 0x%"PRIx64"(%u) with 0x%"PRIx64"(%u)\n",
+                   f->name, f->old_addr, f->old_size, f->new_addr, f->new_size);
+            if ( !(i % 100) )
+                process_pending_softirqs();
+        }
+    }
+
     spin_unlock_recursive(&payload_lock);
 }
 
 static int __init xsplice_init(void)
 {
+    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
+    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
+    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
+
     register_keyhandler('x', xsplice_printall, "print xsplicing info", 1);
 
     arch_xsplice_register_find_space(&find_hole);
diff --git a/xen/include/asm-arm/nmi.h b/xen/include/asm-arm/nmi.h
index a60587e..82aff35 100644
--- a/xen/include/asm-arm/nmi.h
+++ b/xen/include/asm-arm/nmi.h
@@ -4,6 +4,19 @@ 
 #define register_guest_nmi_callback(a)  (-ENOSYS)
 #define unregister_guest_nmi_callback() (-ENOSYS)
 
+typedef int (*nmi_callback_t)(const struct cpu_user_regs *regs, int cpu);
+
+/**
+ * set_nmi_callback
+ *
+ * Set a handler for an NMI. Only one handler may be
+ * set. Return the old nmi callback handler.
+ */
+static inline nmi_callback_t set_nmi_callback(nmi_callback_t callback)
+{
+    return NULL;
+}
+
 #endif /* ASM_NMI_H */
 /*
  * Local variables:
diff --git a/xen/include/xen/xsplice.h b/xen/include/xen/xsplice.h
index 74850ea..00b3cb2 100644
--- a/xen/include/xen/xsplice.h
+++ b/xen/include/xen/xsplice.h
@@ -11,12 +11,30 @@  struct xsplice_elf_sec;
 struct xsplice_elf_sym;
 struct xen_sysctl_xsplice_op;
 
+#include <xen/elfstructs.h>
+/*
+ * The structure which defines the patching. This is what the hypervisor
+ * expects in the '.xsplice.func' section of the ELF file.
+ *
+ * This MUST be in sync with what the tools generate.
+ */
+struct xsplice_patch_func {
+    const char *name;
+    uint64_t new_addr;
+    uint64_t old_addr;
+    uint32_t new_size;
+    uint32_t old_size;
+    uint8_t undo[8];
+    uint8_t pad[24];
+};
+
 #ifdef CONFIG_XSPLICE
 
 /* Convenience define for printk. */
 #define XSPLICE "xsplice: "
 
 int xsplice_op(struct xen_sysctl_xsplice_op *);
+void check_for_xsplice_work(void);
 
 /* Arch hooks. */
 int arch_xsplice_verify_elf(const struct xsplice_elf *elf, void *data);
@@ -61,6 +79,17 @@  void arch_xsplice_free_payload(void *va, unsigned int pages);
 typedef int (find_space_t)(size_t, unsigned long *, unsigned long *);
 void arch_xsplice_register_find_space(find_space_t *cb);
 
+/*
+ * These functions are called around the critical region that patches live code,
+ * allowing an architecture to make appropriate global state adjustments.
+ */
+void arch_xsplice_patching_enter(void);
+void arch_xsplice_patching_leave(void);
+
+void arch_xsplice_apply_jmp(struct xsplice_patch_func *func);
+void arch_xsplice_revert_jmp(struct xsplice_patch_func *func);
+void arch_xsplice_post_action(void);
+
 #else
 
 #include <xen/errno.h> /* For -EOPNOTSUPP */
@@ -68,7 +97,7 @@  static inline int xsplice_op(struct xen_sysctl_xsplice_op *op)
 {
     return -EOPNOTSUPP;
 }
-
+static inline void check_for_xsplice_work(void) { }
 #endif /* CONFIG_XSPLICE */
 
 #endif /* __XEN_XSPLICE_H__ */
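
The 5-byte patch written by arch_xsplice_apply_jmp above (an 0xe9 opcode followed by a 32-bit displacement relative to the end of the instruction) can be exercised in isolation. A hedged sketch, using a scratch buffer in place of real function addresses and the illustrative names `encode_jmp`/`revert_jmp`:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PATCH_INSN_SIZE 5  /* 0xe9 opcode + 32-bit displacement */

/* Encode "jmp rel32" at old_addr targeting new_addr, saving the old bytes. */
static void encode_jmp(uint8_t *old_addr, const uint8_t *new_addr,
                       uint8_t undo[PATCH_INSN_SIZE])
{
    /* The displacement is relative to the instruction following the jump. */
    int32_t disp = (int32_t)(new_addr - (old_addr + PATCH_INSN_SIZE));

    memcpy(undo, old_addr, PATCH_INSN_SIZE); /* saved for a later revert */
    old_addr[0] = 0xe9;
    memcpy(&old_addr[1], &disp, sizeof(disp));
}

/* Revert is a straight copy back, as in arch_xsplice_revert_jmp. */
static void revert_jmp(uint8_t *old_addr, const uint8_t undo[PATCH_INSN_SIZE])
{
    memcpy(old_addr, undo, PATCH_INSN_SIZE);
}
```

Real patching additionally requires CR0.WP to be cleared around the write (arch_xsplice_patching_enter/leave) and a serializing instruction such as cpuid afterwards (arch_xsplice_post_action), as the patch does.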