diff mbox series

[v3,28/47] xen/sched: add code to sync scheduling of all vcpus of a sched unit

Message ID 20190914085251.18816-29-jgross@suse.com (mailing list archive)
State Superseded
Series xen: add core scheduling support

Commit Message

Jürgen Groß Sept. 14, 2019, 8:52 a.m. UTC
When switching sched units, synchronize all vcpus of the new unit so
that they are scheduled at the same time.

A variable sched_granularity is added which holds the number of vcpus
per schedule unit.
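
One consequence of this variable, visible in the sched_move_domain() hunk of the patch, is that per-unit arrays are sized by rounding the vcpu count up to whole units. A minimal standalone sketch of that arithmetic follows; `model_nr_units()` is a hypothetical helper, and the macro is redefined here only so the example compiles on its own.

```c
#include <assert.h>

/* Usual round-up division; defined locally so the example stands alone. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: number of sched units needed for a domain with
 * max_vcpus vcpus when each unit holds 'granularity' vcpus.
 */
static unsigned int model_nr_units(unsigned int max_vcpus,
                                   unsigned int granularity)
{
    assert(granularity != 0);
    return DIV_ROUND_UP(max_vcpus, granularity);
}
```

With a granularity of 1 this degenerates to one unit per vcpu, which matches the array size used before the patch.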

As tasklets require the idle unit to be scheduled, the
tasklet_work_scheduled parameter of do_schedule() must be set to true
if any cpu covered by the current schedule() call has pending tasklet
work.

For joining the other vcpus of the schedule unit we need to add a new
softirq, SCHED_SLAVE_SOFTIRQ, in order to have a way to initiate a
context switch without calling the generic schedule() function to
select the vcpu to switch to, as we already know which vcpu we want to
run. This has the additional advantage of not losing any concurrent
SCHEDULE_SOFTIRQ events.
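
The joining step can be pictured as a countdown, as described in the comment above sched_wait_rendezvous_in() in the patch. The following is a simplified standalone model, not the Xen code: the `model_*` names are hypothetical, and the schedule lock that protects the counter in the patch is assumed to be held by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct sched_unit. */
struct model_unit {
    unsigned int rendezvous_in_cnt;  /* cpus that still have to join */
};

/* The cpu starting a scheduling round arms the counter. */
static void model_start_round(struct model_unit *u, unsigned int granularity)
{
    assert(granularity >= 1);
    u->rendezvous_in_cnt = granularity;
}

/*
 * Every cpu of the unit, including the one that started the round, joins.
 * Returns true for the cpu that brings the counter to zero; in the patch
 * that cpu is the one calling do_schedule().
 */
static bool model_join(struct model_unit *u)
{
    assert(u->rendezvous_in_cnt > 0);
    return --u->rendezvous_in_cnt == 0;
}
```

The cpus for which `model_join()` returns false correspond to the ones that wait for the decision and then switch to the already selected vcpu.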

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2:
- move syncing after context_switch() to schedule.c
V2:
- don't run tasklets directly from sched_wait_rendezvous_in()
V3:
- adapt array size in sched_move_domain() (Jan Beulich)
- int -> unsigned int (Jan Beulich)
---
 xen/arch/arm/domain.c      |   2 +-
 xen/arch/x86/domain.c      |   3 +-
 xen/common/schedule.c      | 345 +++++++++++++++++++++++++++++++++++----------
 xen/common/softirq.c       |   6 +-
 xen/include/xen/sched-if.h |   1 +
 xen/include/xen/sched.h    |  16 ++-
 xen/include/xen/softirq.h  |   1 +
 7 files changed, 287 insertions(+), 87 deletions(-)

Comments

Jan Beulich Sept. 20, 2019, 4:08 p.m. UTC | #1
On 14.09.2019 10:52, Juergen Gross wrote:
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -55,6 +55,9 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
>  int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
>  integer_param("sched_ratelimit_us", sched_ratelimit_us);
>  
> +/* Number of vcpus per struct sched_unit. */
> +static unsigned int __read_mostly sched_granularity = 1;

Didn't you indicate earlier that this would be a per-pool property?
Or was that just a longer term plan?

> +/*
> + * Rendezvous before taking a scheduling decision.
> + * Called with schedule lock held, so all accesses to the rendezvous counter
> + * can be normal ones (no atomic accesses needed).
> + * The counter is initialized to the number of cpus to rendezvous initially.
> + * Each cpu entering will decrement the counter. In case the counter becomes
> + * zero do_schedule() is called and the rendezvous counter for leaving
> + * context_switch() is set. All other members will wait until the counter is
> + * becoming zero, dropping the schedule lock in between.
> + */

This recurring lock/unlock is liable to cause a massive cache line
ping-pong, especially for socket or node scheduling. Instead of
just a cpu_relax() between the main unlock and re-lock, could there
perhaps be lock-less checks to determine whether there's any point
at all re-acquiring the lock?

> +static void schedule(void)
> +{
> +    struct vcpu          *vnext, *vprev = current;
> +    struct sched_unit    *prev = vprev->sched_unit, *next = NULL;
> +    s_time_t              now;
> +    struct sched_resource *sd;
> +    spinlock_t           *lock;
> +    int cpu = smp_processor_id();
> +
> +    ASSERT_NOT_IN_ATOMIC();
> +
> +    SCHED_STAT_CRANK(sched_run);
> +
> +    sd = get_sched_res(cpu);
> +
> +    lock = pcpu_schedule_lock_irq(cpu);
> +
> +    if ( prev->rendezvous_in_cnt )
> +    {
> +        /*
> +         * We have a race: sched_slave() should be called, so raise a softirq
> +         * in order to re-enter schedule() later and call sched_slave() now.
> +         */
> +        pcpu_schedule_unlock_irq(lock, cpu);
> +
> +        raise_softirq(SCHEDULE_SOFTIRQ);
> +        return sched_slave();
> +    }
> +
> +    now = NOW();
> +
> +    stop_timer(&sd->s_timer);

Is the order of these two relevant? A while ago there were a couple
of changes moving such NOW() invocations past anything that may take
non-negligible time, to make accounting as accurate as possible.

> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -4,6 +4,7 @@
>  /* Low-latency softirqs come first in the following list. */
>  enum {
>      TIMER_SOFTIRQ = 0,
> +    SCHED_SLAVE_SOFTIRQ,
>      SCHEDULE_SOFTIRQ,
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,

Seeing the comment, is the insertion you do as well as the pre-
existing placement of SCHEDULE_SOFTIRQ still appropriate with
the rendezvous-ing you introduce?

Jan

Jürgen Groß Sept. 24, 2019, 2:14 p.m. UTC | #2
On 20.09.19 18:08, Jan Beulich wrote:
> On 14.09.2019 10:52, Juergen Gross wrote:
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -55,6 +55,9 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
>>   int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
>>   integer_param("sched_ratelimit_us", sched_ratelimit_us);
>>   
>> +/* Number of vcpus per struct sched_unit. */
>> +static unsigned int __read_mostly sched_granularity = 1;
> 
> Didn't you indicate earlier that this would be a per-pool property?
> Or was that just a longer term plan?

That was planned for later.

> 
>> +/*
>> + * Rendezvous before taking a scheduling decision.
>> + * Called with schedule lock held, so all accesses to the rendezvous counter
>> + * can be normal ones (no atomic accesses needed).
>> + * The counter is initialized to the number of cpus to rendezvous initially.
>> + * Each cpu entering will decrement the counter. In case the counter becomes
>> + * zero do_schedule() is called and the rendezvous counter for leaving
>> + * context_switch() is set. All other members will wait until the counter is
>> + * becoming zero, dropping the schedule lock in between.
>> + */
> 
> This recurring lock/unlock is liable to cause a massive cache line
> ping-pong, especially for socket or node scheduling. Instead of
> just a cpu_relax() between the main unlock and re-lock, could there
> perhaps be lock-less checks to determine whether there's any point
> at all re-acquiring the lock?

Hmm, this is certainly an idea for improvement.

I will think about that and in case I can come up with something I'll
send either a followup patch or include it in the series, depending on
the complexity of the solution.

> 
>> +static void schedule(void)
>> +{
>> +    struct vcpu          *vnext, *vprev = current;
>> +    struct sched_unit    *prev = vprev->sched_unit, *next = NULL;
>> +    s_time_t              now;
>> +    struct sched_resource *sd;
>> +    spinlock_t           *lock;
>> +    int cpu = smp_processor_id();
>> +
>> +    ASSERT_NOT_IN_ATOMIC();
>> +
>> +    SCHED_STAT_CRANK(sched_run);
>> +
>> +    sd = get_sched_res(cpu);
>> +
>> +    lock = pcpu_schedule_lock_irq(cpu);
>> +
>> +    if ( prev->rendezvous_in_cnt )
>> +    {
>> +        /*
>> +         * We have a race: sched_slave() should be called, so raise a softirq
>> +         * in order to re-enter schedule() later and call sched_slave() now.
>> +         */
>> +        pcpu_schedule_unlock_irq(lock, cpu);
>> +
>> +        raise_softirq(SCHEDULE_SOFTIRQ);
>> +        return sched_slave();
>> +    }
>> +
>> +    now = NOW();
>> +
>> +    stop_timer(&sd->s_timer);
> 
> Is the order of these two relevant? A while ago there were a couple
> of changes moving such NOW() invocations past anything that may take
> non-negligible time, to make accounting as accurate as possible.

No, I don't think the order is relevant. I can swap them.

> 
>> --- a/xen/include/xen/softirq.h
>> +++ b/xen/include/xen/softirq.h
>> @@ -4,6 +4,7 @@
>>   /* Low-latency softirqs come first in the following list. */
>>   enum {
>>       TIMER_SOFTIRQ = 0,
>> +    SCHED_SLAVE_SOFTIRQ,
>>       SCHEDULE_SOFTIRQ,
>>       NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>>       RCU_SOFTIRQ,
> 
> Seeing the comment, is the insertion you do as well as the pre-
> existing placement of SCHEDULE_SOFTIRQ still appropriate with
> the rendezvous-ing you introduce?

Putting SCHED_SLAVE_SOFTIRQ before SCHEDULE_SOFTIRQ is done on purpose,
as I want slave events to have higher priority than normal schedule
events.

Whether both want to be at that place or should be moved is something
which should be considered carefully. Is it okay to postpone that
question?
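
To illustrate the priority argument, here is a small standalone model that assumes pending softirqs are handled lowest-numbered first, as the "Low-latency softirqs come first" comment implies. `model_next_softirq()` and `NR_MODEL_SOFTIRQS` are hypothetical names, and the enum lists only the first three entries from the patch.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TIMER_SOFTIRQ = 0,
    SCHED_SLAVE_SOFTIRQ,
    SCHEDULE_SOFTIRQ,
    NR_MODEL_SOFTIRQS
};

/*
 * Pick the lowest-numbered pending softirq, clear its bit and return its
 * number, or return -1 if nothing is pending.
 */
static int model_next_softirq(unsigned long *pending)
{
    assert(pending != NULL);

    for ( int i = 0; i < NR_MODEL_SOFTIRQS; i++ )
        if ( *pending & (1ul << i) )
        {
            *pending &= ~(1ul << i);
            return i;
        }

    return -1;
}
```

Under that assumption, a cpu with both events pending handles SCHED_SLAVE_SOFTIRQ before SCHEDULE_SOFTIRQ.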


Juergen

Jan Beulich Sept. 24, 2019, 2:39 p.m. UTC | #3
On 24.09.2019 16:14, Jürgen Groß wrote:
> On 20.09.19 18:08, Jan Beulich wrote:
>> On 14.09.2019 10:52, Juergen Gross wrote:
>>> --- a/xen/include/xen/softirq.h
>>> +++ b/xen/include/xen/softirq.h
>>> @@ -4,6 +4,7 @@
>>>   /* Low-latency softirqs come first in the following list. */
>>>   enum {
>>>       TIMER_SOFTIRQ = 0,
>>> +    SCHED_SLAVE_SOFTIRQ,
>>>       SCHEDULE_SOFTIRQ,
>>>       NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>>>       RCU_SOFTIRQ,
>>
>> Seeing the comment, is the insertion you do as well as the pre-
>> existing placement of SCHEDULE_SOFTIRQ still appropriate with
>> the rendezvous-ing you introduce?
> 
> Putting SCHED_SLAVE_SOFTIRQ before SCHEDULE_SOFTIRQ is done on purpose,
> as I want slave events to have higher priority than normal schedule
> events.
> 
> Whether both want to be at that place or should be moved is something
> which should be considered carefully. Is it okay to postpone that
> question?

Sure, it was just something that occurred to me when the comment
caught my attention.

Jan

Patch

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index a9c4113c26..c8efef4179 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -315,7 +315,7 @@  static void schedule_tail(struct vcpu *prev)
 
     local_irq_enable();
 
-    context_saved(prev);
+    sched_context_switched(prev, current);
 
     update_runstate_area(current);
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index dbdf6b1bc2..6f3132682d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1781,7 +1781,6 @@  static void __context_switch(void)
     per_cpu(curr_vcpu, cpu) = n;
 }
 
-
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
@@ -1857,7 +1856,7 @@  void context_switch(struct vcpu *prev, struct vcpu *next)
         }
     }
 
-    context_saved(prev);
+    sched_context_switched(prev, next);
 
     _update_runstate_area(next);
     /* Must be done with interrupts enabled */
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index aad396ee54..78b47acedf 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -55,6 +55,9 @@  boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
+/* Number of vcpus per struct sched_unit. */
+static unsigned int __read_mostly sched_granularity = 1;
+
 /* Common lock for free cpus. */
 static DEFINE_SPINLOCK(sched_free_cpu_lock);
 
@@ -520,8 +523,8 @@  int sched_move_domain(struct domain *d, struct cpupool *c)
     if ( IS_ERR(domdata) )
         return PTR_ERR(domdata);
 
-    /* TODO: fix array size with multiple vcpus per unit. */
-    unit_priv = xzalloc_array(void *, d->max_vcpus);
+    unit_priv = xzalloc_array(void *,
+                              DIV_ROUND_UP(d->max_vcpus, sched_granularity));
     if ( unit_priv == NULL )
     {
         sched_free_domdata(c->sched, domdata);
@@ -1707,133 +1710,319 @@  void vcpu_set_periodic_timer(struct vcpu *v, s_time_t value)
     spin_unlock(&v->periodic_timer_lock);
 }
 
-/*
- * The main function
- * - deschedule the current domain (scheduler independent).
- * - pick a new domain (scheduler dependent).
- */
-static void schedule(void)
+static void sched_switch_units(struct sched_resource *sd,
+                               struct sched_unit *next, struct sched_unit *prev,
+                               s_time_t now)
 {
-    struct sched_unit    *prev = current->sched_unit, *next = NULL;
-    s_time_t              now;
-    struct scheduler     *sched;
-    unsigned long        *tasklet_work = &this_cpu(tasklet_work_to_do);
-    bool                  tasklet_work_scheduled = false;
-    struct sched_resource *sd;
-    spinlock_t           *lock;
-    int cpu = smp_processor_id();
+    sd->curr = next;
 
-    ASSERT_NOT_IN_ATOMIC();
+    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
+             now - prev->state_entry_time);
+    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
+             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+             (now - next->state_entry_time) : 0, prev->next_time);
 
-    SCHED_STAT_CRANK(sched_run);
+    ASSERT(prev->vcpu_list->runstate.state == RUNSTATE_running);
 
-    sd = get_sched_res(cpu);
+    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+             next->domain->domain_id, next->unit_id);
+
+    sched_unit_runstate_change(prev, false, now);
+
+    ASSERT(next->vcpu_list->runstate.state != RUNSTATE_running);
+    sched_unit_runstate_change(next, true, now);
+
+    /*
+     * NB. Don't add any trace records from here until the actual context
+     * switch, else lost_records resume will not work properly.
+     */
+
+    ASSERT(!next->is_running);
+    next->vcpu_list->is_running = 1;
+    next->is_running = 1;
+    next->state_entry_time = now;
+}
+
+static bool sched_tasklet_check_cpu(unsigned int cpu)
+{
+    unsigned long *tasklet_work = &per_cpu(tasklet_work_to_do, cpu);
 
-    /* Update tasklet scheduling status. */
     switch ( *tasklet_work )
     {
     case TASKLET_enqueued:
         set_bit(_TASKLET_scheduled, tasklet_work);
         /* fallthrough */
     case TASKLET_enqueued|TASKLET_scheduled:
-        tasklet_work_scheduled = true;
+        return true;
         break;
     case TASKLET_scheduled:
         clear_bit(_TASKLET_scheduled, tasklet_work);
+        /* fallthrough */
     case 0:
-        /*tasklet_work_scheduled = false;*/
+        /* return false; */
         break;
     default:
         BUG();
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
+    return false;
+}
 
-    now = NOW();
+static bool sched_tasklet_check(unsigned int cpu)
+{
+    bool tasklet_work_scheduled = false;
+    const cpumask_t *mask = get_sched_res(cpu)->cpus;
+    unsigned int cpu_iter;
 
-    stop_timer(&sd->s_timer);
+    for_each_cpu ( cpu_iter, mask )
+        if ( sched_tasklet_check_cpu(cpu_iter) )
+            tasklet_work_scheduled = true;
+
+    return tasklet_work_scheduled;
+}
+
+static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
+                                      unsigned int cpu)
+{
+    struct scheduler *sched = per_cpu(scheduler, cpu);
+    struct sched_resource *sd = get_sched_res(cpu);
+    struct sched_unit *next;
 
     /* get policy-specific decision on scheduling... */
-    sched = this_cpu(scheduler);
-    sched->do_schedule(sched, prev, now, tasklet_work_scheduled);
+    sched->do_schedule(sched, prev, now, sched_tasklet_check(cpu));
 
     next = prev->next_task;
 
-    sd->curr = next;
-
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( unlikely(prev == next) )
+    if ( likely(prev != next) )
+        sched_switch_units(sd, next, prev, now);
+
+    return next;
+}
+
+static void context_saved(struct vcpu *prev)
+{
+    struct sched_unit *unit = prev->sched_unit;
+
+    /* Clear running flag /after/ writing context to memory. */
+    smp_wmb();
+
+    prev->is_running = 0;
+    unit->is_running = 0;
+    unit->state_entry_time = NOW();
+
+    /* Check for migration request /after/ clearing running flag. */
+    smp_mb();
+
+    sched_context_saved(vcpu_scheduler(prev), unit);
+
+    sched_unit_migrate_finish(unit);
+}
+
+/*
+ * Rendezvous on end of context switch.
+ * As no lock is protecting this rendezvous function we need to use atomic
+ * access functions on the counter.
+ * The counter will be 0 in case no rendezvous is needed. For the rendezvous
+ * case it is initialised to the number of cpus to rendezvous plus 1. Each
+ * member entering decrements the counter. The last one will decrement it to
+ * 1 and perform the final needed action in that case (call of context_saved()
+ * if vcpu was switched), and then set the counter to zero. The other members
+ * will wait until the counter becomes zero until they proceed.
+ */
+void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
+{
+    struct sched_unit *next = vnext->sched_unit;
+
+    if ( atomic_read(&next->rendezvous_out_cnt) )
+    {
+        int cnt = atomic_dec_return(&next->rendezvous_out_cnt);
+
+        /* Call context_saved() before releasing other waiters. */
+        if ( cnt == 1 )
+        {
+            if ( vprev != vnext )
+                context_saved(vprev);
+            atomic_set(&next->rendezvous_out_cnt, 0);
+        }
+        else
+            while ( atomic_read(&next->rendezvous_out_cnt) )
+                cpu_relax();
+    }
+    else if ( vprev != vnext )
+        context_saved(vprev);
+}
+
+static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
+                                 s_time_t now)
+{
+    if ( unlikely(vprev == vnext) )
     {
-        pcpu_schedule_unlock_irq(lock, cpu);
         TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
-                 next->domain->domain_id, next->unit_id,
-                 now - prev->state_entry_time,
-                 prev->next_time);
-        trace_continue_running(next->vcpu_list);
-        return continue_running(prev->vcpu_list);
+                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
+                 now - vprev->runstate.state_entry_time,
+                 vprev->sched_unit->next_time);
+        sched_context_switched(vprev, vnext);
+        trace_continue_running(vnext);
+        return continue_running(vprev);
     }
 
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV,
-             prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT,
-             next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0,
-             prev->next_time);
+    SCHED_STAT_CRANK(sched_ctx);
 
-    ASSERT(prev->vcpu_list->runstate.state == RUNSTATE_running);
+    stop_timer(&vprev->periodic_timer);
 
-    TRACE_4D(TRC_SCHED_SWITCH,
-             prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( vnext->sched_unit->migrated )
+        vcpu_move_irqs(vnext);
 
-    sched_unit_runstate_change(prev, false, now);
+    vcpu_periodic_timer_work(vnext);
 
-    ASSERT(next->vcpu_list->runstate.state != RUNSTATE_running);
-    sched_unit_runstate_change(next, true, now);
+    context_switch(vprev, vnext);
+}
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+/*
+ * Rendezvous before taking a scheduling decision.
+ * Called with schedule lock held, so all accesses to the rendezvous counter
+ * can be normal ones (no atomic accesses needed).
+ * The counter is initialized to the number of cpus to rendezvous initially.
+ * Each cpu entering will decrement the counter. In case the counter becomes
+ * zero do_schedule() is called and the rendezvous counter for leaving
+ * context_switch() is set. All other members will wait until the counter is
+ * becoming zero, dropping the schedule lock in between.
+ */
+static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
+                                                   spinlock_t **lock, int cpu,
+                                                   s_time_t now)
+{
+    struct sched_unit *next;
 
-    ASSERT(!next->is_running);
-    next->vcpu_list->is_running = 1;
-    next->is_running = 1;
-    next->state_entry_time = now;
+    if ( !--prev->rendezvous_in_cnt )
+    {
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        return next;
+    }
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    while ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * Coming from idle might need to do tasklet work.
+         * In order to avoid deadlocks we can't do that here, but have to
+         * continue the idle loop.
+         * Undo the rendezvous_in_cnt decrement and schedule another call of
+         * sched_slave().
+         */
+        if ( is_idle_unit(prev) && sched_tasklet_check_cpu(cpu) )
+        {
+            struct vcpu *vprev = current;
 
-    SCHED_STAT_CRANK(sched_ctx);
+            prev->rendezvous_in_cnt++;
+            atomic_set(&prev->rendezvous_out_cnt, 0);
 
-    stop_timer(&prev->vcpu_list->periodic_timer);
+            pcpu_schedule_unlock_irq(*lock, cpu);
 
-    if ( next->migrated )
-        vcpu_move_irqs(next->vcpu_list);
+            raise_softirq(SCHED_SLAVE_SOFTIRQ);
+            sched_context_switch(vprev, vprev, now);
+        }
 
-    vcpu_periodic_timer_work(next->vcpu_list);
+        pcpu_schedule_unlock_irq(*lock, cpu);
 
-    context_switch(prev->vcpu_list, next->vcpu_list);
+        cpu_relax();
+
+        *lock = pcpu_schedule_lock_irq(cpu);
+    }
+
+    return prev->next_task;
 }
 
-void context_saved(struct vcpu *prev)
+static void sched_slave(void)
 {
-    /* Clear running flag /after/ writing context to memory. */
-    smp_wmb();
+    struct vcpu          *vprev = current;
+    struct sched_unit    *prev = vprev->sched_unit, *next;
+    s_time_t              now;
+    spinlock_t           *lock;
+    unsigned int          cpu = smp_processor_id();
 
-    prev->is_running = 0;
-    prev->sched_unit->is_running = 0;
-    prev->sched_unit->state_entry_time = NOW();
+    ASSERT_NOT_IN_ATOMIC();
 
-    /* Check for migration request /after/ clearing running flag. */
-    smp_mb();
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    now = NOW();
 
-    sched_context_saved(vcpu_scheduler(prev), prev->sched_unit);
+    if ( !prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
+        return;
+    }
+
+    stop_timer(&get_sched_res(cpu)->s_timer);
+
+    next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
+
+    pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_unit_migrate_finish(prev->sched_unit);
+    sched_context_switch(vprev, next->vcpu_list, now);
+}
+
+/*
+ * The main function
+ * - deschedule the current domain (scheduler independent).
+ * - pick a new domain (scheduler dependent).
+ */
+static void schedule(void)
+{
+    struct vcpu          *vnext, *vprev = current;
+    struct sched_unit    *prev = vprev->sched_unit, *next = NULL;
+    s_time_t              now;
+    struct sched_resource *sd;
+    spinlock_t           *lock;
+    int cpu = smp_processor_id();
+
+    ASSERT_NOT_IN_ATOMIC();
+
+    SCHED_STAT_CRANK(sched_run);
+
+    sd = get_sched_res(cpu);
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    if ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * We have a race: sched_slave() should be called, so raise a softirq
+         * in order to re-enter schedule() later and call sched_slave() now.
+         */
+        pcpu_schedule_unlock_irq(lock, cpu);
+
+        raise_softirq(SCHEDULE_SOFTIRQ);
+        return sched_slave();
+    }
+
+    now = NOW();
+
+    stop_timer(&sd->s_timer);
+
+    if ( sched_granularity > 1 )
+    {
+        cpumask_t mask;
+
+        prev->rendezvous_in_cnt = sched_granularity;
+        cpumask_andnot(&mask, sd->cpus, cpumask_of(cpu));
+        cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
+        next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
+    }
+    else
+    {
+        prev->rendezvous_in_cnt = 0;
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, 0);
+    }
+
+    pcpu_schedule_unlock_irq(lock, cpu);
+
+    vnext = next->vcpu_list;
+    sched_context_switch(vprev, vnext, now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1874,6 +2063,7 @@  static int cpu_schedule_up(unsigned int cpu)
     if ( sd == NULL )
         return -ENOMEM;
     sd->master_cpu = cpu;
+    sd->cpus = cpumask_of(cpu);
     set_sched_res(cpu, sd);
 
     per_cpu(scheduler, cpu) = &sched_idle_ops;
@@ -1894,6 +2084,8 @@  static int cpu_schedule_up(unsigned int cpu)
     if ( idle_vcpu[cpu] == NULL )
         return -ENOMEM;
 
+    idle_vcpu[cpu]->sched_unit->rendezvous_in_cnt = 0;
+
     /*
      * No need to allocate any scheduler data, as cpus coming online are
      * free initially and the idle scheduler doesn't need any data areas
@@ -1994,6 +2186,7 @@  void __init scheduler_init(void)
     int i;
 
     open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 83c3c09bd5..2d66193203 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -33,8 +33,8 @@  static void __do_softirq(unsigned long ignore_mask)
     for ( ; ; )
     {
         /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
+         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ or
+         * SCHED_SLAVE_SOFTIRQ may move us to another processor.
          */
         cpu = smp_processor_id();
 
@@ -55,7 +55,7 @@  void process_pending_softirqs(void)
 {
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq(1ul<<SCHEDULE_SOFTIRQ);
+    __do_softirq((1ul << SCHEDULE_SOFTIRQ) | (1ul << SCHED_SLAVE_SOFTIRQ));
 }
 
 void do_softirq(void)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 4e89a1e640..4872570612 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -42,6 +42,7 @@  struct sched_resource {
 
     /* Cpu with lowest id in scheduling resource. */
     unsigned int        master_cpu;
+    const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
 #define curr_on_cpu(c)    (get_sched_res(c)->curr)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index a8164d873c..c0e4dc2dc3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -292,6 +292,12 @@  struct sched_unit {
     /* Next unit to run. */
     struct sched_unit      *next_task;
     s_time_t                next_time;
+
+    /* Number of vcpus not yet joined for context switch. */
+    unsigned int            rendezvous_in_cnt;
+
+    /* Number of vcpus not yet finished with context switch. */
+    atomic_t                rendezvous_out_cnt;
 };
 
 #define for_each_sched_unit(d, e)                                         \
@@ -694,10 +700,10 @@  void sync_local_execstate(void);
 
 /*
  * Called by the scheduler to switch to another VCPU. This function must
- * call context_saved(@prev) when the local CPU is no longer running in
- * @prev's context, and that context is saved to memory. Alternatively, if
- * implementing lazy context switching, it suffices to ensure that invoking
- * sync_vcpu_execstate() will switch and commit @prev's state.
+ * call sched_context_switched(@prev, @next) when the local CPU is no longer
+ * running in @prev's context, and that context is saved to memory.
+ * Alternatively, if implementing lazy context switching, it suffices to ensure
+ * that invoking sync_vcpu_execstate() will switch and commit @prev's state.
  */
 void context_switch(
     struct vcpu *prev,
@@ -709,7 +715,7 @@  void context_switch(
  * saved to memory. Alternatively, if implementing lazy context switching,
  * ensure that invoking sync_vcpu_execstate() will switch and commit @prev.
  */
-void context_saved(struct vcpu *prev);
+void sched_context_switched(struct vcpu *prev, struct vcpu *vnext);
 
 /* Called by the scheduler to continue running the current VCPU. */
 void continue_running(
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c327c9b6cd..d7273b389b 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -4,6 +4,7 @@ 
 /* Low-latency softirqs come first in the following list. */
 enum {
     TIMER_SOFTIRQ = 0,
+    SCHED_SLAVE_SOFTIRQ,
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,