[35/60] xen/sched: add code to sync scheduling of all vcpus of a sched unit

Message ID	20190528103313.1343-36-jgross@suse.com (mailing list archive)
State	New, archived
Headers
Return-Path: <xen-devel-bounces@lists.xenproject.org>
From: Juergen Gross <jgross@suse.com>
To: xen-devel@lists.xenproject.org
Date: Tue, 28 May 2019 12:32:48 +0200
Message-Id: <20190528103313.1343-36-jgross@suse.com>
In-Reply-To: <20190528103313.1343-1-jgross@suse.com>
References: <20190528103313.1343-1-jgross@suse.com>
Subject: [Xen-devel] [PATCH 35/60] xen/sched: add code to sync scheduling of all vcpus of a sched unit
Precedence: list
Cc: Juergen Gross <jgross@suse.com>, Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wl@xen.org>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, George Dunlap <George.Dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>, Julien Grall <julien.grall@arm.com>, Jan Beulich <jbeulich@suse.com>, Dario Faggioli <dfaggioli@suse.com>, Roger Pau Monné <roger.pau@citrix.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: xen-devel-bounces@lists.xenproject.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	xen: add core scheduling support

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index ff330b35e6..9140a79c60 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -313,7 +313,7 @@ static void schedule_tail(struct vcpu *prev)
 
     local_irq_enable();
 
-    context_saved(prev);
+    sched_context_switched(prev, current);
 
     update_runstate_area(current);
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ac960ddd40..d3ee699da6 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1721,7 +1721,6 @@ static void __context_switch(void)
     per_cpu(curr_vcpu, cpu) = n;
 }
 
-
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
@@ -1797,7 +1796,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         }
     }
 
-    context_saved(prev);
+    sched_context_switched(prev, next);
 
     _update_runstate_area(next);
     /* Must be done with interrupts enabled */
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index c23a5629de..9de315ada1 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -54,6 +54,10 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
  * */
 int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
+/* Number of vcpus per struct sched_unit. */
+static unsigned int sched_granularity = 1;
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
@@ -1574,134 +1578,304 @@ static void vcpu_periodic_timer_work(struct vcpu *v)
     set_timer(&v->periodic_timer, periodic_next_event);
 }
 
-/*
- * The main function
- * - deschedule the current domain (scheduler independent).
- * - pick a new domain (scheduler dependent).
- */
-static void schedule(void)
+static void sched_switch_units(struct sched_resource *sd,
+                               struct sched_unit *next, struct sched_unit *prev,
+                               s_time_t now)
 {
-    struct sched_unit *prev = current->sched_unit, *next = NULL;
-    s_time_t now;
-    struct scheduler *sched;
-    unsigned long *tasklet_work = &this_cpu(tasklet_work_to_do);
-    bool tasklet_work_scheduled = false;
-    struct sched_resource *sd;
-    spinlock_t *lock;
-    int cpu = smp_processor_id();
+    sd->curr = next;
 
-    ASSERT_NOT_IN_ATOMIC();
+    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
+             now - prev->state_entry_time);
+    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
+             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+             (now - next->state_entry_time) : 0, prev->next_time);
 
-    SCHED_STAT_CRANK(sched_run);
+    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
 
-    sd = get_sched_res(cpu);
+    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+             next->domain->domain_id, next->unit_id);
+
+    sched_unit_runstate_change(prev, false, now);
+    prev->last_run_time = now;
+
+    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
+    sched_unit_runstate_change(next, true, now);
+
+    /*
+     * NB. Don't add any trace records from here until the actual context
+     * switch, else lost_records resume will not work properly.
+     */
+
+    ASSERT(!next->is_running);
+    next->vcpu->is_running = 1;
+    next->is_running = 1;
+}
+
+static bool sched_tasklet_check_cpu(unsigned int cpu)
+{
+    unsigned long *tasklet_work = &per_cpu(tasklet_work_to_do, cpu);
 
-    /* Update tasklet scheduling status. */
     switch ( *tasklet_work )
     {
     case TASKLET_enqueued:
         set_bit(_TASKLET_scheduled, tasklet_work);
         /* fallthrough */
     case TASKLET_enqueued|TASKLET_scheduled:
-        tasklet_work_scheduled = true;
+        return true;
         break;
     case TASKLET_scheduled:
         clear_bit(_TASKLET_scheduled, tasklet_work);
+        /* fallthrough */
     case 0:
-        /*tasklet_work_scheduled = false;*/
+        /* return false; */
         break;
     default:
         BUG();
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
+    return false;
+}
 
-    now = NOW();
+static bool sched_tasklet_check(unsigned int cpu)
+{
+    bool tasklet_work_scheduled = false;
+    const cpumask_t *mask = get_sched_res(cpu)->cpus;
+    int cpu_iter;
 
-    stop_timer(&sd->s_timer);
+    for_each_cpu ( cpu_iter, mask )
+        if ( sched_tasklet_check_cpu(cpu_iter) )
+            tasklet_work_scheduled = true;
+
+    return tasklet_work_scheduled;
+}
+
+static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
+                                      unsigned int cpu)
+{
+    struct scheduler *sched = per_cpu(scheduler, cpu);
+    struct sched_resource *sd = get_sched_res(cpu);
+    struct sched_unit *next;
 
     /* get policy-specific decision on scheduling... */
-    sched = this_cpu(scheduler);
-    sched->do_schedule(sched, prev, now, tasklet_work_scheduled);
+    sched->do_schedule(sched, prev, now, sched_tasklet_check(cpu));
 
     next = prev->next_task;
 
-    sd->curr = next;
-
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( unlikely(prev == next) )
+    if ( likely(prev != next) )
+        sched_switch_units(sd, next, prev, now);
+
+    return next;
+}
+
+static void context_saved(struct vcpu *prev)
+{
+    struct sched_unit *unit = prev->sched_unit;
+
+    unit->is_running = 0;
+    unit->state_entry_time = NOW();
+
+    /* Check for migration request /after/ clearing running flag. */
+    smp_mb();
+
+    sched_context_saved(vcpu_scheduler(prev), unit);
+
+    sched_unit_migrate_finish(unit);
+}
+
+/*
+ * Rendezvous on end of context switch.
+ * As no lock is protecting this rendezvous function we need to use atomic
+ * access functions on the counter.
+ * The counter will be 0 in case no rendezvous is needed. For the rendezvous
+ * case it is initialised to the number of cpus to rendezvous plus 1. Each
+ * member entering decrements the counter. The last one will decrement it to
+ * 1 and perform the final needed action in that case (call of context_saved()
+ * if vcpu was switched), and then set the counter to zero. The other members
+ * will wait until the counter becomes zero until they proceed.
+ */
+void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
+{
+    struct sched_unit *next = vnext->sched_unit;
+
+    /* Clear running flag /after/ writing context to memory. */
+    smp_wmb();
+
+    vprev->is_running = 0;
+
+    if ( atomic_read(&next->rendezvous_out_cnt) )
+    {
+        int cnt = atomic_dec_return(&next->rendezvous_out_cnt);
+
+        /* Call context_saved() before releasing other waiters. */
+        if ( cnt == 1 )
+        {
+            if ( vprev != vnext )
+                context_saved(vprev);
+            atomic_set(&next->rendezvous_out_cnt, 0);
+        }
+        else
+            while ( atomic_read(&next->rendezvous_out_cnt) )
+                cpu_relax();
+    }
+    else if ( vprev != vnext )
+        context_saved(vprev);
+}
+
+static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
+                                 s_time_t now)
+{
+    if ( unlikely(vprev == vnext) )
     {
-        pcpu_schedule_unlock_irq(lock, cpu);
         TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
-                 next->domain->domain_id, next->unit_id,
-                 now - prev->state_entry_time,
-                 prev->next_time);
-        trace_continue_running(next->vcpu);
-        return continue_running(prev->vcpu);
+                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
+                 now - vprev->runstate.state_entry_time,
+                 vprev->sched_unit->next_time);
+        sched_context_switched(vprev, vnext);
+        trace_continue_running(vnext);
+        return continue_running(vprev);
     }
 
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV,
-             prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT,
-             next->domain->domain_id, next->unit_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0,
-             prev->next_time);
+    SCHED_STAT_CRANK(sched_ctx);
 
-    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
+    stop_timer(&vprev->periodic_timer);
 
-    TRACE_4D(TRC_SCHED_SWITCH,
-             prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( vnext->sched_unit->migrated )
+        vcpu_move_irqs(vnext);
 
-    sched_unit_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+    vcpu_periodic_timer_work(vnext);
 
-    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
-    sched_unit_runstate_change(next, true, now);
+    context_switch(vprev, vnext);
+}
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+/*
+ * Rendezvous before taking a scheduling decision.
+ * Called with schedule lock held, so all accesses to the rendezvous counter
+ * can be normal ones (no atomic accesses needed).
+ * The counter is initialized to the number of cpus to rendezvous initially.
+ * Each cpu entering will decrement the counter. In case the counter becomes
+ * zero do_schedule() is called and the rendezvous counter for leaving
+ * context_switch() is set. All other members will wait until the counter is
+ * becoming zero, dropping the schedule lock in between.
+ */
+static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
+                                                   spinlock_t *lock, int cpu,
+                                                   s_time_t now)
+{
+    struct sched_unit *next;
 
-    ASSERT(!next->is_running);
-    next->vcpu->is_running = 1;
-    next->is_running = 1;
-    next->state_entry_time = now;
+    if ( !--prev->rendezvous_in_cnt )
+    {
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        return next;
+    }
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    while ( prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
 
-    SCHED_STAT_CRANK(sched_ctx);
+        /* Coming from idle might need to do tasklet work. */
+        if ( is_idle_unit(prev) && sched_tasklet_check_cpu(cpu) )
+            do_tasklet();
+        else
+            cpu_relax();
+
+        pcpu_schedule_lock_irq(cpu);
+    }
+
+    return prev->next_task;
+}
+
+static void sched_slave(void)
+{
+    struct vcpu *vprev = current;
+    struct sched_unit *prev = vprev->sched_unit, *next;
+    s_time_t now;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
 
-    stop_timer(&prev->vcpu->periodic_timer);
+    ASSERT_NOT_IN_ATOMIC();
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    now = NOW();
+
+    if ( !prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
+        return;
+    }
 
-    if ( next->migrated )
-        vcpu_move_irqs(next->vcpu);
+    stop_timer(&get_sched_res(cpu)->s_timer);
 
-    vcpu_periodic_timer_work(next->vcpu);
+    next = sched_wait_rendezvous_in(prev, lock, cpu, now);
 
-    context_switch(prev->vcpu, next->vcpu);
+    pcpu_schedule_unlock_irq(lock, cpu);
+
+    sched_context_switch(vprev, next->vcpu, now);
 }
 
-void context_saved(struct vcpu *prev)
+/*
+ * The main function
+ * - deschedule the current domain (scheduler independent).
+ * - pick a new domain (scheduler dependent).
+ */
+static void schedule(void)
 {
-    /* Clear running flag /after/ writing context to memory. */
-    smp_wmb();
+    struct vcpu *vnext, *vprev = current;
+    struct sched_unit *prev = vprev->sched_unit, *next = NULL;
+    s_time_t now;
+    struct sched_resource *sd;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
 
-    prev->is_running = 0;
-    prev->sched_unit->is_running = 0;
-    prev->sched_unit->state_entry_time = NOW();
+    ASSERT_NOT_IN_ATOMIC();
 
-    /* Check for migration request /after/ clearing running flag. */
-    smp_mb();
+    SCHED_STAT_CRANK(sched_run);
+
+    sd = get_sched_res(cpu);
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    if ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * We have a race: sched_slave() should be called, so raise a softirq
+         * in order to re-enter schedule() later and call sched_slave() now.
+         */
+        pcpu_schedule_unlock_irq(lock, cpu);
+
+        raise_softirq(SCHEDULE_SOFTIRQ);
+        return sched_slave();
+    }
+
+    now = NOW();
 
-    sched_context_saved(vcpu_scheduler(prev), prev->sched_unit);
+    stop_timer(&sd->s_timer);
+
+    if ( sched_granularity > 1 )
+    {
+        cpumask_t mask;
+
+        prev->rendezvous_in_cnt = sched_granularity;
+        cpumask_andnot(&mask, sd->cpus, cpumask_of(cpu));
+        cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
+        next = sched_wait_rendezvous_in(prev, lock, cpu, now);
+    }
+    else
+    {
+        prev->rendezvous_in_cnt = 0;
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, 0);
+    }
+
+    pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_unit_migrate_finish(prev->sched_unit);
+    vnext = next->vcpu;
+    sched_context_switch(vprev, vnext, now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1743,6 +1917,7 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( sd == NULL )
         return -ENOMEM;
     sd->processor = cpu;
+    sd->cpus = cpumask_of(cpu);
     set_sched_res(cpu, sd);
 
     per_cpu(scheduler, cpu) = &ops;
@@ -1905,6 +2080,7 @@ void __init scheduler_init(void)
     int i;
 
     open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 83c3c09bd5..2d66193203 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -33,8 +33,8 @@ static void __do_softirq(unsigned long ignore_mask)
     for ( ; ; )
     {
         /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
+         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ or
+         * SCHED_SLAVE_SOFTIRQ may move us to another processor.
          */
         cpu = smp_processor_id();
 
@@ -55,7 +55,7 @@ void process_pending_softirqs(void)
 {
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq(1ul<<SCHEDULE_SOFTIRQ);
+    __do_softirq((1ul << SCHEDULE_SOFTIRQ) | (1ul << SCHED_SLAVE_SOFTIRQ));
 }
 
 void do_softirq(void)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 7163ee869b..e66d866f41 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -41,6 +41,7 @@ struct sched_resource {
     struct timer s_timer; /* scheduling timer */
     atomic_t urgent_count; /* how many urgent vcpus */
     unsigned int processor;
+    const cpumask_t *cpus; /* cpus covered by this struct */
 };
 
 #define curr_on_cpu(c) (get_sched_res(c)->curr)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index d0048711cf..9ff9cb148d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -295,6 +295,12 @@ struct sched_unit {
     /* Next unit to run. */
     struct sched_unit *next_task;
     s_time_t next_time;
+
+    /* Number of vcpus not yet joined for context switch. */
+    unsigned int rendezvous_in_cnt;
+
+    /* Number of vcpus not yet finished with context switch. */
+    atomic_t rendezvous_out_cnt;
 };
 
 #define for_each_sched_unit(d, e) \
@@ -695,10 +701,10 @@ void sync_local_execstate(void);
 
 /*
  * Called by the scheduler to switch to another VCPU. This function must
- * call context_saved(@prev) when the local CPU is no longer running in
- * @prev's context, and that context is saved to memory. Alternatively, if
- * implementing lazy context switching, it suffices to ensure that invoking
- * sync_vcpu_execstate() will switch and commit @prev's state.
+ * call sched_context_switched(@prev, @next) when the local CPU is no longer
+ * running in @prev's context, and that context is saved to memory.
+ * Alternatively, if implementing lazy context switching, it suffices to ensure
+ * that invoking sync_vcpu_execstate() will switch and commit @prev's state.
  */
 void context_switch(
     struct vcpu *prev,
@@ -710,7 +716,7 @@ void context_switch(
  * saved to memory. Alternatively, if implementing lazy context switching,
  * ensure that invoking sync_vcpu_execstate() will switch and commit @prev.
  */
-void context_saved(struct vcpu *prev);
+void sched_context_switched(struct vcpu *prev, struct vcpu *vnext);
 
 /* Called by the scheduler to continue running the current VCPU. */
 void continue_running(
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c327c9b6cd..d7273b389b 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -4,6 +4,7 @@
 /* Low-latency softirqs come first in the following list. */
 enum {
     TIMER_SOFTIRQ = 0,
+    SCHED_SLAVE_SOFTIRQ,
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,

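Note on the tasklet check: sched_tasklet_check_cpu() in the patch is a small state machine over the per-cpu tasklet_work_to_do word, and sched_tasklet_check() runs it for every cpu of the scheduling resource. A single-threaded model of those transitions, as a sketch only: the bit values below are assumed for the illustration (the patch only shows the names), plain bit operations replace set_bit()/clear_bit(), and tasklet_check_model() and tasklet_check_all() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout; only the names appear in the patch. */
#define _TASKLET_enqueued  0
#define _TASKLET_scheduled 1
#define TASKLET_enqueued   (1ul << _TASKLET_enqueued)
#define TASKLET_scheduled  (1ul << _TASKLET_scheduled)

/* Model of sched_tasklet_check_cpu(): updates *work, reports pending work. */
static bool tasklet_check_model(unsigned long *work)
{
    switch ( *work )
    {
    case TASKLET_enqueued:
        *work |= TASKLET_scheduled;
        /* fallthrough */
    case TASKLET_enqueued | TASKLET_scheduled:
        return true;
    case TASKLET_scheduled:
        *work &= ~TASKLET_scheduled;
        /* fallthrough */
    case 0:
        return false;
    default:
        assert(!"unexpected tasklet state");
        return false;
    }
}

/*
 * Model of sched_tasklet_check(): every cpu is visited even after a hit,
 * because the per-cpu check also updates that cpu's state word.
 */
static bool tasklet_check_all(unsigned long *work, unsigned int ncpus)
{
    bool scheduled = false;
    unsigned int i;

    for ( i = 0; i < ncpus; i++ )
        if ( tasklet_check_model(&work[i]) )
            scheduled = true;

    return scheduled;
}
```

The loop deliberately avoids an early return: stopping at the first cpu with pending work would leave a stale scheduled bit on later cpus that have nothing enqueued any more.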