
[v6,RFC] xen: sched: convert RTDS from time to event driven model

Message ID 1456430736-4606-1-git-send-email-tiche@seas.upenn.edu (mailing list archive)
State New, archived

Commit Message

Tianyang Chen Feb. 25, 2016, 8:05 p.m. UTC
changes since v5:
    removed unnecessary vcpu_on_replq() checks
    deadline_queue_insert() returns a flag to indicate if it's
    needed to re-program the timer
    removed unnecessary timer checks
    added helper functions to remove vcpus from queues
    coding style

Changes since v4:
    removed unnecessary replenishment queue checks in vcpu_wake()
    extended replq_remove() to all cases in vcpu_sleep()
    used _deadline_queue_insert() helper function for both queues
    _replq_insert() and _replq_remove() program timer internally

Changes since v3:
    removed running queue.
    added repl queue to keep track of repl events.
    timer is now per scheduler.
    timer is init on a valid cpu in a cpupool.

Budget replenishment and enforcement are separated by adding
a replenishment timer, which fires at the next most imminent
release time of all runnable vcpus.

A new runningq has been added to keep track of all vcpus that
are on pcpus.

The following functions have major changes to manage the runningq
and replenishment:

repl_handler(): a timer handler which is re-programmed
to fire at the nearest vcpu deadline, to replenish vcpus on the runq,
depletedq and runningq. When the replenishment is done, each
replenished vcpu should tickle a pcpu to see if it needs to
preempt any running vcpus.

rt_schedule(): picks the highest-priority runnable vcpu based on cpu
affinity, and ret.time will be passed to schedule(). If an idle
vcpu is picked, -1 is returned to avoid busy-waiting. repl_update()
has been removed.

rt_vcpu_wake(): when a vcpu wakes up, it tickles a pcpu instead of
picking one from the runq. It might also reprogram the timer if the
vcpu has an earlier release time.

rt_context_saved(): when context switching is finished, the
preempted vcpu will be put back into the runq. The runningq is
updated, and picking from the runq and tickling are removed.

Simplified functional graph:

schedule.c SCHEDULE_SOFTIRQ:
    rt_schedule():
        [spin_lock]
        burn_budget(scurr)
        snext = runq_pick()
        [spin_unlock]

sched_rt.c TIMER_SOFTIRQ
    replenishment_timer_handler()
        [spin_lock]
        <for_each_vcpu_on_q(i)> {
            replenish(i)
            runq_tickle(i)
        }
        program_timer()
        [spin_unlock]
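
For illustration only, a minimal C sketch of the handler structure the
graph implies (names follow the patch; the control flow is simplified
and, e.g., glosses over re-sorting the runq and moving replenished
vcpus off the depletedq):

static void repl_handler(void *data)
{
    s_time_t now = NOW();
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    struct list_head *replq = rt_replq(ops);
    struct list_head *iter, *tmp;

    spin_lock_irq(&prv->lock);

    stop_timer(prv->repl_timer);

    list_for_each_safe ( iter, tmp, replq )
    {
        struct rt_vcpu *svc = __replq_elem(iter);

        /* the queue is deadline-ordered: stop at the first future event */
        if ( now < svc->cur_deadline )
            break;

        rt_update_deadline(now, svc); /* new deadline, full budget */
        runq_tickle(ops, svc);        /* should svc preempt a running vcpu? */

        /* re-queue svc for its next replenishment event */
        deadline_queue_remove(&svc->replq_elem);
        __replq_insert(ops, svc);
    }

    /* re-arm the timer for the next (now earliest) event, if any */
    if ( !list_empty(replq) )
        set_timer(prv->repl_timer, __replq_elem(replq->next)->cur_deadline);

    spin_unlock_irq(&prv->lock);
}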

Signed-off-by: Tianyang Chen <tiche@seas.upenn.edu>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Dagaen Golomb <dgolomb@seas.upenn.edu>
---
 xen/common/sched_rt.c |  328 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 244 insertions(+), 84 deletions(-)

Comments

Dario Faggioli Feb. 25, 2016, 11:31 p.m. UTC | #1
Hey again,

Thanks for turning this around so quickly.

We are getting closer and closer, although (of course :-)) I have some
more comments.

However, is there a particular reason why you are keeping the RFC tag?
As long as you keep it, it's like saying that you are only chasing
feedback, and that you yourself don't think the patch should be
considered for upstreaming. As far as my opinion goes, this patch is not
ready to go in right now (as I said, I've got questions and comments),
but its status is way past RFC.

On Thu, 2016-02-25 at 15:05 -0500, Tianyang Chen wrote:
> changes since v5:
>     removed unnecessary vcpu_on_replq() checks
>     deadline_queue_insert() returns a flag to indicate if it's
>     needed to re-program the timer
>     removed unnecessary timer checks
>     added helper functions to remove vcpus from queues
>     coding style
> 
> Changes since v4:
>     removed unnecessary replenishment queue checks in vcpu_wake()
>     extended replq_remove() to all cases in vcpu_sleep()
>     used _deadline_queue_insert() helper function for both queues
>     _replq_insert() and _replq_remove() program timer internally
> 
> Changes since v3:
>     removed running queue.
>     added repl queue to keep track of repl events.
>     timer is now per scheduler.
>     timer is init on a valid cpu in a cpupool.
> 
So, this does not belong here. It is ok to have it in this part of the
email, but it should not land in the actual commit changelog, once the
patch will be committed into Xen's git tree.

The way to achieve the above is to put this summary of changes below
the actual changelog, and below the Signed-off-by lines, after a marker
that looks like this "---".

> Budget replenishment and enforcement are separated by adding
> a replenishment timer, which fires at the next most imminent
> release time of all runnable vcpus.
> 
> A new runningq has been added to keep track of all vcpus that
> are on pcpus.
> 
Mmm.. Is this the proper changelog? runningq is something we discussed,
and that appeared in v2, but is certainly no longer here... :-O

> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 2e5430f..16f77f9 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c

> @@ -213,8 +220,14 @@ static inline struct list_head *rt_depletedq(const struct scheduler *ops)
>      return &rt_priv(ops)->depletedq;
>  }
>  
> +static inline struct list_head *rt_replq(const struct scheduler *ops)
> +{
> +    return &rt_priv(ops)->replq;
> +}
> +
>  /*
> - * Queue helper functions for runq and depletedq
> + * Queue helper functions for runq, depletedq
> + * and replenishment event queue
>
Full stop. :-)

In any case, I'd change this in something like:
"Helper functions for manipulating the runqueue, the depleted queue,
and the replenishment events queue."

> @@ -228,6 +241,18 @@ __q_elem(struct list_head *elem)
>      return list_entry(elem, struct rt_vcpu, q_elem);
>  }
>  
> +static struct rt_vcpu *
> +__replq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, replq_elem);
> +}
> +
> +static int
> +__vcpu_on_replq(const struct rt_vcpu *svc)
> +{
> +   return !list_empty(&svc->replq_elem);
> +}
> +
>
Ok, sorry for changing my mind again, but I really can't stand seeing
these underscores. Please, rename these as replq_elem() and
vcpu_on_replq(). There is no sensible reason why we should prefix
these with '__'.

I know, that will create some amount of inconsistency, but:
 - there is inconsistency already (here and in other sched_* files)
 - not introducing more __-prefixed functions is a step in the right
   direction; we'll convert the ones that are already there with time.

> +/*
> + * Removing a vcpu from the replenishment queue could
> + * re-program the timer for the next replenishment event
> + * if there is any on the list
> + */
> +static inline void
> +__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *replq = rt_replq(ops);
> +    struct timer* repl_timer = prv->repl_timer;
> +
> +    /*
> +     * Disarm the timer before re-programming it.
> +     * It doesn't matter if the vcpu to be removed
> +     * is on top of the list or not because the timer
> +     * is stopped and needs to be re-programmed anyways
> +     */
> +    stop_timer(repl_timer);
> +
> +    deadline_queue_remove(&svc->replq_elem);
> +
> +    /* re-arm the timer for the next replenishment event */
> +    if( !list_empty(replq) )
> +    {
> +        struct rt_vcpu *svc_next = __replq_elem(replq->next);
> +        set_timer(repl_timer, svc_next->cur_deadline);
> +    }
>
Wait, maybe you misunderstood and/or I did not make myself clear enough
(in which case, sorry). I never meant to say "always stop the timer".
Actually, in one of my last emails I said the opposite, and I think that
would be the best thing to do.

Do you think there is any specific reason why we need to always stop
and restart it? If not, I think we can:
 - have deadline_queue_remove() also return whether the element 
   removed was the first one (i.e., the one the timer was programmed 
   after);
 - if it was, re-program the timer after the new front of the queue;
 - if it wasn't, nothing to do.

Thoughts?
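
For illustration, a sketch of what that could look like (hypothetical
and untested; not what the patch currently does):

/* Return 1 if the element removed was the head of the queue. */
static inline int
deadline_queue_remove(struct list_head *queue, struct list_head *elem)
{
    int first = (queue->next == elem);

    list_del_init(elem);
    return first;
}

static inline void
__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
{
    struct rt_private *prv = rt_priv(ops);
    struct list_head *replq = rt_replq(ops);

    /* only touch the timer if the head of the queue changed */
    if ( deadline_queue_remove(replq, &svc->replq_elem) )
    {
        if ( list_empty(replq) )
            stop_timer(prv->repl_timer);
        else
            set_timer(prv->repl_timer,
                      __replq_elem(replq->next)->cur_deadline);
    }
}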

> +/*
> + * A utility function that inserts a vcpu into a
> + * queue in deadline (EDF) order. The return
> + * value is 1 if the vcpu has been inserted at the
> + * front of the list
> + */
> +static int
> +deadline_queue_insert(struct rt_vcpu * (*_get_q_elem)(struct list_head *elem),
> +    struct rt_vcpu *svc, struct list_head *elem, struct list_head *queue)
> +{
> +    struct list_head *iter;
> +    int pos = 0;
> +
> +    list_for_each(iter, queue)
> +    {
> +        struct rt_vcpu * iter_svc = (*_get_q_elem)(iter);
> +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> +            break;
> +
> +        pos++;
> +    }
> +
> +    list_add_tail(elem, iter);
> +    return !pos;
>  }
> 
Ok.

> @@ -405,22 +500,38 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>  
>      /* add svc to runq if svc still has budget */
>      if ( svc->cur_budget > 0 )
> -    {
> -        list_for_each(iter, runq)
> -        {
> -            struct rt_vcpu * iter_svc = __q_elem(iter);
> -            if ( svc->cur_deadline <= iter_svc->cur_deadline )
> -                    break;
> -         }
> -        list_add_tail(&svc->q_elem, iter);
> -    }
> +        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
>      else
>      {
>          list_add(&svc->q_elem, &prv->depletedq);
> +        ASSERT( __vcpu_on_replq(svc) );
>
So, by doing this, you're telling me that, if the vcpu is added to the
depleted queue, there must be a replenishment planned for it (or the
ASSERT() would fail).

But if we are adding it to the runq, there may or may not be a
replenishment planned for it.

Is this what we want? Why must there not be a replenishment planned
already for a vcpu going into the runq (to happen at its next
deadline)?

>  /*
> + * Insert svc into the replenishment event list
> + * in replenishment time order.
> + * vcpus that need to be replenished earlier go first.
> + * The timer may be re-programmed if svc is inserted
> + * at the front of the event list.
> + */
> +static void
> +__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *replq = rt_replq(ops);
> +    struct rt_private *prv = rt_priv(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +    int set;
> +
> +    ASSERT( !__vcpu_on_replq(svc) );
> +
> +    set = deadline_queue_insert(&__replq_elem, svc, &svc->replq_elem, replq);
> + 
> +    if( set )
> +        set_timer(repl_timer,svc->cur_deadline);
> +}
A matter of taste, mostly, but I'd avoid the local variable. If this
function at some point becomes more complex, then we'll see; but for
now, I think it's ok to just use the return value of
deadline_queue_insert() inside the if().
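
i.e., something like this sketch:

    if ( deadline_queue_insert(&__replq_elem, svc, &svc->replq_elem, replq) )
        set_timer(repl_timer, svc->cur_deadline);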

> @@ -868,6 +968,8 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>          set_bit(__RTDS_delayed_runq_add, &scurr->flags);
>  
>      snext->last_start = now;
> +
>
There is no need to add this blank line...

> +    ret.time =  -1; /* if an idle vcpu is picked */ 
>      if ( !is_idle_vcpu(snext->vcpu) )
>      {
>          if ( snext != scurr )
> @@ -880,9 +982,11 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>              snext->vcpu->processor = cpu;
>              ret.migrated = 1;
>          }
> +
> +        ret.time = snext->budget; /* invoke the scheduler next time */
> +
...nor this one.
>      }

> @@ -924,6 +1028,8 @@ rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
>          __q_remove(svc);
>      else if ( svc->flags & RTDS_delayed_runq_add )
>          clear_bit(__RTDS_delayed_runq_add, &svc->flags);
> +
> +    __replq_remove(ops, svc);
>
What I said in my last email is that you probably can get rid of this
(see below, when commenting on context_saved()).
 
> @@ -1051,6 +1153,22 @@ rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>      else
>          SCHED_STAT_CRANK(vcpu_wake_not_runnable);
>  
> +    /*
> +     * If a deadline passed while svc was asleep/blocked, we need new
> +     * scheduling parameters (a new deadline and full budget), and
> +     * also a new replenishment event
> +     */
> +    if ( now >= svc->cur_deadline)
> +    {   
> +        rt_update_deadline(now, svc);
> +        __replq_remove(ops, svc);
> +    }
> +
> +    if( !__vcpu_on_replq(svc) )
> +    {
> +        __replq_insert(ops, svc);
> +        ASSERT( vcpu_runnable(vc) );
>
Mmm... What's this assert about?

> @@ -1087,10 +1193,6 @@ static void
>  rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct rt_vcpu *svc = rt_vcpu(vc);
> -    struct rt_vcpu *snext = NULL;
> -    struct rt_dom *sdom = NULL;
> -    struct rt_private *prv = rt_priv(ops);
> -    cpumask_t *online;
>      spinlock_t *lock = vcpu_schedule_lock_irq(vc);
>  
>      clear_bit(__RTDS_scheduled, &svc->flags);
> @@ -1102,14 +1204,7 @@ rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
>           likely(vcpu_runnable(vc)) )
>      {
>          __runq_insert(ops, svc);
> -        __repl_update(ops, NOW());
> -
> -        ASSERT(!list_empty(&prv->sdom));
> -        sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
> -        online = cpupool_domain_cpumask(sdom->dom);
> -        snext = __runq_pick(ops, online); /* pick snext from ALL cpus */
> -
> -        runq_tickle(ops, snext);
> +        runq_tickle(ops, svc);
>      }
>
So, here we are.

What I meant was to make this function look more or less like this:

static void
rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu *svc = rt_vcpu(vc);
    spinlock_t *lock = vcpu_schedule_lock_irq(vc);

    clear_bit(__RTDS_scheduled, &svc->flags);
    /* not insert idle vcpu to runq */
    if ( is_idle_vcpu(vc) )
        goto out;

    if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
         likely(vcpu_runnable(vc)) )
    {
        __runq_insert(ops, svc);
        runq_tickle(ops, svc);
    }
    else
        __replq_remove(ops, svc);
out:
    vcpu_schedule_unlock_irq(lock, vc);
}

And, as said above, if you do this, try also removing the
__replq_remove() call from rt_vcpu_sleep(); this one should cover for
that (and, actually, more than just that!).

After all this, check whether you still have the assert in
__replq_insert() triggering, and let me know.

Thanks and Regards,
Dario
Tianyang Chen Feb. 26, 2016, 5:15 a.m. UTC | #2
On 2/25/2016 6:31 PM, Dario Faggioli wrote:
> Hey again,
>
> Thanks for turning this around so quickly.
>
> We are getting closer and closer, although (of course :-)) I have some
> more comments.
>
> However, is there a particular reason why you are keeping the RFC tag?
> As long as you keep it, it's like saying that you are only chasing
> feedback, and that you yourself don't think the patch should be
> considered for upstreaming. As far as my opinion goes, this patch is not
> ready to go in right now (as I said, I've got questions and comments),
> but its status is way past RFC.
>

Oh, OK. I had the impression that RFC means "request for comments" and
that it should always be used because, indeed, I am asking for comments.

> On Thu, 2016-02-25 at 15:05 -0500, Tianyang Chen wrote:
>> changes since v5:
>>      removed unnecessary vcpu_on_replq() checks
>>      deadline_queue_insert() returns a flag to indicate if it's
>>      needed to re-program the timer
>>      removed unnecessary timer checks
>>      added helper functions to remove vcpus from queues
>>      coding style
>>
>> Changes since v4:
>>      removed unnecessary replenishment queue checks in vcpu_wake()
>>      extended replq_remove() to all cases in vcpu_sleep()
>>      used _deadline_queue_insert() helper function for both queues
>>      _replq_insert() and _replq_remove() program timer internally
>>
>> Changes since v3:
>>      removed running queue.
>>      added repl queue to keep track of repl events.
>>      timer is now per scheduler.
>>      timer is init on a valid cpu in a cpupool.
>>
> So, this does not belong here. It is ok to have it in this part of the
> email, but it should not land in the actual commit changelog, once the
> patch will be committed into Xen's git tree.
>
> The way to achieve the above is to put this summary of changes below
> the actual changelog, and below the Signed-off-by lines, after a marker
> that looks like this "---".
>
>> Budget replenishment and enforcement are separated by adding
>> a replenishment timer, which fires at the next most imminent
>> release time of all runnable vcpus.
>>
>> A new runningq has been added to keep track of all vcpus that
>> are on pcpus.
>>
> Mmm.. Is this the proper changelog? runningq is something we discussed,
> and that appeared in v2, but is certainly no longer here... :-O
>

oops...

> Wait, maybe you misunderstood and/or I did not make myself clear enough
> (in which case, sorry). I never meant to say "always stop the timer".
> Actually, in one of my last emails I said the opposite, and I think that
> would be the best thing to do.
>
> Do you think there is any specific reason why we need to always stop
> and restart it? If not, I think we can:
>   - have deadline_queue_remove() also return whether the element
>     removed was the first one (i.e., the one the timer was programmed
>     after);
>   - if it was, re-program the timer after the new front of the queue;
>   - if it wasn't, nothing to do.
>
> Thoughts?
>

My original thought was that the timer needs to be re-programmed
only when the top vcpu is taken off. So did you mean that, when I
manipulated repl_timer->expires manually, the timer should have been
stopped using the proper timer API? That manipulation is gone now.
Also, set_timer() internally disables the timer, so I assume it's safe
just to call set_timer() directly, right?

>> @@ -405,22 +500,38 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>>
>>       /* add svc to runq if svc still has budget */
>>       if ( svc->cur_budget > 0 )
>> -    {
>> -        list_for_each(iter, runq)
>> -        {
>> -            struct rt_vcpu * iter_svc = __q_elem(iter);
>> -            if ( svc->cur_deadline <= iter_svc->cur_deadline )
>> -                    break;
>> -         }
>> -        list_add_tail(&svc->q_elem, iter);
>> -    }
>> +        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
>>       else
>>       {
>>           list_add(&svc->q_elem, &prv->depletedq);
>> +        ASSERT( __vcpu_on_replq(svc) );
>>
> So, by doing this, you're telling me that, if the vcpu is added to the
> depleted queue, there must be a replenishment planned for it (or the
> ASSERT() would fail).
>
> But if we are adding it to the runq, there may or may not be a
> replenishment planned for it.
>
> Is this what we want? Why must there not be a replenishment planned
> already for a vcpu going into the runq (to happen at its next
> deadline)?
>

It looks like the current code doesn't add a vcpu to the replenishment
queue when vcpu_insert() is called. When the scheduler is initialized,
all the vcpus are added to the replenishment queue only after waking up
from sleep. This needs to be changed (add the vcpu to the replq in
vcpu_insert()) to make it consistent, in the sense that whenever
rt_vcpu_insert() is called, the vcpu needs to have a corresponding
replenishment event queued.

This way the ASSERT() covers both cases in __runq_insert(), enforcing
the invariant that "when a vcpu is inserted to the runq/depletedq, a
replenishment event is waiting for it".

>> @@ -1087,10 +1193,6 @@ static void
>>   rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
>>   {
>>       struct rt_vcpu *svc = rt_vcpu(vc);
>> -    struct rt_vcpu *snext = NULL;
>> -    struct rt_dom *sdom = NULL;
>> -    struct rt_private *prv = rt_priv(ops);
>> -    cpumask_t *online;
>>       spinlock_t *lock = vcpu_schedule_lock_irq(vc);
>>
>>       clear_bit(__RTDS_scheduled, &svc->flags);
>> @@ -1102,14 +1204,7 @@ rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
>>            likely(vcpu_runnable(vc)) )
>>       {
>>           __runq_insert(ops, svc);
>> -        __repl_update(ops, NOW());
>> -
>> -        ASSERT(!list_empty(&prv->sdom));
>> -        sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
>> -        online = cpupool_domain_cpumask(sdom->dom);
>> -        snext = __runq_pick(ops, online); /* pick snext from ALL cpus */
>> -
>> -        runq_tickle(ops, snext);
>> +        runq_tickle(ops, svc);
>>       }
>>
> So, here we are.
>
> What I meant was to make this function look more or less like this:
>
> static void
> rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> {
>      struct rt_vcpu *svc = rt_vcpu(vc);
>      spinlock_t *lock = vcpu_schedule_lock_irq(vc);
>
>      clear_bit(__RTDS_scheduled, &svc->flags);
>      /* not insert idle vcpu to runq */
>      if ( is_idle_vcpu(vc) )
>          goto out;
>
>      if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
>           likely(vcpu_runnable(vc)) )
>      {
>          __runq_insert(ops, svc);
>          runq_tickle(ops, svc);
>      }
>      else
>          __replq_remove(ops, svc);
> out:
>      vcpu_schedule_unlock_irq(lock, vc);
> }
>
> And, as said above, if you do this, try also removing the
> __replq_remove() call from rt_vcpu_sleep(), this one should cover for
> that (and, actually, more than just that!).
>
> After all this, check whether you still have the assert in
> __replq_insert() triggering and let me know

So after moving the __replq_remove() to rt_context_saved(), the assert 
in __replq_insert() still fails when dom0 boots up.

(XEN) Assertion '!vcpu_on_replq(svc)' failed at sched_rt.c:524
(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d080128c07>] sched_rt.c#__replq_insert+0x2b/0x64
(XEN) RFLAGS: 0000000000010002   CONTEXT: hypervisor (d0v3)
(XEN) rax: 0000000000000001   rbx: ffff83023b522940   rcx: 0000000000000001
(XEN) rdx: 00000031bb1b9980   rsi: ffff83023b486760   rdi: ffff83023b486760
(XEN) rbp: ffff8300bfcffd48   rsp: ffff8300bfcffd28   r8:  0000000000000004
(XEN) r9:  00000000deadbeef   r10: ffff82d08025f5a0   r11: 0000000000000206
(XEN) r12: ffff83023b486760   r13: ffff83023b522d80   r14: ffff83023b4b5000
(XEN) r15: 000000023a6e774b   cr0: 0000000080050033   cr4: 00000000000406a0
(XEN) cr3: 0000000231c0d000   cr2: ffff8802200d81f8
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff8300bfcffd28:
(XEN)    ffff82d08019642e ffff83023b486760 ffff8300bfd47000 ffff82d080299b00
(XEN)    ffff8300bfcffd88 ffff82d08012a072 ffff8300bfcffd70 ffff8300bfd47000
(XEN)    000000023a6e75c0 ffff83023b522940 ffff82d08032bc00 0000000000000282
(XEN)    ffff8300bfcffdd8 ffff82d08012be2c ffff83023b4b5000 ffff83023b4f1000
(XEN)    ffff8300bfd44000 ffff8300bfd47000 0000000000000001 ffff83023b4b40c0
(XEN)    ffff880230c14440 0000000000000000 ffff8300bfcffde8 ffff82d08012c347
(XEN)    ffff8300bfcffe08 ffff82d080169cea ffff83023b4b5000 0000000000000003
(XEN)    ffff8300bfcffe18 ffff82d080169d65 ffff8300bfcffe38 ffff82d08010762a
(XEN)    ffff83023b4b40c0 ffff83023b4b5000 ffff8300bfcffe68 ffff82d08010822a
(XEN)    ffff8300bfcffe68 fffffffffffffff2 ffff88022061dcb4 0000000000000000
(XEN)    ffff8300bfcffef8 ffff82d0801096fc 0000000000000001 ffff8300bfcfff18
(XEN)    ffff8300bfcffef8 ffff82d080240e85 ffff8300bfcf8000 0000000000000000
(XEN)    0000000000000246 ffffffff810013aa 0000000000000003 ffffffff810013aa
(XEN)    ffff8300bfcffee8 ffff8300bfd44000 ffff8802205e8000 0000000000000000
(XEN)    ffff880230c14440 0000000000000000 00007cff403000c7 ffff82d0802439e2
(XEN)    ffffffff8100140a 0000000000000020 0000000000000003 ffff880230c71900
(XEN)    ffff8802206584d0 ffff880220658000 ffff88022061dcb8 0000000000000000
(XEN)    0000000000000206 0000000000000000 ffff880223000168 ffff880223408e00
(XEN)    0000000000000020 ffffffff8100140a 0000000000000000 ffff88022061dcb4
(XEN)    0000000000000004 0001010000000000 ffffffff8100140a 000000000000e033
(XEN) Xen call trace:
(XEN)    [<ffff82d080128c07>] sched_rt.c#__replq_insert+0x2b/0x64
(XEN)    [<ffff82d08012a072>] sched_rt.c#rt_vcpu_wake+0xf2/0x12c
(XEN)    [<ffff82d08012be2c>] vcpu_wake+0x213/0x3d4
(XEN)    [<ffff82d08012c347>] vcpu_unblock+0x4b/0x4d
(XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
(XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
(XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
(XEN)    [<ffff82d08010822a>] evtchn_send+0x158/0x183
(XEN)    [<ffff82d0801096fc>] do_event_channel_op+0xe21/0x147d
(XEN)    [<ffff82d0802439e2>] lstar_enter+0xe2/0x13c
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion '!vcpu_on_replq(svc)' failed at sched_rt.c:524
(XEN) ****************************************

Thanks,
Tianyang
Dario Faggioli Feb. 26, 2016, 9:11 a.m. UTC | #3
On Fri, 2016-02-26 at 00:15 -0500, Tianyang Chen wrote:
> On 2/25/2016 6:31 PM, Dario Faggioli wrote:
> > As far as my opinion goes, this patch is not ready to
> > go in
> > right now (as I said, I've got questions and comments), but its
> > status
> > is way past RFC.
> > 
> Oh OK, I had the impression that RFC means request for comments and
> it 
> should always be used because indeed, I'm asking for comments.
> 
Exactly. Everyone is asking for comments when sending out a patch
series, and that's why it's not necessary to state it with the tag...
it's always the case, so no need to say it every time!

And, for the same reason, this means that, when the tag is there,
you're not asking for comments and, if everything is ok, inclusion;
you're asking for "just some feedback on a draft", and as I said,
we're beyond that phase. :-)

> > Wait, maybe you misunderstood and/or I did not make myself clear
> > enough
> > (in which case, sorry). I never meant to say "always stop the
> > timer".
> > Actually, in one of my last emails I said the opposite, and I think
> > that
> > would be the best thing to do.
> > 
> > Do you think there is any specific reason why we need to always
> > stop
> > and restart it? If not, I think we can:
> >   - have deadline_queue_remove() also return whether the element
> >     removed was the first one (i.e., the one the timer was
> >     programmed after);
> >   - if it was, re-program the timer after the new front of the
> >     queue;
> >   - if it wasn't, nothing to do.
> > 
> > Thoughts?
> > 
> My original thought was that the timer needs to be re-programmed
> only when the top vcpu is taken off. So did you mean that, when I
> manipulated repl_timer->expires manually, the timer should have been
> stopped using the proper timer API? That manipulation is gone now.
>
I know... This is mostly my fault, for commenting on this in two
different emails.

So, basically, yes, I meant that, if you want to fiddle with the timer
--like you were doing with those 'repl_timer->expires = 0'-- you need
to do it properly, with the proper API, locking, etc.

Then, in a subsequent email, I said that you just only need to do that
in a subset of the cases when this function is called.

Of course, the desired result is the combination of the two above
considerations, i.e.:
 - only stop/restart the timer when necessary,
 - if necessary, do it properly.

>  Also, set_timer internally 
> disables the timer so I assume it's safe just to call set_timer() 
> directly, right?
> 
Yes, it is.

> > > @@ -405,22 +500,38 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> > > 
> > >       /* add svc to runq if svc still has budget */
> > >       if ( svc->cur_budget > 0 )
> > > -    {
> > > -        list_for_each(iter, runq)
> > > -        {
> > > -            struct rt_vcpu * iter_svc = __q_elem(iter);
> > > -            if ( svc->cur_deadline <= iter_svc->cur_deadline )
> > > -                    break;
> > > -         }
> > > -        list_add_tail(&svc->q_elem, iter);
> > > -    }
> > > +        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
> > >       else
> > >       {
> > >           list_add(&svc->q_elem, &prv->depletedq);
> > > +        ASSERT( __vcpu_on_replq(svc) );
> > > 
> > So, by doing this, you're telling me that, if the vcpu is added to
> > the depleted queue, there must be a replenishment planned for it
> > (or the ASSERT() would fail).
> > 
> > But if we are adding it to the runq, there may or may not be a
> > replenishment planned for it.
> > 
> > Is this what we want? Why must there not be a replenishment planned
> > already for a vcpu going into the runq (to happen at its next
> > deadline)?
> > 
> It looks like the current code doesn't add a vcpu to the replenishment
> queue when vcpu_insert() is called. When the scheduler is initialized,
> all the vcpus are added to the replenishment queue only after waking
> up from sleep. This needs to be changed (add the vcpu to the replq in
> vcpu_insert()) to make it consistent, in the sense that whenever
> rt_vcpu_insert() is called, the vcpu needs to have a corresponding
> replenishment event queued.
> 
> This way the ASSERT() covers both cases in __runq_insert(), enforcing
> the invariant that "when a vcpu is inserted to the runq/depletedq, a
> replenishment event is waiting for it".
> 
I am ok with this (calling replq_insert() in rt_vcpu_insert()). Not
just doing that unconditionally though, as a vcpu can, e.g., be paused
when the insert_vcpu hook is called, and in that case, I don't think we
want to enqueue the replenishment event, do we?
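
For illustration, a hypothetical shape for such an rt_vcpu_insert() (a
sketch only; the exact "don't queue the event" condition, here
vcpu_runnable() plus !vc->is_running, is an assumption, not settled):

static void
rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu *svc = rt_vcpu(vc);
    s_time_t now = NOW();
    spinlock_t *lock;

    BUG_ON( is_idle_vcpu(vc) );

    lock = vcpu_schedule_lock_irq(vc);

    if ( now >= svc->cur_deadline )
        rt_update_deadline(now, svc);

    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) && !vc->is_running )
    {
        /* hypothetical placement: queue the replenishment event along
         * with the runq insertion, but only for runnable vcpus (e.g.
         * not for a vcpu that is paused while being moved) */
        __replq_insert(ops, svc);
        __runq_insert(ops, svc);
    }

    vcpu_schedule_unlock_irq(lock, vc);
}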

> > static void
> > rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> > {
> >      struct rt_vcpu *svc = rt_vcpu(vc);
> >      spinlock_t *lock = vcpu_schedule_lock_irq(vc);
> > 
> >      clear_bit(__RTDS_scheduled, &svc->flags);
> >      /* not insert idle vcpu to runq */
> >      if ( is_idle_vcpu(vc) )
> >          goto out;
> > 
> >      if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
> >           likely(vcpu_runnable(vc)) )
> >      {
> >          __runq_insert(ops, svc);
> >          runq_tickle(ops, svc);
> >      }
> >      else
> >          __replq_remove(ops, svc);
> > out:
> >      vcpu_schedule_unlock_irq(lock, vc);
> > }
> > 
> > And, as said above, if you do this, try also removing the
> > __replq_remove() call from rt_vcpu_sleep(); this one should cover
> > for that (and, actually, more than just that!).
> > 
> > After all this, check whether you still have the assert in
> > __replq_insert() triggering and let me know
> So after moving the __replq_remove() to rt_context_saved(), the
> assert 
> in __replq_insert() still fails when dom0 boots up.
> 
Well, maybe removing __replq_remove() from rt_vcpu_sleep() is not
entirely ok, as if we do that we fail to deal with the case where,
still in rt_vcpu_sleep(), __vcpu_on_q(svc) is true.

So, I'd say, do as I said above wrt rt_context_saved(). For
rt_vcpu_sleep(), you can try something like this:

static void
rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu * const svc = rt_vcpu(vc);

    BUG_ON( is_idle_vcpu(vc) );
    SCHED_STAT_CRANK(vcpu_sleep);

    if ( curr_on_cpu(vc->processor) == vc )
        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
    else if ( __vcpu_on_q(svc) )
    {
        __q_remove(svc);
        __replq_remove(ops, svc);  <=== *** LOOK HERE ***
    }
    else if ( svc->flags & RTDS_delayed_runq_add )
        clear_bit(__RTDS_delayed_runq_add, &svc->flags);
}

In fact, in both the first and the third case, we will at some point
pass through rt_context_saved(), and hit the __replq_remove() that I
made you put there.

In the middle case, since the vcpu was just queued, removing it from
the runq (or depletedq, in the case of this scheduler) is enough to
make it go to sleep; we won't go through
rt_schedule()-->rt_context_saved(), and hence the __replq_remove()
won't be called.

Sorry for the oversight; can you try this?

That being said...

> (XEN) Assertion '!vcpu_on_replq(svc)' failed at sched_rt.c:524
> (XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----

> (XEN) Xen call trace:
> (XEN)    [<ffff82d080128c07>] sched_rt.c#__replq_insert+0x2b/0x64
> (XEN)    [<ffff82d08012a072>] sched_rt.c#rt_vcpu_wake+0xf2/0x12c
> (XEN)    [<ffff82d08012be2c>] vcpu_wake+0x213/0x3d4
> (XEN)    [<ffff82d08012c347>] vcpu_unblock+0x4b/0x4d
> (XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
> (XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
> (XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
> (XEN)    [<ffff82d08010822a>] evtchn_send+0x158/0x183
> (XEN)    [<ffff82d0801096fc>] do_event_channel_op+0xe21/0x147d
> (XEN)    [<ffff82d0802439e2>] lstar_enter+0xe2/0x13c
> (XEN)
>
... This says that we call __replq_insert() from rt_vcpu_wake() and
find out in there that vcpu_on_replq() is true.

However, in v6 code for rt_vcpu_wake(), I can see this:

+    if( !__vcpu_on_replq(svc) )
+    {
+        __replq_insert(ops, svc);
+        ASSERT( vcpu_runnable(vc) );
+    }

which would make me think that, if vcpu_on_replq() is true, we
shouldn't be calling __replq_insert() in the first place.

So, have you made other changes wrt v6 when trying this?

Regards,
Dario
Tianyang Chen Feb. 26, 2016, 5:28 p.m. UTC | #4
On 2/26/2016 4:11 AM, Dario Faggioli wrote:
>> It looks like the current code doesn't add a vcpu to the
>> replenishment queue when vcpu_insert() is called. When the scheduler
>> is initialized, all the vcpus are added to the replenishment queue
>> only after waking up from sleep. This needs to be changed (add the
>> vcpu to the replq in vcpu_insert()) to make it consistent, in the
>> sense that whenever rt_vcpu_insert() is called, the vcpu needs to
>> have a corresponding replenishment event queued.
>>
>> This way the ASSERT() covers both cases in __runq_insert(), enforcing
>> the invariant that "when a vcpu is inserted to the runq/depletedq, a
>> replenishment event is waiting for it".
>>
> I am ok with this (calling replq_insert() in rt_vcpu_insert()). Not
> just doing that unconditionally though, as a vcpu can, e.g., be paused
> when the insert_vcpu hook is called, and in that case, I don't think we
> want to enqueue the replenishment event, do we?
>

Yes.

>>> static void
>>> rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
>>> {
>>>       struct rt_vcpu *svc = rt_vcpu(vc);
>>>       spinlock_t *lock = vcpu_schedule_lock_irq(vc);
>>>
>>>       clear_bit(__RTDS_scheduled, &svc->flags);
>>>       /* not insert idle vcpu to runq */
>>>       if ( is_idle_vcpu(vc) )
>>>           goto out;
>>>
>>>       if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
>>>            likely(vcpu_runnable(vc)) )
>>>       {
>>>           __runq_insert(ops, svc);
>>>           runq_tickle(ops, svc);
>>>       }
>>>       else
>>>           __replq_remove(ops, svc);
>>> out:
>>>       vcpu_schedule_unlock_irq(lock, vc);
>>> }
>>>
>>> And, as said above, if you do this, try also removing the
>>> __replq_remove() call from rt_vcpu_sleep(); this one should cover
>>> for that (and, actually, more than just that!).
>>>
>>> After all this, check whether you still have the assert in
>>> __replq_insert() triggering and let me know
>> So after moving the __replq_remove() to rt_context_saved(), the
>> assert in __replq_insert() still fails when dom0 boots up.
>>
> Well, maybe removing __replq_remove() from rt_vcpu_sleep() is not
> entirely ok, as if we do that we fail to deal with the case where,
> still in rt_vcpu_sleep(), __vcpu_on_q(svc) is true.
>
> So, I'd say, do as I said above wrt rt_context_saved(). For
> rt_vcpu_sleep(), you can try something like this:
>
> static void
> rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> {
>      struct rt_vcpu * const svc = rt_vcpu(vc);
>
>      BUG_ON( is_idle_vcpu(vc) );
>      SCHED_STAT_CRANK(vcpu_sleep);
>
>      if ( curr_on_cpu(vc->processor) == vc )
>          cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
>      else if ( __vcpu_on_q(svc) )
>      {
>          __q_remove(svc);
>          __replq_remove(ops, svc);  <=== *** LOOK HERE ***
>      }
>      else if ( svc->flags & RTDS_delayed_runq_add )
>          clear_bit(__RTDS_delayed_runq_add, &svc->flags);
> }
>
> In fact, in both the first and the third case, we will at some point
> pass through rt_context_saved(), and hit the __replq_remove() that I
> made you put there.
>
> In the middle case, since the vcpu was just queued, removing it from
> the runq (or depletedq, in the case of this scheduler) is enough to
> make it go to sleep; we won't go through
> rt_schedule()-->rt_context_saved(), and hence the __replq_remove()
> won't be called.
>
> Sorry for the oversight; can you try this?
>
> That being said...
>
>> (XEN) Assertion '!vcpu_on_replq(svc)' failed at sched_rt.c:524
>> (XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
>>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d080128c07>] sched_rt.c#__replq_insert+0x2b/0x64
>> (XEN)    [<ffff82d08012a072>] sched_rt.c#rt_vcpu_wake+0xf2/0x12c
>> (XEN)    [<ffff82d08012be2c>] vcpu_wake+0x213/0x3d4
>> (XEN)    [<ffff82d08012c347>] vcpu_unblock+0x4b/0x4d
>> (XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
>> (XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
>> (XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
>> (XEN)    [<ffff82d08010822a>] evtchn_send+0x158/0x183
>> (XEN)    [<ffff82d0801096fc>] do_event_channel_op+0xe21/0x147d
>> (XEN)    [<ffff82d0802439e2>] lstar_enter+0xe2/0x13c
>> (XEN)
>>
> ... This says that we call __replq_insert() from rt_vcpu_wake() and
> find out in there that vcpu_on_replq() is true.
>
> However, in v6 code for rt_vcpu_wake(), I can see this:
>
> +    if( !__vcpu_on_replq(svc) )
> +    {
> +        __replq_insert(ops, svc);
> +        ASSERT( vcpu_runnable(vc) );
> +    }
>
> which would make me think that, if vcpu_on_replq() is true, we
> shouldn't be calling __replq_insert() in the first place.
>
> So, have you made other changes wrt v6 when trying this?

The v6 code doesn't have the if statement commented out; that's how I
submitted it. But I tried commenting it out, and the assertion still failed.

It fails in v6 with:
rt_vcpu_sleep(): removing replenishment event for all cases
rt_context_saved(): not removing replenishment events
rt_vcpu_wake(): not checking if the event is already queued.

or with:
rt_vcpu_sleep(): not removing replenishment event at all
rt_context_saved(): removing replenishment events if not runnable
rt_vcpu_wake(): not checking if the event is already queued.

Also with:
rt_vcpu_sleep(): removing replenishment event if the vcpu is on 
runq/depletedq
rt_context_saved(): removing replenishment events if not runnable
rt_vcpu_wake(): not checking if the event is already queued.

I added debug prints in all these functions and noticed that it could be
caused by a race between spurious wakeups and context switching. See the
following events for the last modification above:

(XEN) cpu1 picked idle
(XEN) d0 attempted to change d0v1's CR4 flags 00000620 -> 00040660
(XEN) cpu2 picked idle
(XEN) vcpu1 sleeps on cpu
(XEN) cpu0 picked idle
(XEN) vcpu1 context saved not runnable
(XEN) vcpu1 wakes up nowhere
(XEN) cpu0 picked vcpu1
(XEN) vcpu1 sleeps on cpu
(XEN) cpu0 picked idle
(XEN) vcpu1 context saved not runnable
(XEN) vcpu1 wakes up nowhere
(XEN) cpu0 picked vcpu1
(XEN) cpu0 picked idle
(XEN) vcpu1 context saved not runnable
(XEN) cpu0 picked vcpu0
(XEN) vcpu1 wakes up nowhere
(XEN) cpu1 picked vcpu1     *** vcpu1 is on a cpu
(XEN) cpu1 picked idle      *** vcpu1 is waiting to be context switched
(XEN) vcpu2 wakes up nowhere
(XEN) cpu0 picked vcpu0
(XEN) cpu2 picked vcpu2
(XEN) cpu0 picked vcpu0
(XEN) cpu0 picked vcpu0
(XEN) d0 attempted to change d0v2's CR4 flags 00000620 -> 00040660
(XEN) cpu0 picked vcpu0
(XEN) vcpu2 sleeps on cpu
(XEN) vcpu1 wakes up nowhere      *** vcpu1 wakes up without sleep?

(XEN) Assertion '!__vcpu_on_replq(svc)' failed at sched_rt.c:526
(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
...
(XEN) Xen call trace:
(XEN)    [<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
(XEN)    [<ffff82d08012bf2c>] vcpu_wake+0x213/0x3d4
(XEN)    [<ffff82d08012c447>] vcpu_unblock+0x4b/0x4d
(XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
(XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
(XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
(XEN)    [<ffff82d080108312>] send_guest_vcpu_virq+0x9d/0xba
(XEN)    [<ffff82d080196cbe>] send_timer_event+0xe/0x10
(XEN)    [<ffff82d08012a7b5>] schedule.c#vcpu_singleshot_timer_fn+0x9/0xb
(XEN)    [<ffff82d080131978>] timer.c#execute_timer+0x4e/0x6c
(XEN)    [<ffff82d080131ab9>] timer.c#timer_softirq_action+0xdd/0x213
(XEN)    [<ffff82d08012df32>] softirq.c#__do_softirq+0x82/0x8d
(XEN)    [<ffff82d08012df8a>] do_softirq+0x13/0x15
(XEN)    [<ffff82d080243ad1>] cpufreq.c#process_softirqs+0x21/0x30


So, it looks like a spurious wakeup for vcpu1 happens before it is
completely context-switched off a cpu. But rt_vcpu_wake() didn't see it
on a cpu via curr_on_cpu(), so it fell through the first two returns.

I guess the replenishment queue check is necessary for this situation?
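For reference, the guard being discussed, as a sketch (this is
essentially the v6 rt_vcpu_wake() hunk quoted earlier):

/* a spurious wakeup can arrive after the vcpu slept but before
 * rt_context_saved() has run, so its replenishment event may still be
 * queued; only insert one if it is not */
if ( now >= svc->cur_deadline )
{
    rt_update_deadline(now, svc);
    __replq_remove(ops, svc);
}

if ( !__vcpu_on_replq(svc) )
    __replq_insert(ops, svc);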

Thanks,
Tianyang
Dario Faggioli Feb. 26, 2016, 6:09 p.m. UTC | #5
On Fri, 2016-02-26 at 12:28 -0500, Tianyang Chen wrote:
> > So, have you made other changes wrt v6 when trying this?
> The v6 doesn't have the if statement commented out when I submitted
> it. 
> But I tried commenting it out, the assertion failed.
> 
Ok, thanks for these tests. Can you send (just quick-&-dirtily, as an
attachment to a reply to this email; no need for a proper re-submission
of a new version) the patch that does this:

> rt_vcpu_sleep(): removing replenishment event if the vcpu is on 
> runq/depletedq
> rt_context_saved(): removing replenishment events if not runnable
> rt_vcpu_wake(): not checking if the event is already queued.
> 
>> I added debug prints in all these functions and noticed that it
>> could be caused by a race between spurious wakeups and context
>> switching.
>
And the code that produces this debug output as well?

> (XEN) cpu1 picked idle
> (XEN) d0 attempted to change d0v1's CR4 flags 00000620 -> 00040660
> (XEN) cpu2 picked idle
> (XEN) vcpu1 sleeps on cpu
> (XEN) cpu0 picked idle
> (XEN) vcpu1 context saved not runnable
> (XEN) vcpu1 wakes up nowhere
> (XEN) cpu0 picked vcpu1
> (XEN) vcpu1 sleeps on cpu
> (XEN) cpu0 picked idle
> (XEN) vcpu1 context saved not runnable
> (XEN) vcpu1 wakes up nowhere
> (XEN) cpu0 picked vcpu1
> (XEN) cpu0 picked idle
> (XEN) vcpu1 context saved not runnable
> (XEN) cpu0 picked vcpu0
> (XEN) vcpu1 wakes up nowhere
> (XEN) cpu1 picked vcpu1     *** vcpu1 is on a cpu
> (XEN) cpu1 picked idle      *** vcpu1 is waiting to be context switched
> (XEN) vcpu2 wakes up nowhere
> (XEN) cpu0 picked vcpu0
> (XEN) cpu2 picked vcpu2
> (XEN) cpu0 picked vcpu0
> (XEN) cpu0 picked vcpu0
> (XEN) d0 attempted to change d0v2's CR4 flags 00000620 -> 00040660
> (XEN) cpu0 picked vcpu0
> (XEN) vcpu2 sleeps on cpu
> (XEN) vcpu1 wakes up nowhere      *** vcpu1 wakes up without sleep?
> 
> (XEN) Assertion '!__vcpu_on_replq(svc)' failed at sched_rt.c:526
> (XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
> (XEN)    [<ffff82d08012bf2c>] vcpu_wake+0x213/0x3d4
> (XEN)    [<ffff82d08012c447>] vcpu_unblock+0x4b/0x4d
> (XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
> (XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
> (XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
> (XEN)    [<ffff82d080108312>] send_guest_vcpu_virq+0x9d/0xba
> (XEN)    [<ffff82d080196cbe>] send_timer_event+0xe/0x10
> (XEN)    [<ffff82d08012a7b5>] schedule.c#vcpu_singleshot_timer_fn+0x9/0xb
> (XEN)    [<ffff82d080131978>] timer.c#execute_timer+0x4e/0x6c
> (XEN)    [<ffff82d080131ab9>] timer.c#timer_softirq_action+0xdd/0x213
> (XEN)    [<ffff82d08012df32>] softirq.c#__do_softirq+0x82/0x8d
> (XEN)    [<ffff82d08012df8a>] do_softirq+0x13/0x15
> (XEN)    [<ffff82d080243ad1>] cpufreq.c#process_softirqs+0x21/0x30
> 
> 
> So, it looks like a spurious wakeup for vcpu1 happens before it is
> completely context-switched off a cpu. But rt_vcpu_wake() didn't see
> it on a cpu via curr_on_cpu(), so it fell through the first two
> returns.
> 
> I guess the replenishment queue check is necessary for this
> situation?
> 
Perhaps, but I first want to make sure we understand what is really
happening.

Regards,
Dario
Tianyang Chen Feb. 26, 2016, 6:33 p.m. UTC | #6
Attached.

Tianyang

On 2016-02-26 13:09, Dario Faggioli wrote:
> On Fri, 2016-02-26 at 12:28 -0500, Tianyang Chen wrote:
>> > So, have you made other changes wrt v6 when trying this?
>> The v6 code doesn't have the if statement commented out; that's how
>> I submitted it. But I tried commenting it out, and the assertion
>> still failed.
>> 
> Ok, thanks for these tests. Can you send (just quick-&-dirtily, as an
> attachment to a reply to this email; no need for a proper re-submission
> of a new version) the patch that does this:
> 
>> rt_vcpu_sleep(): removing replenishment event if the vcpu is on 
>> runq/depletedq
>> rt_context_saved(): removing replenishment events if not runnable
>> rt_vcpu_wake(): not checking if the event is already queued.
>> 
>> I added debug prints in all these functions and noticed that it
>> could be caused by a race between spurious wakeups and context
>> switching.
>> 
> And the code that produces this debug output as well?
> 
>> (XEN) cpu1 picked idle
>> (XEN) d0 attempted to change d0v1's CR4 flags 00000620 -> 00040660
>> (XEN) cpu2 picked idle
>> (XEN) vcpu1 sleeps on cpu
>> (XEN) cpu0 picked idle
>> (XEN) vcpu1 context saved not runnable
>> (XEN) vcpu1 wakes up nowhere
>> (XEN) cpu0 picked vcpu1
>> (XEN) vcpu1 sleeps on cpu
>> (XEN) cpu0 picked idle
>> (XEN) vcpu1 context saved not runnable
>> (XEN) vcpu1 wakes up nowhere
>> (XEN) cpu0 picked vcpu1
>> (XEN) cpu0 picked idle
>> (XEN) vcpu1 context saved not runnable
>> (XEN) cpu0 picked vcpu0
>> (XEN) vcpu1 wakes up nowhere
>> (XEN) cpu1 picked vcpu1     *** vcpu1 is on a cpu
>> (XEN) cpu1 picked idle      *** vcpu1 is waiting to be context switched
>> (XEN) vcpu2 wakes up nowhere
>> (XEN) cpu0 picked vcpu0
>> (XEN) cpu2 picked vcpu2
>> (XEN) cpu0 picked vcpu0
>> (XEN) cpu0 picked vcpu0
>> (XEN) d0 attempted to change d0v2's CR4 flags 00000620 -> 00040660
>> (XEN) cpu0 picked vcpu0
>> (XEN) vcpu2 sleeps on cpu
>> (XEN) vcpu1 wakes up nowhere      *** vcpu1 wakes up without sleep?
>> 
>> (XEN) Assertion '!__vcpu_on_replq(svc)' failed at sched_rt.c:526
>> (XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
>> (XEN) CPU:    0
>> (XEN) RIP:    e008:[<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d08012a151>] sched_rt.c#rt_vcpu_wake+0x11f/0x17b
>> (XEN)    [<ffff82d08012bf2c>] vcpu_wake+0x213/0x3d4
>> (XEN)    [<ffff82d08012c447>] vcpu_unblock+0x4b/0x4d
>> (XEN)    [<ffff82d080169cea>] vcpu_kick+0x20/0x6f
>> (XEN)    [<ffff82d080169d65>] vcpu_mark_events_pending+0x2c/0x2f
>> (XEN)    [<ffff82d08010762a>] event_2l.c#evtchn_2l_set_pending+0xa9/0xb9
>> (XEN)    [<ffff82d080108312>] send_guest_vcpu_virq+0x9d/0xba
>> (XEN)    [<ffff82d080196cbe>] send_timer_event+0xe/0x10
>> (XEN)    [<ffff82d08012a7b5>] schedule.c#vcpu_singleshot_timer_fn+0x9/0xb
>> (XEN)    [<ffff82d080131978>] timer.c#execute_timer+0x4e/0x6c
>> (XEN)    [<ffff82d080131ab9>] timer.c#timer_softirq_action+0xdd/0x213
>> (XEN)    [<ffff82d08012df32>] softirq.c#__do_softirq+0x82/0x8d
>> (XEN)    [<ffff82d08012df8a>] do_softirq+0x13/0x15
>> (XEN)    [<ffff82d080243ad1>] cpufreq.c#process_softirqs+0x21/0x30
>> 
>> 
>> So, it looks like a spurious wakeup for vcpu1 happens before it is
>> completely context-switched off a cpu. But rt_vcpu_wake() didn't see
>> it on a cpu via curr_on_cpu(), so it fell through the first two
>> returns.
>> 
>> I guess the replenishment queue check is necessary for this
>> situation?
>> 
> Perhaps, but I first want to make sure we understand what is really
> happening.
> 
> Regards,
> Dario

Patch

diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 2e5430f..16f77f9 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -16,6 +16,7 @@ 
 #include <xen/delay.h>
 #include <xen/event.h>
 #include <xen/time.h>
+#include <xen/timer.h>
 #include <xen/perfc.h>
 #include <xen/sched-if.h>
 #include <xen/softirq.h>
@@ -87,7 +88,7 @@ 
 #define RTDS_DEFAULT_BUDGET     (MICROSECS(4000))
 
 #define UPDATE_LIMIT_SHIFT      10
-#define MAX_SCHEDULE            (MILLISECS(1))
+
 /*
  * Flags
  */
@@ -142,6 +143,9 @@  static cpumask_var_t *_cpumask_scratch;
  */
 static unsigned int nr_rt_ops;
 
+/* handler for the replenishment timer */
+static void repl_handler(void *data);
+
 /*
  * Systme-wide private data, include global RunQueue/DepletedQ
  * Global lock is referenced by schedule_data.schedule_lock from all
@@ -152,7 +156,9 @@  struct rt_private {
     struct list_head sdom;      /* list of availalbe domains, used for dump */
     struct list_head runq;      /* ordered list of runnable vcpus */
     struct list_head depletedq; /* unordered list of depleted vcpus */
+    struct list_head replq;     /* ordered list of vcpus that need replenishment */
     cpumask_t tickled;          /* cpus been tickled */
+    struct timer *repl_timer;   /* replenishment timer */
 };
 
 /*
@@ -160,6 +166,7 @@  struct rt_private {
  */
 struct rt_vcpu {
     struct list_head q_elem;    /* on the runq/depletedq list */
+    struct list_head replq_elem;/* on the repl event list */
 
     /* Up-pointers */
     struct rt_dom *sdom;
@@ -213,8 +220,14 @@  static inline struct list_head *rt_depletedq(const struct scheduler *ops)
     return &rt_priv(ops)->depletedq;
 }
 
+static inline struct list_head *rt_replq(const struct scheduler *ops)
+{
+    return &rt_priv(ops)->replq;
+}
+
 /*
- * Queue helper functions for runq and depletedq
+ * Queue helper functions for runq, depletedq
+ * and replenishment event queue
  */
 static int
 __vcpu_on_q(const struct rt_vcpu *svc)
@@ -228,6 +241,18 @@  __q_elem(struct list_head *elem)
     return list_entry(elem, struct rt_vcpu, q_elem);
 }
 
+static struct rt_vcpu *
+__replq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, replq_elem);
+}
+
+static int
+__vcpu_on_replq(const struct rt_vcpu *svc)
+{
+   return !list_empty(&svc->replq_elem);
+}
+
 /*
  * Debug related code, dump vcpu/cpu information
  */
@@ -288,7 +313,7 @@  rt_dump_pcpu(const struct scheduler *ops, int cpu)
 static void
 rt_dump(const struct scheduler *ops)
 {
-    struct list_head *runq, *depletedq, *iter;
+    struct list_head *runq, *depletedq, *replq, *iter;
     struct rt_private *prv = rt_priv(ops);
     struct rt_vcpu *svc;
     struct rt_dom *sdom;
@@ -301,6 +326,7 @@  rt_dump(const struct scheduler *ops)
 
     runq = rt_runq(ops);
     depletedq = rt_depletedq(ops);
+    replq = rt_replq(ops);
 
     printk("Global RunQueue info:\n");
     list_for_each( iter, runq )
@@ -316,6 +342,13 @@  rt_dump(const struct scheduler *ops)
         rt_dump_vcpu(ops, svc);
     }
 
+    printk("Global Replenishment Event info:\n");
+    list_for_each( iter, replq )
+    {
+        svc = __replq_elem(iter);
+        rt_dump_vcpu(ops, svc);
+    }
+
     printk("Domain info:\n");
     list_for_each( iter, &prv->sdom )
     {
@@ -380,11 +413,74 @@  rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
     return;
 }
 
+/* a helper function that only removes a vcpu from a queue */
+static inline void
+deadline_queue_remove(struct list_head *elem)
+{
+    list_del_init(elem);
+}
+
 static inline void
 __q_remove(struct rt_vcpu *svc)
 {
     if ( __vcpu_on_q(svc) )
-        list_del_init(&svc->q_elem);
+        deadline_queue_remove(&svc->q_elem);
+}
+
+/*
+ * Removing a vcpu from the replenishment queue could
+ * re-program the timer for the next replenishment event
+ * if there is any on the list
+ */
+static inline void
+__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct timer* repl_timer = prv->repl_timer;
+
+    /*
+     * Disarm the timer before re-programming it.
+     * It doesn't matter if the vcpu to be removed
+     * is on top of the list or not because the timer
+     * is stopped and needs to be re-programmed anyways
+     */
+    stop_timer(repl_timer);
+
+    deadline_queue_remove(&svc->replq_elem);
+
+    /* re-arm the timer for the next replenishment event */
+    if( !list_empty(replq) )
+    {
+        struct rt_vcpu *svc_next = __replq_elem(replq->next);
+        set_timer(repl_timer, svc_next->cur_deadline);
+    }
+}
+
+/*
+ * A utility function that inserts a vcpu into a
+ * queue in deadline (EDF) order. The return
+ * value is 1 if the vcpu has been inserted at the
+ * front of the list
+ */
+static int
+deadline_queue_insert(struct rt_vcpu * (*_get_q_elem)(struct list_head *elem),
+    struct rt_vcpu *svc, struct list_head *elem, struct list_head *queue)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    list_for_each(iter, queue)
+    {
+        struct rt_vcpu * iter_svc = (*_get_q_elem)(iter);
+        if ( svc->cur_deadline <= iter_svc->cur_deadline )
+            break;
+
+        pos++;
+    }
+
+    list_add_tail(elem, iter);
+    return !pos;
 }
 
 /*
@@ -397,7 +493,6 @@  __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
 {
     struct rt_private *prv = rt_priv(ops);
     struct list_head *runq = rt_runq(ops);
-    struct list_head *iter;
 
     ASSERT( spin_is_locked(&prv->lock) );
 
@@ -405,22 +500,38 @@  __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
 
     /* add svc to runq if svc still has budget */
     if ( svc->cur_budget > 0 )
-    {
-        list_for_each(iter, runq)
-        {
-            struct rt_vcpu * iter_svc = __q_elem(iter);
-            if ( svc->cur_deadline <= iter_svc->cur_deadline )
-                    break;
-         }
-        list_add_tail(&svc->q_elem, iter);
-    }
+        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
     else
     {
         list_add(&svc->q_elem, &prv->depletedq);
+        ASSERT( __vcpu_on_replq(svc) );
     }
 }
 
 /*
+ * Insert svc into the replenishment event list
+ * in replenishment time order.
+ * vcpus that need to be replenished earlier go first.
+ * The timer may be re-programmed if svc is inserted
+ * at the front of the event list.
+ */
+static void
+__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct list_head *replq = rt_replq(ops);
+    struct rt_private *prv = rt_priv(ops);
+    struct timer *repl_timer = prv->repl_timer;
+    int set;
+
+    ASSERT( !__vcpu_on_replq(svc) );
+
+    set = deadline_queue_insert(&__replq_elem, svc, &svc->replq_elem, replq);
+ 
+    if( set )
+        set_timer(repl_timer,svc->cur_deadline);
+}
+
+/*
  * Init/Free related code
  */
 static int
@@ -449,11 +560,18 @@  rt_init(struct scheduler *ops)
     INIT_LIST_HEAD(&prv->sdom);
     INIT_LIST_HEAD(&prv->runq);
     INIT_LIST_HEAD(&prv->depletedq);
+    INIT_LIST_HEAD(&prv->replq);
 
     cpumask_clear(&prv->tickled);
 
     ops->sched_data = prv;
 
+    /* 
+     * The timer initialization will happen later when 
+     * the first pcpu is added to this pool in alloc_pdata
+     */
+    prv->repl_timer = NULL;
+
     return 0;
 
  no_mem:
@@ -473,6 +591,10 @@  rt_deinit(const struct scheduler *ops)
         xfree(_cpumask_scratch);
         _cpumask_scratch = NULL;
     }
+
+    kill_timer(prv->repl_timer);
+    xfree(prv->repl_timer);
+
     xfree(prv);
 }
 
@@ -493,6 +615,17 @@  rt_alloc_pdata(const struct scheduler *ops, int cpu)
     if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
         return NULL;
 
+    if( prv->repl_timer == NULL )
+    {   
+        /* allocate the timer on the first cpu of this pool */
+        prv->repl_timer = xzalloc(struct timer);
+
+        if(prv->repl_timer == NULL )
+            return NULL;
+
+        init_timer(prv->repl_timer, repl_handler, (void *)ops, cpu);
+    }
+
     /* 1 indicates alloc. succeed in schedule.c */
     return (void *)1;
 }
@@ -586,6 +719,7 @@  rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         return NULL;
 
     INIT_LIST_HEAD(&svc->q_elem);
+    INIT_LIST_HEAD(&svc->replq_elem);
     svc->flags = 0U;
     svc->sdom = dd;
     svc->vcpu = vc;
@@ -609,7 +743,8 @@  rt_free_vdata(const struct scheduler *ops, void *priv)
 }
 
 /*
- * This function is called in sched_move_domain() in schedule.c
+ * This function is called in sched_move_domain() and in
+ * sched_init_vcpu() in schedule.c, when moving a domain
+ * to a new cpupool.
  * It inserts vcpus of moving domain to the scheduler's RunQ in
  * dest. cpupool.
@@ -651,6 +786,10 @@  rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
     if ( __vcpu_on_q(svc) )
         __q_remove(svc);
+
+    if ( __vcpu_on_replq(svc) )
+        __replq_remove(ops, svc);
+
     vcpu_schedule_unlock_irq(lock, vc);
 }
 
@@ -785,44 +924,6 @@  __runq_pick(const struct scheduler *ops, const cpumask_t *mask)
 }
 
 /*
- * Update vcpu's budget and
- * sort runq by insert the modifed vcpu back to runq
- * lock is grabbed before calling this function
- */
-static void
-__repl_update(const struct scheduler *ops, s_time_t now)
-{
-    struct list_head *runq = rt_runq(ops);
-    struct list_head *depletedq = rt_depletedq(ops);
-    struct list_head *iter;
-    struct list_head *tmp;
-    struct rt_vcpu *svc = NULL;
-
-    list_for_each_safe(iter, tmp, runq)
-    {
-        svc = __q_elem(iter);
-        if ( now < svc->cur_deadline )
-            break;
-
-        rt_update_deadline(now, svc);
-        /* reinsert the vcpu if its deadline is updated */
-        __q_remove(svc);
-        __runq_insert(ops, svc);
-    }
-
-    list_for_each_safe(iter, tmp, depletedq)
-    {
-        svc = __q_elem(iter);
-        if ( now >= svc->cur_deadline )
-        {
-            rt_update_deadline(now, svc);
-            __q_remove(svc); /* remove from depleted queue */
-            __runq_insert(ops, svc); /* add to runq */
-        }
-    }
-}
-
-/*
  * schedule function for rt scheduler.
  * The lock is already grabbed in schedule.c, no need to lock here
  */
@@ -841,7 +942,6 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
     /* burn_budget would return for IDLE VCPU */
     burn_budget(ops, scurr, now);
 
-    __repl_update(ops, now);
 
     if ( tasklet_work_scheduled )
     {
@@ -868,6 +968,8 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
         set_bit(__RTDS_delayed_runq_add, &scurr->flags);
 
     snext->last_start = now;
+
+    ret.time = -1; /* no slice timer if an idle vcpu is picked, to avoid busy-waiting */
     if ( !is_idle_vcpu(snext->vcpu) )
     {
         if ( snext != scurr )
@@ -880,9 +982,11 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
             snext->vcpu->processor = cpu;
             ret.migrated = 1;
         }
+
+        ret.time = snext->cur_budget; /* re-invoke the scheduler when the remaining budget expires */
     }
 
-    ret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched quantum */
     ret.task = snext->vcpu;
 
     /* TRACE */
@@ -924,6 +1028,8 @@  rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         __q_remove(svc);
     else if ( svc->flags & RTDS_delayed_runq_add )
         clear_bit(__RTDS_delayed_runq_add, &svc->flags);
+
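+    /* a sleeping vcpu needs no replenishment; the event is re-queued at wake-up */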
+    __replq_remove(ops, svc);
 }
 
 /*
@@ -1026,10 +1132,6 @@  rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 {
     struct rt_vcpu * const svc = rt_vcpu(vc);
     s_time_t now = NOW();
-    struct rt_private *prv = rt_priv(ops);
-    struct rt_vcpu *snext = NULL; /* highest priority on RunQ */
-    struct rt_dom *sdom = NULL;
-    cpumask_t *online;
 
     BUG_ON( is_idle_vcpu(vc) );
 
@@ -1051,6 +1153,22 @@  rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     else
         SCHED_STAT_CRANK(vcpu_wake_not_runnable);
 
+    /*
+     * If a deadline passed while svc was asleep/blocked, we need new
+     * scheduling parameters (a new deadline and a full budget), and
+     * also a new replenishment event.
+     */
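+    /*
+     * e.g., with a 10ms period and cur_deadline = 30ms, a wake-up at
+     * t = 47ms makes rt_update_deadline() move cur_deadline to 50ms
+     * and refill the budget (illustrative numbers only).
+     */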
+    if ( now >= svc->cur_deadline )
+    {
+        rt_update_deadline(now, svc);
+        __replq_remove(ops, svc);
+    }
+
+    if ( !__vcpu_on_replq(svc) )
+    {
+        __replq_insert(ops, svc);
+        ASSERT( vcpu_runnable(vc) );
+    }
+
     /* If context hasn't been saved for this vcpu yet, we can't put it on
      * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
      * put on the Runqueue/DepletedQ after the context has been saved.
@@ -1061,22 +1179,10 @@  rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
         return;
     }
 
-    if ( now >= svc->cur_deadline)
-        rt_update_deadline(now, svc);
-
     /* insert svc to runq/depletedq because svc is not in queue now */
     __runq_insert(ops, svc);
 
-    __repl_update(ops, now);
-
-    ASSERT(!list_empty(&prv->sdom));
-    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-    online = cpupool_domain_cpumask(sdom->dom);
-    snext = __runq_pick(ops, online); /* pick snext from ALL valid cpus */
-
-    runq_tickle(ops, snext);
-
-    return;
+    runq_tickle(ops, svc);
 }
 
 /*
@@ -1087,10 +1193,6 @@  static void
 rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
 {
     struct rt_vcpu *svc = rt_vcpu(vc);
-    struct rt_vcpu *snext = NULL;
-    struct rt_dom *sdom = NULL;
-    struct rt_private *prv = rt_priv(ops);
-    cpumask_t *online;
     spinlock_t *lock = vcpu_schedule_lock_irq(vc);
 
     clear_bit(__RTDS_scheduled, &svc->flags);
@@ -1102,14 +1204,7 @@  rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
          likely(vcpu_runnable(vc)) )
     {
         __runq_insert(ops, svc);
-        __repl_update(ops, NOW());
-
-        ASSERT(!list_empty(&prv->sdom));
-        sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-        online = cpupool_domain_cpumask(sdom->dom);
-        snext = __runq_pick(ops, online); /* pick snext from ALL cpus */
-
-        runq_tickle(ops, snext);
+        runq_tickle(ops, svc);
     }
 out:
     vcpu_schedule_unlock_irq(lock, vc);
@@ -1168,6 +1263,71 @@  rt_dom_cntl(
     return rc;
 }
 
+/*
+ * The replenishment timer handler picks vcpus
+ * from the replq and does the actual replenishment.
+ */
+static void repl_handler(void *data)
+{
+    unsigned long flags;
+    s_time_t now = NOW();
+    struct scheduler *ops = data;
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct timer *repl_timer = prv->repl_timer;
+    struct list_head *iter, *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    spin_lock_irqsave(&prv->lock, flags);
+
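+    /*
+     * The timer stays stopped while all the expired events are
+     * serviced; it is re-armed below, once the new head is known.
+     */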
+    stop_timer(repl_timer);
+
+    list_for_each_safe(iter, tmp, replq)
+    {
+        svc = __replq_elem(iter);
+
+        if ( now < svc->cur_deadline )
+            break;
+
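+        /* this is the replenishment: a new deadline and a full budget */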
+        rt_update_deadline(now, svc);
+
+        /*
+         * When the replenishment happens, svc is
+         * either running on a pcpu, or on the
+         * runq/depletedq.
+         */
+        if ( __vcpu_on_q(svc) )
+        {
+            /* put back to runq */
+            __q_remove(svc);
+            __runq_insert(ops, svc);
+        }
+
+        /*
+         * Tickle regardless of where svc is, because a
+         * running vcpu could end up with a later deadline
+         * than others after the replenishment.
+         */
+        runq_tickle(ops, svc);
+
+        /*
+         * Re-insert svc into the replenishment event queue
+         * without reprogramming the timer.
+         */
+        deadline_queue_remove(&svc->replq_elem);
+        deadline_queue_insert(&__replq_elem, svc, &svc->replq_elem, replq);
+    }
+
+    /*
+     * Re-arm the timer for the next replenishment event,
+     * i.e. the vcpu now at the front of the replq; if the
+     * queue is empty, leave the timer stopped.
+     */
+    if ( !list_empty(replq) )
+        set_timer(repl_timer, __replq_elem(replq->next)->cur_deadline);
+
+    spin_unlock_irqrestore(&prv->lock, flags);
+}
+
 static struct rt_private _rt_priv;
 
 static const struct scheduler sched_rtds_def = {