[v2,2/3] xen: add hypercall option to temporarily pin a vcpu

Message ID 1456822933-25041-3-git-send-email-jgross@suse.com (mailing list archive)
State New, archived

Commit Message

Jürgen Groß March 1, 2016, 9:02 a.m. UTC
Some hardware (e.g. Dell studio 1555 laptops) requires SMIs to be
called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
to achieve this by pinning the running thread to cpu 0, but in Dom0
this is not enough: the vcpu must be pinned to physical cpu 0 via
Xen, too.

Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
hypercall to achieve this. It takes a physical cpu number as its
parameter. If pinning is possible (the calling domain has the
privilege to make the call and the cpu is available in the domain's
cpupool), the calling vcpu is pinned to the specified cpu and the old
cpu affinity is saved. To undo the temporary pinning, a cpu of -1 is
specified. This restores the original cpu affinity of the vcpu.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2: - limit operation to hardware domain as suggested by Jan Beulich
    - some style issues corrected as requested by Jan Beulich
    - use fixed width types in interface as requested by Jan Beulich
    - add compat layer checking as requested by Jan Beulich
---
 xen/common/compat/schedule.c |  4 ++
 xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
 xen/include/public/sched.h   | 17 ++++++++
 xen/include/xlat.lst         |  1 +
 4 files changed, 109 insertions(+), 5 deletions(-)
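
The pin/undo semantics described in the commit message above can be
modelled in a few lines of stand-alone C. This is only a sketch under
stated assumptions, not the code from the patch: struct model_vcpu,
model_pin_override() and the 64-cpu bitmask are hypothetical, and the
error codes are guesses based on the hunks quoted in the review
(-EBUSY while affinity_broken is set, -EINVAL as the default return).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a vcpu; this is NOT Xen's struct vcpu. */
struct model_vcpu {
    uint64_t hard_affinity;   /* bit n set: vcpu may run on physical cpu n */
    uint64_t saved_affinity;  /* affinity remembered while pinned */
    bool affinity_broken;     /* true while a temporary pin is active */
};

/*
 * Model of the operation: cpu >= 0 pins the vcpu, cpu < 0 undoes the pin.
 * pool_cpus is the set of cpus in the domain's cpupool (assumed <= 64 cpus).
 * Return values are assumptions made for this sketch.
 */
static int model_pin_override(struct model_vcpu *v, int cpu, uint64_t pool_cpus)
{
    if (cpu < 0) {
        /* Undo: restore the saved affinity if a pin is active. */
        if (v->affinity_broken) {
            v->hard_affinity = v->saved_affinity;
            v->affinity_broken = false;
        }
        return 0;
    }

    /* The requested cpu must be part of the domain's cpupool. */
    if (cpu >= 64 || !(pool_cpus & (UINT64_C(1) << cpu)))
        return -EINVAL;

    /* A second pin while one is active is refused in this model. */
    if (v->affinity_broken)
        return -EBUSY;

    v->saved_affinity = v->hard_affinity;
    v->hard_affinity = UINT64_C(1) << cpu;
    v->affinity_broken = true;
    return 0;
}
```

A caller would pin with a non-negative cpu number, issue the SMI, and
then call again with a negative value to get the old affinity back.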

Comments

Jan Beulich March 1, 2016, 11:27 a.m. UTC | #1
>>> On 01.03.16 at 10:02, <JGross@suse.com> wrote:
> @@ -752,14 +766,20 @@ static int vcpu_set_affinity(
>      struct vcpu *v, const cpumask_t *affinity, cpumask_t *which)
>  {
>      spinlock_t *lock;
> +    int ret = 0;
>  
>      lock = vcpu_schedule_lock_irq(v);
>  
> -    cpumask_copy(which, affinity);
> +    if ( v->affinity_broken )
> +        ret = -EBUSY;
> +    else
> +    {
> +        cpumask_copy(which, affinity);
>  
> -    /* Always ask the scheduler to re-evaluate placement
> -     * when changing the affinity */
> -    set_bit(_VPF_migrating, &v->pause_flags);
> +        /* Always ask the scheduler to re-evaluate placement
> +         * when changing the affinity */
> +        set_bit(_VPF_migrating, &v->pause_flags);

When you touch code like this, would it be possible to at once fix
the coding style issues it (the comment in this case) has?

> @@ -978,6 +998,51 @@ void watchdog_domain_destroy(struct domain *d)
>          kill_timer(&d->watchdog_timer[i]);
>  }
>  
> +static long do_pin_temp(int cpu)

As expressed before, throughout this patch I dislike the "temp"
naming, when the temporary nature of this operation isn't being
enforced by anything.

Apart from that I (vaguely) recall there having been previous
suggestions in the direction of (temporary), which have got
rejected.

On both points I think we need to have input from the scheduler
maintainers.

> +{
> +    struct vcpu *v = current;
> +    spinlock_t *lock;
> +    long ret = -EINVAL;

"int" seems completely sufficient for both the variable and the
function return type.

> @@ -1087,6 +1152,23 @@ ret_t do_sched_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          break;
>      }
>  
> +    case SCHEDOP_pin_temp:
> +    {
> +        struct sched_pin_temp sched_pin_temp;
> +
> +        ret = -EFAULT;
> +        if ( copy_from_guest(&sched_pin_temp, arg, 1) )
> +            break;
> +
> +        ret = -EPERM;
> +        if ( !is_hardware_domain(current->domain) )
> +            break;

I'd generally suggest swapping these two.

> --- a/xen/include/public/sched.h
> +++ b/xen/include/public/sched.h
> @@ -118,6 +118,17 @@
>   * With id != 0 and timeout != 0, poke watchdog timer and set new timeout.
>   */
>  #define SCHEDOP_watchdog    6
> +
> +/*
> + * Temporarily pin the current vcpu to one physical cpu or undo that pinning.
> + * @arg == pointer to sched_pin_temp_t structure.
> + *
> + * Setting pcpu to -1 will undo a previous temporary pinning and restore the
> + * previous cpu affinity. The temporary aspect of the pinning isn't enforced
> + * by the hypervisor.

This comment is now out of sync with the code, since you now
accept any negative CPU number as "undo" request.

Jan
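
The suggestion to swap the -EFAULT and -EPERM checks can be illustrated
with a small stand-alone model: an unprivileged caller is rejected
before its argument is ever read. Everything here is a hypothetical
sketch; model_sched_pin, model_copy_from_guest() and
model_sched_op_pin() are not Xen interfaces, and the int32_t pcpu field
is only an assumption based on the "fixed width types" changelog entry.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument layout; only the field name pcpu is from the thread. */
struct model_sched_pin {
    int32_t pcpu;
};

/* Stand-in for copy_from_guest(): returns non-zero on failure. */
static int model_copy_from_guest(struct model_sched_pin *dst, const void *guest_arg)
{
    if (guest_arg == NULL)
        return 1;
    memcpy(dst, guest_arg, sizeof(*dst));
    return 0;
}

/* Privilege check first, argument copy second. */
static int model_sched_op_pin(bool is_hardware_domain, const void *guest_arg,
                              int32_t *pcpu_out)
{
    struct model_sched_pin arg;

    if (!is_hardware_domain)
        return -EPERM;

    if (model_copy_from_guest(&arg, guest_arg))
        return -EFAULT;

    *pcpu_out = arg.pcpu;
    return 0;
}
```

With this ordering a caller without the privilege always sees -EPERM,
regardless of whether the argument pointer is valid.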
David Vrabel March 1, 2016, 11:55 a.m. UTC | #2
On 01/03/16 09:02, Juergen Gross wrote:
> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
> to achieve this by pinning the running thread to cpu 0, but in Dom0
> this is not enough: the vcpu must be pinned to physical cpu 0 via
> Xen, too.
> 
> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
> hypercall to achieve this. It is taking a physical cpu number as
> parameter. If pinning is possible (the calling domain has the
> privilege to make the call and the cpu is available in the domain's
> cpupool) the calling vcpu is pinned to the specified cpu. The old
> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
> specified. This will restore the original cpu affinity for the vcpu.

I suggest SCHEDOP_pin_override as a name.

David
Jürgen Groß March 1, 2016, 11:58 a.m. UTC | #3
On 01/03/16 12:27, Jan Beulich wrote:
>>>> On 01.03.16 at 10:02, <JGross@suse.com> wrote:
>> @@ -752,14 +766,20 @@ static int vcpu_set_affinity(
>>      struct vcpu *v, const cpumask_t *affinity, cpumask_t *which)
>>  {
>>      spinlock_t *lock;
>> +    int ret = 0;
>>  
>>      lock = vcpu_schedule_lock_irq(v);
>>  
>> -    cpumask_copy(which, affinity);
>> +    if ( v->affinity_broken )
>> +        ret = -EBUSY;
>> +    else
>> +    {
>> +        cpumask_copy(which, affinity);
>>  
>> -    /* Always ask the scheduler to re-evaluate placement
>> -     * when changing the affinity */
>> -    set_bit(_VPF_migrating, &v->pause_flags);
>> +        /* Always ask the scheduler to re-evaluate placement
>> +         * when changing the affinity */
>> +        set_bit(_VPF_migrating, &v->pause_flags);
> 
> When you touch code like this, would it be possible to at once fix
> the coding style issues it (the comment in this case) has?

Sure, NP.

> 
>> @@ -978,6 +998,51 @@ void watchdog_domain_destroy(struct domain *d)
>>          kill_timer(&d->watchdog_timer[i]);
>>  }
>>  
>> +static long do_pin_temp(int cpu)
> 
> As expressed before, throughout this patch I dislike the "temp"
> naming, when the temporary nature of this operation isn't being
> enforced by anything.
> 
> Apart from that I (vaguely) recall there having been previous
> suggestions in the direction of (temporary), which have got
> rejected.
> 
> On both points I think we need to have input from the scheduler
> maintainers.

Okay. I don't mind changing the name. We should just agree on one.

> 
>> +{
>> +    struct vcpu *v = current;
>> +    spinlock_t *lock;
>> +    long ret = -EINVAL;
> 
> "int" seems completely sufficient for both the variable and the
> function return type.

Hmm, yes.

> 
>> @@ -1087,6 +1152,23 @@ ret_t do_sched_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>          break;
>>      }
>>  
>> +    case SCHEDOP_pin_temp:
>> +    {
>> +        struct sched_pin_temp sched_pin_temp;
>> +
>> +        ret = -EFAULT;
>> +        if ( copy_from_guest(&sched_pin_temp, arg, 1) )
>> +            break;
>> +
>> +        ret = -EPERM;
>> +        if ( !is_hardware_domain(current->domain) )
>> +            break;
> 
> I'd generally suggest swapping these two.

Will do.

> 
>> --- a/xen/include/public/sched.h
>> +++ b/xen/include/public/sched.h
>> @@ -118,6 +118,17 @@
>>   * With id != 0 and timeout != 0, poke watchdog timer and set new timeout.
>>   */
>>  #define SCHEDOP_watchdog    6
>> +
>> +/*
>> + * Temporarily pin the current vcpu to one physical cpu or undo that pinning.
>> + * @arg == pointer to sched_pin_temp_t structure.
>> + *
>> + * Setting pcpu to -1 will undo a previous temporary pinning and restore the
>> + * previous cpu affinity. The temporary aspect of the pinning isn't enforced
>> + * by the hypervisor.
> 
> This comment is now out of sync with the code, since you now
> accept any negative CPU number as "undo" request.

Will change it.


Juergen
Jürgen Groß March 1, 2016, 11:58 a.m. UTC | #4
On 01/03/16 12:55, David Vrabel wrote:
> On 01/03/16 09:02, Juergen Gross wrote:
>> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
>> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
>> to achieve this by pinning the running thread to cpu 0, but in Dom0
>> this is not enough: the vcpu must be pinned to physical cpu 0 via
>> Xen, too.
>>
>> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
>> hypercall to achieve this. It is taking a physical cpu number as
>> parameter. If pinning is possible (the calling domain has the
>> privilege to make the call and the cpu is available in the domain's
>> cpupool) the calling vcpu is pinned to the specified cpu. The old
>> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
>> specified. This will restore the original cpu affinity for the vcpu.
> 
> I suggest SCHEDOP_pin_override as a name.

I'm fine with that. Any objections?


Juergen
Dario Faggioli March 1, 2016, 12:15 p.m. UTC | #5
On Tue, 2016-03-01 at 12:58 +0100, Juergen Gross wrote:
> On 01/03/16 12:55, David Vrabel wrote:
> > 
> > On 01/03/16 09:02, Juergen Gross wrote:
> > > 
> > > Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
> > > called on physical cpu 0 only. Linux drivers like dcdbas or i8k
> > > try
> > > to achieve this by pinning the running thread to cpu 0, but in
> > > Dom0
> > > this is not enough: the vcpu must be pinned to physical cpu 0 via
> > > Xen, too.
> > > 
> > > Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
> > > hypercall to achieve this. It is taking a physical cpu number as
> > > parameter. If pinning is possible (the calling domain has the
> > > privilege to make the call and the cpu is available in the
> > > domain's
> > > cpupool) the calling vcpu is pinned to the specified cpu. The old
> > > cpu affinity is saved. To undo the temporary pinning a cpu -1 is
> > > specified. This will restore the original cpu affinity for the
> > > vcpu.
> > I suggest SCHEDOP_pin_override as a name.
>
> I'm fine with that. Any objections?
> 
Not at all. I actually like it a lot.

Thanks and Regards,
Dario
George Dunlap March 1, 2016, 2:02 p.m. UTC | #6
On 01/03/16 12:15, Dario Faggioli wrote:
> On Tue, 2016-03-01 at 12:58 +0100, Juergen Gross wrote:
>> On 01/03/16 12:55, David Vrabel wrote:
>>>
>>> On 01/03/16 09:02, Juergen Gross wrote:
>>>>
>>>> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
>>>> called on physical cpu 0 only. Linux drivers like dcdbas or i8k
>>>> try
>>>> to achieve this by pinning the running thread to cpu 0, but in
>>>> Dom0
>>>> this is not enough: the vcpu must be pinned to physical cpu 0 via
>>>> Xen, too.
>>>>
>>>> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
>>>> hypercall to achieve this. It is taking a physical cpu number as
>>>> parameter. If pinning is possible (the calling domain has the
>>>> privilege to make the call and the cpu is available in the
>>>> domain's
>>>> cpupool) the calling vcpu is pinned to the specified cpu. The old
>>>> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
>>>> specified. This will restore the original cpu affinity for the
>>>> vcpu.
>>> I suggest SCHEDOP_pin_override as a name.
>>
>> I'm fine with that. Any objections?
>>
> Not at all. I actually like it a lot.

+1 to the name.

 -George
George Dunlap March 1, 2016, 3:52 p.m. UTC | #7
On 01/03/16 09:02, Juergen Gross wrote:
> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
> to achieve this by pinning the running thread to cpu 0, but in Dom0
> this is not enough: the vcpu must be pinned to physical cpu 0 via
> Xen, too.
> 
> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
> hypercall to achieve this. It is taking a physical cpu number as
> parameter. If pinning is possible (the calling domain has the
> privilege to make the call and the cpu is available in the domain's
> cpupool) the calling vcpu is pinned to the specified cpu. The old
> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
> specified. This will restore the original cpu affinity for the vcpu.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2: - limit operation to hardware domain as suggested by Jan Beulich
>     - some style issues corrected as requested by Jan Beulich
>     - use fixed width types in interface as requested by Jan Beulich
>     - add compat layer checking as requested by Jan Beulich
> ---
>  xen/common/compat/schedule.c |  4 ++
>  xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
>  xen/include/public/sched.h   | 17 ++++++++
>  xen/include/xlat.lst         |  1 +
>  4 files changed, 109 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
> index 812c550..73b0f01 100644
> --- a/xen/common/compat/schedule.c
> +++ b/xen/common/compat/schedule.c
> @@ -10,6 +10,10 @@
>  
>  #define do_sched_op compat_sched_op
>  
> +#define xen_sched_pin_temp sched_pin_temp
> +CHECK_sched_pin_temp;
> +#undef xen_sched_pin_temp
> +
>  #define xen_sched_shutdown sched_shutdown
>  CHECK_sched_shutdown;
>  #undef xen_sched_shutdown
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index b0d4b18..653f852 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>      struct scheduler *old_ops;
>      void *old_domdata;
>  
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( v->affinity_broken )
> +            return -EBUSY;
> +    }
> +
>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>      if ( domdata == NULL )
>          return -ENOMEM;
> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>              if ( cpumask_empty(&online_affinity) &&
>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>              {
> +                if ( v->affinity_broken )
> +                {
> +                    /* The vcpu is temporarily pinned, can't move it. */
> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
> +                    ret = -EBUSY;
> +                    break;
> +                }

Does this mean that if the user closes the laptop lid while one of these
drivers has vcpu0 pinned, that Xen will crash (see
xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make
sure that all temporary pins are removed before suspending?

Also -- have you actually tested the "cpupool move while pinned"
functionality to make sure it actually works?  There's a weird bit in
cpupool_unassign_cpu_helper() where after calling
cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the
cpupool_free_cpus mask, even if it returns an error.  That can't be
right, even for the existing -EAGAIN case, can it?

I see that you have a loop to retry this call several times in the next
patch; but what if it fails every time -- what state is the system in?

And, in general, what happens if the device driver gets mixed up and
forgets to unpin the vcpu?  Is the only recourse to reboot your host (or
deal with the fact that you can't reconfigure your cpupools)?

 -George
George Dunlap March 1, 2016, 3:55 p.m. UTC | #8
On 01/03/16 15:52, George Dunlap wrote:
> On 01/03/16 09:02, Juergen Gross wrote:
>> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
>> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
>> to achieve this by pinning the running thread to cpu 0, but in Dom0
>> this is not enough: the vcpu must be pinned to physical cpu 0 via
>> Xen, too.
>>
>> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
>> hypercall to achieve this. It is taking a physical cpu number as
>> parameter. If pinning is possible (the calling domain has the
>> privilege to make the call and the cpu is available in the domain's
>> cpupool) the calling vcpu is pinned to the specified cpu. The old
>> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
>> specified. This will restore the original cpu affinity for the vcpu.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V2: - limit operation to hardware domain as suggested by Jan Beulich
>>     - some style issues corrected as requested by Jan Beulich
>>     - use fixed width types in interface as requested by Jan Beulich
>>     - add compat layer checking as requested by Jan Beulich
>> ---
>>  xen/common/compat/schedule.c |  4 ++
>>  xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
>>  xen/include/public/sched.h   | 17 ++++++++
>>  xen/include/xlat.lst         |  1 +
>>  4 files changed, 109 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
>> index 812c550..73b0f01 100644
>> --- a/xen/common/compat/schedule.c
>> +++ b/xen/common/compat/schedule.c
>> @@ -10,6 +10,10 @@
>>  
>>  #define do_sched_op compat_sched_op
>>  
>> +#define xen_sched_pin_temp sched_pin_temp
>> +CHECK_sched_pin_temp;
>> +#undef xen_sched_pin_temp
>> +
>>  #define xen_sched_shutdown sched_shutdown
>>  CHECK_sched_shutdown;
>>  #undef xen_sched_shutdown
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index b0d4b18..653f852 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>      struct scheduler *old_ops;
>>      void *old_domdata;
>>  
>> +    for_each_vcpu ( d, v )
>> +    {
>> +        if ( v->affinity_broken )
>> +            return -EBUSY;
>> +    }
>> +
>>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>>      if ( domdata == NULL )
>>          return -ENOMEM;
>> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>>              if ( cpumask_empty(&online_affinity) &&
>>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>>              {
>> +                if ( v->affinity_broken )
>> +                {
>> +                    /* The vcpu is temporarily pinned, can't move it. */
>> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
>> +                    ret = -EBUSY;
>> +                    break;
>> +                }
> 
> Does this mean that if the user closes the laptop lid while one of these
> drivers has vcpu0 pinned, that Xen will crash (see
> xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make
> sure that all temporary pins are removed before suspending?
> 
> Also -- have you actually tested the "cpupool move while pinned"
> functionality to make sure it actually works?  There's a weird bit in
> cpupool_unassign_cpu_helper() where after calling
> cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the
> cpupool_free_cpus mask, even if it returns an error.  That can't be
> right, even for the existing -EAGAIN case, can it?
> 
> I see that you have a loop to retry this call several times in the next
> patch; but what if it fails every time -- what state is the system in?
> 
> And, in general, what happens if the device driver gets mixed up and
> forgets to unpin the vcpu?  Is the only recourse to reboot your host (or
> deal with the fact that you can't reconfigure your cpupools)?

(I should say, I think this probably is the best solution to this
problem; I just want to make sure we think about the error cases carefully.)

 -George
Jan Beulich March 1, 2016, 4:11 p.m. UTC | #9
>>> On 01.03.16 at 16:55, <george.dunlap@citrix.com> wrote:
> On 01/03/16 15:52, George Dunlap wrote:
>> On 01/03/16 09:02, Juergen Gross wrote:
>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>>      struct scheduler *old_ops;
>>>      void *old_domdata;
>>>  
>>> +    for_each_vcpu ( d, v )
>>> +    {
>>> +        if ( v->affinity_broken )
>>> +            return -EBUSY;
>>> +    }
>>> +
>>>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>>>      if ( domdata == NULL )
>>>          return -ENOMEM;
>>> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>>>              if ( cpumask_empty(&online_affinity) &&
>>>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>>>              {
>>> +                if ( v->affinity_broken )
>>> +                {
>>> +                    /* The vcpu is temporarily pinned, can't move it. */
>>> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
>>> +                    ret = -EBUSY;
>>> +                    break;
>>> +                }
>> 
>> Does this mean that if the user closes the laptop lid while one of these
>> drivers has vcpu0 pinned, that Xen will crash (see
>> xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make
>> sure that all temporary pins are removed before suspending?
>> 
>> Also -- have you actually tested the "cpupool move while pinned"
>> functionality to make sure it actually works?  There's a weird bit in
>> cpupool_unassign_cpu_helper() where after calling
>> cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the
>> cpupool_free_cpus mask, even if it returns an error.  That can't be
>> right, even for the existing -EAGAIN case, can it?
>> 
>> I see that you have a loop to retry this call several times in the next
>> patch; but what if it fails every time -- what state is the system in?
>> 
>> And, in general, what happens if the device driver gets mixed up and
>> forgets to unpin the vcpu?  Is the only recourse to reboot your host (or
>> deal with the fact that you can't reconfigure your cpupools)?
> 
> (I should say, I think this probably is the best solution to this
> problem; I just want to make sure we think about the error cases carefully.)

I guess in the worst case there could be a utility or xl command
doing the missing unpin in such an emergency?

Jan
Jürgen Groß March 2, 2016, 7:14 a.m. UTC | #10
On 01/03/16 16:52, George Dunlap wrote:
> On 01/03/16 09:02, Juergen Gross wrote:
>> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
>> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try
>> to achieve this by pinning the running thread to cpu 0, but in Dom0
>> this is not enough: the vcpu must be pinned to physical cpu 0 via
>> Xen, too.
>>
>> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
>> hypercall to achieve this. It is taking a physical cpu number as
>> parameter. If pinning is possible (the calling domain has the
>> privilege to make the call and the cpu is available in the domain's
>> cpupool) the calling vcpu is pinned to the specified cpu. The old
>> cpu affinity is saved. To undo the temporary pinning a cpu -1 is
>> specified. This will restore the original cpu affinity for the vcpu.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V2: - limit operation to hardware domain as suggested by Jan Beulich
>>     - some style issues corrected as requested by Jan Beulich
>>     - use fixed width types in interface as requested by Jan Beulich
>>     - add compat layer checking as requested by Jan Beulich
>> ---
>>  xen/common/compat/schedule.c |  4 ++
>>  xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
>>  xen/include/public/sched.h   | 17 ++++++++
>>  xen/include/xlat.lst         |  1 +
>>  4 files changed, 109 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
>> index 812c550..73b0f01 100644
>> --- a/xen/common/compat/schedule.c
>> +++ b/xen/common/compat/schedule.c
>> @@ -10,6 +10,10 @@
>>  
>>  #define do_sched_op compat_sched_op
>>  
>> +#define xen_sched_pin_temp sched_pin_temp
>> +CHECK_sched_pin_temp;
>> +#undef xen_sched_pin_temp
>> +
>>  #define xen_sched_shutdown sched_shutdown
>>  CHECK_sched_shutdown;
>>  #undef xen_sched_shutdown
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index b0d4b18..653f852 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>      struct scheduler *old_ops;
>>      void *old_domdata;
>>  
>> +    for_each_vcpu ( d, v )
>> +    {
>> +        if ( v->affinity_broken )
>> +            return -EBUSY;
>> +    }
>> +
>>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>>      if ( domdata == NULL )
>>          return -ENOMEM;
>> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>>              if ( cpumask_empty(&online_affinity) &&
>>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>>              {
>> +                if ( v->affinity_broken )
>> +                {
>> +                    /* The vcpu is temporarily pinned, can't move it. */
>> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
>> +                    ret = -EBUSY;
>> +                    break;
>> +                }
> 
> Does this mean that if the user closes the laptop lid while one of these
> drivers has vcpu0 pinned, that Xen will crash (see
> xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make
> sure that all temporary pins are removed before suspending?

Yes, this must be ensured by the OS.

> Also -- have you actually tested the "cpupool move while pinned"
> functionality to make sure it actually works?  There's a weird bit in
> cpupool_unassign_cpu_helper() where after calling
> cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the
> cpupool_free_cpus mask, even if it returns an error.  That can't be
> right, even for the existing -EAGAIN case, can it?

That should be no problem. Such a failure can be repaired easily by
adding the cpu to the cpupool again. Adding a comment seems to be a
good idea. :-)

What is wrong, and even worse: schedule_cpu_switch() returning an error
will leak domlist_read_lock. I'll write another patch to correct this
issue.

> I see that you have a loop to retry this call several times in the next
> patch; but what if it fails every time -- what state is the system in?

The cpu can be added to the original cpupool via "xl cpupool-cpu-add" again.

> And, in general, what happens if the device driver gets mixed up and
> forgets to unpin the vcpu?  Is the only recourse to reboot your host (or
> deal with the fact that you can't reconfigure your cpupools)?

Unless we add a "forced" option to "xl vcpu-pin", yes.

Thanks for the thorough review,

Juergen
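
The lock leak mentioned above (an error return that skips the unlock)
is commonly avoided with a single exit path. The following is a
stand-alone sketch of that pattern only; the real call sites are not
shown in this thread, and model_read_lock(), model_cpu_switch() and
model_switch_under_lock() are hypothetical names, with a counter
standing in for domlist_read_lock.

```c
#include <errno.h>

/* Counter standing in for a held read lock; 0 means "not held". */
static int model_lock_depth;

static void model_read_lock(void)   { model_lock_depth++; }
static void model_read_unlock(void) { model_lock_depth--; }

/* Stand-in for an operation that may fail while the lock is held. */
static int model_cpu_switch(int fail)
{
    return fail ? -EBUSY : 0;
}

static int model_switch_under_lock(int fail)
{
    int ret;

    model_read_lock();

    ret = model_cpu_switch(fail);
    if (ret)
        goto out;   /* the error path still reaches the unlock below */

    /* ... further work that needs the lock would go here ... */

 out:
    model_read_unlock();
    return ret;
}
```

Both the success and the failure path leave the lock counter at zero,
which is the property the leaking error return violates.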
Dario Faggioli March 2, 2016, 9:27 a.m. UTC | #11
On Wed, 2016-03-02 at 08:14 +0100, Juergen Gross wrote:
> On 01/03/16 16:52, George Dunlap wrote:
> > 
> > 
> > Also -- have you actually tested the "cpupool move while pinned"
> > functionality to make sure it actually works?  There's a weird bit
> > in
> > cpupool_unassign_cpu_helper() where after calling
> > cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in
> > the
> > cpupool_free_cpus mask, even if it returns an error.  That can't be
> > right, even for the existing -EAGAIN case, can it?
> That should be no problem. Such a failure can be repaired easily by
> adding the cpu to the cpupool again. 
>
And there's not much else one can do, I would say. When we are in
cpu_disable_scheduler(), coming from
cpupool_unassign_cpu()-->cpupool_unassign_cpu_helper() we're already
halfway through removing the cpu from the pool (e.g., we already cleared
the relevant bit from the cpupool's cpu_valid mask).

And we don't actually want to revert that, as doing so would allow the
scheduler to start again moving vcpus to that cpu (and the following
attempts will risk failing with EAGAIN again :-D).

FWIW, I've also found that part rather weird for quite some time... But
it does indeed make sense, IMO.

> Adding a comment seems to be a
> good idea. :-)
> 
Yep. Should we also add an error message for the user to be able to see
it, even if she can't read the comment in the source code? (Not
necessarily right there, if that would make it trigger too much... just
in a place where it can be seen in the case the user actually needs to
do something).

> What is wrong and even worse, schedule_cpu_switch() returning an
> error
> will leak domlist_read_lock. 
>
Indeed, good catch. :-)

> > And, in general, what happens if the device driver gets mixed up
> > and
> > forgets to unpin the vcpu?  Is the only recourse to reboot your
> > host (or
> > deal with the fact that you can't reconfigure your cpupools)?
> Unless we add a "forced" option to "xl vcpu-pin", yes.
> 
Which would be fine to have, IMO. I'm not sure if it would better be an
`xl vcpu-pin' flag, or a separate utility (as Jan is also saying).

A separate utility would fit better the "emergency nature" of the
thing, avoiding having to clobber xl for that (as this will be the
only, pretty uncommon, case where such flag would be needed).

However, an xl flag is easier to add, easier to document and easier and
more natural to find, from the point of view of a user that really
needs it. And perhaps it could turn out useful for other situations in
future. So, I guess I'd say:
 - yes, let's add that
 - let's do it as a "force flag" of `xl vcpu-pin'.

Regards,
Dario
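
The "halfway removed" state discussed above can be modelled as two
bitmasks. This is a toy sketch of the behaviour as described in the
thread, not the cpupool code: struct model_cpupool, model_free_cpus,
model_unassign_cpu() and model_assign_cpu() are hypothetical, cpu
numbers are assumed to be below 64, and the -EBUSY returned for a cpu
that is not free is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Toy cpupool: bit n of cpu_valid set means cpu n belongs to the pool. */
struct model_cpupool {
    uint64_t cpu_valid;
};

/* Cpus that currently belong to no pool. */
static uint64_t model_free_cpus;

/*
 * disable_ret stands in for the result of cpu_disable_scheduler().
 * The cpu is marked free even when that result is an error, so the
 * scheduler does not start moving vcpus onto it again.
 */
static int model_unassign_cpu(struct model_cpupool *pool, unsigned int cpu,
                              int disable_ret)
{
    uint64_t bit = UINT64_C(1) << cpu;

    pool->cpu_valid &= ~bit;   /* step 1: cpu no longer valid for the pool */
    model_free_cpus |= bit;    /* step 2: done unconditionally */
    return disable_ret;
}

/* Repair step: put a free cpu back into a pool. */
static int model_assign_cpu(struct model_cpupool *pool, unsigned int cpu)
{
    uint64_t bit = UINT64_C(1) << cpu;

    if (!(model_free_cpus & bit))
        return -EBUSY;         /* assumption: cpu is not free */

    model_free_cpus &= ~bit;
    pool->cpu_valid |= bit;
    return 0;
}
```

After a failed unassign the cpu sits in the free mask, and assigning it
to the original pool again restores the previous state.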
Jürgen Groß March 2, 2016, 11:19 a.m. UTC | #12
On 02/03/16 10:27, Dario Faggioli wrote:
> On Wed, 2016-03-02 at 08:14 +0100, Juergen Gross wrote:
>> On 01/03/16 16:52, George Dunlap wrote:
>>>  
>>>
>>> Also -- have you actually tested the "cpupool move while pinned"
>>> functionality to make sure it actually works?  There's a weird bit
>>> in
>>> cpupool_unassign_cpu_helper() where after calling
>>> cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in
>>> the
>>> cpupool_free_cpus mask, even if it returns an error.  That can't be
>>> right, even for the existing -EAGAIN case, can it?
>> That should be no problem. Such a failure can be repaired easily by
>> adding the cpu to the cpupool again. 
>>
> And there's not much else one can do, I would say. When we are in
> cpu_disable_scheduler(), coming from
> cpupool_unassign_cpu()-->cpupool_unassign_cpu() we're already halfway
> through removing the cpu from the pool (e.g., we already cleared the
> relevant bit from the cpupool's cpu_valid mask).
> 
> And we don't actually want to revert that, as doing so would allow the
> scheduler to start again moving vcpus to that cpu (and the following
> attempts will risk failing with EAGAIN again :-D).

Yep.

> 
> FWIW, I've also found that part rather weird for quite some time... But
> it does indeed makes sense, IMO.
> 
>> Adding a comment seems to be a
>> good idea. :-)
>>
> Yep. Should we also add an error message for the user to be able to see
> it, even if she can't read the comment in the source code? (Not
> necessarily right there, if that would make it trigger too much... just
> in a place where it can be seen in the case the user actually need to
> do something).

I'd rather add the error message to xl. That's where the user will see
it and where he can react at once. The message can even tell the user
the correct command, which would be a very strange thing to do in the
hypervisor.

Another patch, I guess. :-)

> 
>> What is wrong and even worse, schedule_cpu_switch() returning an
>> error
>> will leak domlist_read_lock. 
>>
> Indeed, good catch. :-)
> 
>>> And, in general, what happens if the device driver gets mixed up
>>> and
>>> forgets to unpin the vcpu?  Is the only recourse to reboot your
>>> host (or
>>> deal with the fact that you can't reconfigure your cpupools)?
>> Unless we add a "forced" option to "xl vcpu-pin", yes.
>>
> Which would be fine to have, IMO. I'm not sure if it would better be an
> `xl vcpu-pin' flag, or a separate utility (as Jan is also saying).
> 
> A separate utility would fit better the "emergency nature" of the
> thing, avoiding having to clobber xl for that (as this will be the
> only, pretty uncommon, case where such flag would be needed).
> 
> However, an xl flag is easier to add, easier to document and easier and
> more natural to find, from the point of view of a user that really
> needs it. And perhaps it could turn out useful for other situations in
> future. So, I guess I'd say:
>  - yes, let's add that
>  - let's do it as a "force flag" of `xl vcpu-pin'.

Okay, patch will follow...


Juergen
Dario Faggioli March 2, 2016, 11:49 a.m. UTC | #13
On Wed, 2016-03-02 at 12:19 +0100, Juergen Gross wrote:
> On 02/03/16 10:27, Dario Faggioli wrote:
> > 
> > Yep. Should we also add an error message for the user to be able to
> > see it, even if she can't read the comment in the source code? (Not
> > necessarily right there, if that would make it trigger too much...
> > just in a place where it can be seen in the case the user actually
> > needs to do something).
> I'd rather add the error message to xl. That's where the user will
> see it and where he can react at once. The message can even tell the
> user the correct command, which would be a very strange thing to do in
> the hypervisor.
> 
Sure, wherever it's most useful.

> Another patch, I guess. :-)
> 
Yeah, sorry. :-)

Regards,
Dario
Jürgen Groß March 2, 2016, 12:12 p.m. UTC | #14
On 02/03/16 12:49, Dario Faggioli wrote:
> On Wed, 2016-03-02 at 12:19 +0100, Juergen Gross wrote:
>> On 02/03/16 10:27, Dario Faggioli wrote:
>>>  
>>> Yep. Should we also add an error message for the user to be able to
>>> see it, even if she can't read the comment in the source code? (Not
>>> necessarily right there, if that would make it trigger too much...
>>> just in a place where it can be seen in the case the user actually
>>> needs to do something).
>> I'd rather add the error message to xl. That's where the user will
>> see it and where he can react at once. The message can even tell the
>> user the correct command, which would be a very strange thing to do in
>> the hypervisor.
>>
> Sure, wherever it's most useful.
> 
>> Another patch, I guess. :-)
>>
> Yeah, sorry. :-)

Sarcio ergo sum! :-)


Juergen
Jürgen Groß March 2, 2016, 3:34 p.m. UTC | #15
On 02/03/16 10:27, Dario Faggioli wrote:
> On Wed, 2016-03-02 at 08:14 +0100, Juergen Gross wrote:
>> On 01/03/16 16:52, George Dunlap wrote:
>>>  
>>>
>>> Also -- have you actually tested the "cpupool move while pinned"
>>> functionality to make sure it actually works?  There's a weird bit in
>>> cpupool_unassign_cpu_helper() where after calling
>>> cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the
>>> cpupool_free_cpus mask, even if it returns an error.  That can't be
>>> right, even for the existing -EAGAIN case, can it?
>> That should be no problem. Such a failure can be repaired easily by
>> adding the cpu to the cpupool again. 
>>
> And there's not much else one can do, I would say. When we are in
> cpu_disable_scheduler(), coming from
> cpupool_unassign_cpu()-->cpupool_unassign_cpu_helper() we're already halfway
> through removing the cpu from the pool (e.g., we already cleared the
> relevant bit from the cpupool's cpu_valid mask).
> 
> And we don't actually want to revert that, as doing so would allow the
> scheduler to start again moving vcpus to that cpu (and the following
> attempts will risk failing with EAGAIN again :-D).
> 
> FWIW, I've also found that part rather weird for quite some time... But
> it does indeed make sense, IMO.
> 
>> Adding a comment seems to be a
>> good idea. :-)
>>
> Yep. Should we also add an error message for the user to be able to see
> it, even if she can't read the comment in the source code? (Not
> necessarily right there, if that would make it trigger too much... just
> in a place where it can be seen in the case the user actually needs to
> do something).
> 
>> What is wrong, and even worse: schedule_cpu_switch() returning an error
>> will leak domlist_read_lock.
>>
> Indeed, good catch. :-)
> 
>>> And, in general, what happens if the device driver gets mixed up and
>>> forgets to unpin the vcpu?  Is the only recourse to reboot your host
>>> (or deal with the fact that you can't reconfigure your cpupools)?
>> Unless we add a "forced" option to "xl vcpu-pin", yes.
>>
> Which would be fine to have, IMO. I'm not sure if it would better be an
> `xl vcpu-pin' flag, or a separate utility (as Jan is also saying).
> 
> A separate utility would fit better the "emergency nature" of the
> thing, avoiding having to clobber xl for that (as this will be the
> only, pretty uncommon, case where such flag would be needed).
> 
> However, an xl flag is easier to add, easier to document and easier and
> more natural to find, from the point of view of a user that really
> needs it. And perhaps it could turn out useful for other situations in
> future. So, I guess I'd say:
>  - yes, let's add that
>  - let's do it as a "force flag" of `xl vcpu-pin'.

Which raises the question: how to do that on the libxl level?

a) expand libxl_set_vcpuaffinity() with another parameter (is this even
   possible? I could do some ifdeffery, but the API would change...)

b) add a libxl_set_vcpuaffinity_force() variant

c) imply the force flag by specifying both hard and soft maps as NULL
   (it _is_ basically just that: keep both affinity sets), implying that
   it makes no sense to specify any affinities with the -f flag (which
   renders the "force" meaning rather strange, would be more a "restore"
   now).


Juergen

> 
> Regards,
> Dario
>
Dario Faggioli March 2, 2016, 4:03 p.m. UTC | #16
On Wed, 2016-03-02 at 16:34 +0100, Juergen Gross wrote:
> On 02/03/16 10:27, Dario Faggioli wrote:
> > 
> > However, an xl flag is easier to add, easier to document and easier
> > and more natural to find, from the point of view of a user that
> > really needs it. And perhaps it could turn out useful for other
> > situations in future. So, I guess I'd say:
> >  - yes, let's add that
> >  - let's do it as a "force flag" of `xl vcpu-pin'.
> Which raises the question: how to do that on the libxl level?
> 
Ah, right.

> a) expand libxl_set_vcpuaffinity() with another parameter (is this even
>    possible? I could do some ifdeffery, but the API would change...)
> 
> b) add a libxl_set_vcpuaffinity_force() variant
> 
> c) imply the force flag by specifying both hard and soft maps as NULL
>    (it _is_ basically just that: keep both affinity sets), implying that
>    it makes no sense to specify any affinities with the -f flag (which
>    renders the "force" meaning rather strange, would be more a "restore"
>    now).
> 
Eheh, tools' maintainers' call. My preference would be b).

I don't like a), mostly because that would mean everyone will need to
specify a parameter that is really only necessary in special cases.

I could live with c), but it indeed makes the semantics too convoluted
for my taste.

I guess, however, that even if going for b), we need to decide whether
to require a cpumask or not, and what to do if one passes NULL. Maybe
we can have a cpumask parameter and,
 - if it is not NULL, force affinity to that,
 - if it is NULL, just 'restore';
what do you think?

Actually, at Xen level, the override only acts on hard affinity...
should libxl take only one cpumask (for hard affinity only), or both
hard and soft?
I'd say just one for hard is enough, unless we want to make space for a
potential future situation where we will want to break and restore soft
affinity as well...

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Jürgen Groß March 2, 2016, 5:15 p.m. UTC | #17
On 02/03/16 17:03, Dario Faggioli wrote:
> On Wed, 2016-03-02 at 16:34 +0100, Juergen Gross wrote:
>> On 02/03/16 10:27, Dario Faggioli wrote:
>>>  
>>> However, an xl flag is easier to add, easier to document and easier
>>> and more natural to find, from the point of view of a user that
>>> really needs it. And perhaps it could turn out useful for other
>>> situations in future. So, I guess I'd say:
>>>  - yes, let's add that
>>>  - let's do it as a "force flag" of `xl vcpu-pin'.
>> Which raises the question: how to do that on the libxl level?
>>
> Ah, right.
> 
>> a) expand libxl_set_vcpuaffinity() with another parameter (is this even
>>    possible? I could do some ifdeffery, but the API would change...)
>>
>> b) add a libxl_set_vcpuaffinity_force() variant
>>
>> c) imply the force flag by specifying both hard and soft maps as NULL
>>    (it _is_ basically just that: keep both affinity sets), implying that
>>    it makes no sense to specify any affinities with the -f flag (which
>>    renders the "force" meaning rather strange, would be more a "restore"
>>    now).
>>
> Eheh, tools' maintainers' call. My preference would be b).
> 
> I don't like a), mostly because that would mean everyone will need to
> specify a parameter that is really only necessary in special cases.
> 
> I could live with c), but it indeed makes the semantics too convoluted
> for my taste.
> 
> I guess, however, that even if going for b), we need to decide whether
> to require a cpumask or not, and what to do if one passes NULL. Maybe
> we can have a cpumask parameter and,
>  - if it is not NULL, force affinity to that,
>  - if it is NULL, just 'restore';
> what do you think?

I would just let the force flag restore the old setting (thus clearing
the affinity_broken flag) and then apply the normal affinity settings.

> Actually, at Xen level, the override only acts on hard affinity...
> should libxl take only one cpumask (for hard affinity only), or both
> hard and soft?

Just as the user is specifying: 0, 1 or 2.

> I'd say just one for hard is enough, unless we want to make space for a
> potential future situation where we will want to break and restore soft
> affinity as well...

The force flag would be just an add-on. That's rather easy in the
hypervisor and in the tools.


Juergen
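
The force-flag behaviour proposed above (restore the saved affinity, clear affinity_broken, then apply whatever affinity the caller supplied) could be sketched roughly as follows. This is a user-space model only, not hypervisor or libxl code: `struct model_vcpu`, `model_set_hard_affinity()` and `model_set_hard_affinity_force()` are hypothetical names, cpumasks are reduced to a 32-bit bitmask, and only hard affinity is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the affinity-related fields of struct vcpu. */
struct model_vcpu {
    uint32_t hard_affinity;        /* bit n set: vcpu may run on pcpu n */
    uint32_t hard_affinity_saved;  /* affinity to restore after a temp pin */
    int affinity_broken;           /* non-zero while temporarily pinned */
};

/*
 * Normal path, modelled on the patched vcpu_set_affinity(): refuse with
 * -EBUSY while a temporary pin is active. A NULL mask means the caller
 * did not specify a new hard affinity.
 */
static int model_set_hard_affinity(struct model_vcpu *v, const uint32_t *mask)
{
    assert(v != NULL);

    if ( v->affinity_broken )
        return -EBUSY;

    if ( mask != NULL )
        v->hard_affinity = *mask;

    return 0;
}

/*
 * Hypothetical forced variant: first undo any temporary pin, then fall
 * through to the normal path with whatever the caller specified.
 */
static int model_set_hard_affinity_force(struct model_vcpu *v,
                                         const uint32_t *mask)
{
    assert(v != NULL);

    if ( v->affinity_broken )
    {
        v->hard_affinity = v->hard_affinity_saved;
        v->affinity_broken = 0;
    }

    return model_set_hard_affinity(v, mask);
}
```

In this model, forcing with no mask is a pure "restore", while forcing with a mask restores first and then applies the new setting, which matches the "0, 1 or 2" masks remark above for the hard-affinity half.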
Anshul Makkar March 2, 2016, 5:21 p.m. UTC | #18
Hi,


-----Original Message-----
From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of George Dunlap

Sent: 01 March 2016 15:53
To: Juergen Gross <jgross@suse.com>; xen-devel@lists.xen.org
Cc: Wei Liu <wei.liu2@citrix.com>; Stefano Stabellini <Stefano.Stabellini@citrix.com>; George Dunlap <George.Dunlap@citrix.com>; Andrew Cooper <Andrew.Cooper3@citrix.com>; Dario Faggioli <dario.faggioli@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; David Vrabel <david.vrabel@citrix.com>; jbeulich@suse.com
Subject: Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to temporarily pin a vcpu

On 01/03/16 09:02, Juergen Gross wrote:
> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be
> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try to
> achieve this by pinning the running thread to cpu 0, but in Dom0 this
> is not enough: the vcpu must be pinned to physical cpu 0 via Xen, too.
>
> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op
> hypercall to achieve this. It is taking a physical cpu number as
> parameter. If pinning is possible (the calling domain has the
> privilege to make the call and the cpu is available in the domain's
> cpupool) the calling vcpu is pinned to the specified cpu. The old cpu
> affinity is saved. To undo the temporary pinning a cpu -1 is
> specified. This will restore the original cpu affinity for the vcpu.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2: - limit operation to hardware domain as suggested by Jan Beulich
>     - some style issues corrected as requested by Jan Beulich
>     - use fixed width types in interface as requested by Jan Beulich
>     - add compat layer checking as requested by Jan Beulich
> ---
>  xen/common/compat/schedule.c |  4 ++
>  xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
>  xen/include/public/sched.h   | 17 ++++++++
>  xen/include/xlat.lst         |  1 +
>  4 files changed, 109 insertions(+), 5 deletions(-)
>
> diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
> index 812c550..73b0f01 100644
> --- a/xen/common/compat/schedule.c
> +++ b/xen/common/compat/schedule.c
> @@ -10,6 +10,10 @@
>
>  #define do_sched_op compat_sched_op
>
> +#define xen_sched_pin_temp sched_pin_temp
> +CHECK_sched_pin_temp;
> +#undef xen_sched_pin_temp
> +
>  #define xen_sched_shutdown sched_shutdown
>  CHECK_sched_shutdown;
>  #undef xen_sched_shutdown
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index b0d4b18..653f852 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>      struct scheduler *old_ops;
>      void *old_domdata;
>
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( v->affinity_broken )
> +            return -EBUSY;
> +    }
> +
>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>      if ( domdata == NULL )
>          return -ENOMEM;
> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>              if ( cpumask_empty(&online_affinity) &&
>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>              {
> +                if ( v->affinity_broken )
> +                {
> +                    /* The vcpu is temporarily pinned, can't move it. */
> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
> +                    ret = -EBUSY;
> +                    break;
> +                }

Does this mean that if the user closes the laptop lid while one of these drivers has vcpu0 pinned, that Xen will crash (see xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make sure that all temporary pins are removed before suspending?

Also -- have you actually tested the "cpupool move while pinned"
functionality to make sure it actually works?  There's a weird bit in
cpupool_unassign_cpu_helper() where after calling cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the cpupool_free_cpus mask, even if it returns an error.  That can't be right, even for the existing -EAGAIN case, can it?

I see that you have a loop to retry this call several times in the next patch; but what if it fails every time -- what state is the system in?

And, in general, what happens if the device driver gets mixed up and forgets to unpin the vcpu?  Is the only recourse to reboot your host (or deal with the fact that you can't reconfigure your cpupools)?

 -George

Sorry, lost the original thread so replying at the top of mail chain.

+static XSM_INLINE int xsm_schedop_pin_temp(XSM_DEFAULT_VOID) 
+{ 
+ XSM_ASSERT_ACTION(XSM_PRIV); 
+ return xsm_default_action(action, current->domain, NULL); 
+}

Is the intention to restrict the hypercall usage to dom0 only?

Anshul Makkar

Jürgen Groß March 3, 2016, 5:31 a.m. UTC | #19
On 02/03/16 18:21, Anshul Makkar wrote:
> Hi,
> 
> 
> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of George Dunlap
> Sent: 01 March 2016 15:53
> To: Juergen Gross <jgross@suse.com>; xen-devel@lists.xen.org
> Cc: Wei Liu <wei.liu2@citrix.com>; Stefano Stabellini <Stefano.Stabellini@citrix.com>; George Dunlap <George.Dunlap@citrix.com>; Andrew Cooper <Andrew.Cooper3@citrix.com>; Dario Faggioli <dario.faggioli@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; David Vrabel <david.vrabel@citrix.com>; jbeulich@suse.com
> Subject: Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to temporarily pin a vcpu
> 
> On 01/03/16 09:02, Juergen Gross wrote:
>> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be 
>> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try to 
>> achieve this by pinning the running thread to cpu 0, but in Dom0 this 
>> is not enough: the vcpu must be pinned to physical cpu 0 via Xen, too.
>>
>> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op 
>> hypercall to achieve this. It is taking a physical cpu number as 
>> parameter. If pinning is possible (the calling domain has the 
>> privilege to make the call and the cpu is available in the domain's
>> cpupool) the calling vcpu is pinned to the specified cpu. The old cpu 
>> affinity is saved. To undo the temporary pinning a cpu -1 is 
>> specified. This will restore the original cpu affinity for the vcpu.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V2: - limit operation to hardware domain as suggested by Jan Beulich
>>     - some style issues corrected as requested by Jan Beulich
>>     - use fixed width types in interface as requested by Jan Beulich
>>     - add compat layer checking as requested by Jan Beulich
>> ---
>>  xen/common/compat/schedule.c |  4 ++
>>  xen/common/schedule.c        | 92 +++++++++++++++++++++++++++++++++++++++++---
>>  xen/include/public/sched.h   | 17 ++++++++
>>  xen/include/xlat.lst         |  1 +
>>  4 files changed, 109 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
>> index 812c550..73b0f01 100644
>> --- a/xen/common/compat/schedule.c
>> +++ b/xen/common/compat/schedule.c
>> @@ -10,6 +10,10 @@
>>  
>>  #define do_sched_op compat_sched_op
>>  
>> +#define xen_sched_pin_temp sched_pin_temp
>> +CHECK_sched_pin_temp;
>> +#undef xen_sched_pin_temp
>> +
>>  #define xen_sched_shutdown sched_shutdown
>>  CHECK_sched_shutdown;
>>  #undef xen_sched_shutdown
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index b0d4b18..653f852 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>>      struct scheduler *old_ops;
>>      void *old_domdata;
>>  
>> +    for_each_vcpu ( d, v )
>> +    {
>> +        if ( v->affinity_broken )
>> +            return -EBUSY;
>> +    }
>> +
>>      domdata = SCHED_OP(c->sched, alloc_domdata, d);
>>      if ( domdata == NULL )
>>          return -ENOMEM;
>> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>>              if ( cpumask_empty(&online_affinity) &&
>>                   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>>              {
>> +                if ( v->affinity_broken )
>> +                {
>> +                    /* The vcpu is temporarily pinned, can't move it. */
>> +                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
>> +                    ret = -EBUSY;
>> +                    break;
>> +                }
> 
> Does this mean that if the user closes the laptop lid while one of these drivers has vcpu0 pinned, that Xen will crash (see xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make sure that all temporary pins are removed before suspending?
> 
> Also -- have you actually tested the "cpupool move while pinned"
> functionality to make sure it actually works?  There's a weird bit in
> cpupool_unassign_cpu_helper() where after calling cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the cpupool_free_cpus mask, even if it returns an error.  That can't be right, even for the existing -EAGAIN case, can it?
> 
> I see that you have a loop to retry this call several times in the next patch; but what if it fails every time -- what state is the system in?
> 
> And, in general, what happens if the device driver gets mixed up and forgets to unpin the vcpu?  Is the only recourse to reboot your host (or deal with the fact that you can't reconfigure your cpupools)?
> 
>  -George
> 
> Sorry, lost the original thread so replying at the top of mail chain.
> 
> +static XSM_INLINE int xsm_schedop_pin_temp(XSM_DEFAULT_VOID) 
> +{ 
> + XSM_ASSERT_ACTION(XSM_PRIV); 
> + return xsm_default_action(action, current->domain, NULL); 
> +}
> 
> Is the intention to restrict the hypercall usage to dom0 only?

To be more precise: to the hardware domain (the patch snippet you are
referencing was part of V1 of the series; it no longer exists in V2).


Juergen

Patch

diff --git a/xen/common/compat/schedule.c b/xen/common/compat/schedule.c
index 812c550..73b0f01 100644
--- a/xen/common/compat/schedule.c
+++ b/xen/common/compat/schedule.c
@@ -10,6 +10,10 @@ 
 
 #define do_sched_op compat_sched_op
 
+#define xen_sched_pin_temp sched_pin_temp
+CHECK_sched_pin_temp;
+#undef xen_sched_pin_temp
+
 #define xen_sched_shutdown sched_shutdown
 CHECK_sched_shutdown;
 #undef xen_sched_shutdown
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b0d4b18..653f852 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -271,6 +271,12 @@  int sched_move_domain(struct domain *d, struct cpupool *c)
     struct scheduler *old_ops;
     void *old_domdata;
 
+    for_each_vcpu ( d, v )
+    {
+        if ( v->affinity_broken )
+            return -EBUSY;
+    }
+
     domdata = SCHED_OP(c->sched, alloc_domdata, d);
     if ( domdata == NULL )
         return -ENOMEM;
@@ -669,6 +675,14 @@  int cpu_disable_scheduler(unsigned int cpu)
             if ( cpumask_empty(&online_affinity) &&
                  cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
             {
+                if ( v->affinity_broken )
+                {
+                    /* The vcpu is temporarily pinned, can't move it. */
+                    vcpu_schedule_unlock_irqrestore(lock, flags, v);
+                    ret = -EBUSY;
+                    break;
+                }
+
                 if (system_state == SYS_STATE_suspend)
                 {
                     cpumask_copy(v->cpu_hard_affinity_saved,
@@ -752,14 +766,20 @@  static int vcpu_set_affinity(
     struct vcpu *v, const cpumask_t *affinity, cpumask_t *which)
 {
     spinlock_t *lock;
+    int ret = 0;
 
     lock = vcpu_schedule_lock_irq(v);
 
-    cpumask_copy(which, affinity);
+    if ( v->affinity_broken )
+        ret = -EBUSY;
+    else
+    {
+        cpumask_copy(which, affinity);
 
-    /* Always ask the scheduler to re-evaluate placement
-     * when changing the affinity */
-    set_bit(_VPF_migrating, &v->pause_flags);
+        /* Always ask the scheduler to re-evaluate placement
+         * when changing the affinity */
+        set_bit(_VPF_migrating, &v->pause_flags);
+    }
 
     vcpu_schedule_unlock_irq(lock, v);
 
@@ -771,7 +791,7 @@  static int vcpu_set_affinity(
         vcpu_migrate(v);
     }
 
-    return 0;
+    return ret;
 }
 
 int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity)
@@ -978,6 +998,51 @@  void watchdog_domain_destroy(struct domain *d)
         kill_timer(&d->watchdog_timer[i]);
 }
 
+static long do_pin_temp(int cpu)
+{
+    struct vcpu *v = current;
+    spinlock_t *lock;
+    long ret = -EINVAL;
+
+    lock = vcpu_schedule_lock_irq(v);
+
+    if ( cpu < 0 )
+    {
+        if ( v->affinity_broken )
+        {
+            cpumask_copy(v->cpu_hard_affinity, v->cpu_hard_affinity_saved);
+            v->affinity_broken = 0;
+            set_bit(_VPF_migrating, &v->pause_flags);
+            ret = 0;
+        }
+    }
+    else if ( cpu < nr_cpu_ids )
+    {
+        if ( v->affinity_broken )
+            ret = -EBUSY;
+        else if ( cpumask_test_cpu(cpu, VCPU2ONLINE(v)) )
+        {
+            cpumask_copy(v->cpu_hard_affinity_saved, v->cpu_hard_affinity);
+            v->affinity_broken = 1;
+            cpumask_copy(v->cpu_hard_affinity, cpumask_of(cpu));
+            set_bit(_VPF_migrating, &v->pause_flags);
+            ret = 0;
+        }
+    }
+
+    vcpu_schedule_unlock_irq(lock, v);
+
+    domain_update_node_affinity(v->domain);
+
+    if ( v->pause_flags & VPF_migrating )
+    {
+        vcpu_sleep_nosync(v);
+        vcpu_migrate(v);
+    }
+
+    return ret;
+}
+
 typedef long ret_t;
 
 #endif /* !COMPAT */
@@ -1087,6 +1152,23 @@  ret_t do_sched_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case SCHEDOP_pin_temp:
+    {
+        struct sched_pin_temp sched_pin_temp;
+
+        ret = -EFAULT;
+        if ( copy_from_guest(&sched_pin_temp, arg, 1) )
+            break;
+
+        ret = -EPERM;
+        if ( !is_hardware_domain(current->domain) )
+            break;
+
+        ret = do_pin_temp(sched_pin_temp.pcpu);
+
+        break;
+    }
+
     default:
         ret = -ENOSYS;
     }
diff --git a/xen/include/public/sched.h b/xen/include/public/sched.h
index 2219696..a0ce5a6 100644
--- a/xen/include/public/sched.h
+++ b/xen/include/public/sched.h
@@ -118,6 +118,17 @@ 
  * With id != 0 and timeout != 0, poke watchdog timer and set new timeout.
  */
 #define SCHEDOP_watchdog    6
+
+/*
+ * Temporarily pin the current vcpu to one physical cpu or undo that pinning.
+ * @arg == pointer to sched_pin_temp_t structure.
+ *
+ * Setting pcpu to -1 will undo a previous temporary pinning and restore the
+ * previous cpu affinity. The temporary aspect of the pinning isn't enforced
+ * by the hypervisor.
+ * This call is allowed for the hardware domain only.
+ */
+#define SCHEDOP_pin_temp    7
 /* ` } */
 
 struct sched_shutdown {
@@ -148,6 +159,12 @@  struct sched_watchdog {
 typedef struct sched_watchdog sched_watchdog_t;
 DEFINE_XEN_GUEST_HANDLE(sched_watchdog_t);
 
+struct sched_pin_temp {
+    int32_t pcpu;
+};
+typedef struct sched_pin_temp sched_pin_temp_t;
+DEFINE_XEN_GUEST_HANDLE(sched_pin_temp_t);
+
 /*
  * Reason codes for SCHEDOP_shutdown. These may be interpreted by control
  * software to determine the appropriate action. For the most part, Xen does
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index fda1137..52c7233 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -104,6 +104,7 @@ 
 ?	pmu_data			pmu.h
 ?	pmu_params			pmu.h
 !	sched_poll			sched.h
+?	sched_pin_temp			sched.h
 ?	sched_remote_shutdown		sched.h
 ?	sched_shutdown			sched.h
 ?	tmem_oid			tmem.h
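
For readers who want to check the return values of the new sub-op without booting Xen, the decision logic of do_pin_temp() above can be modelled in user space. The following is a rough sketch under stated assumptions: `struct model_vcpu`, `model_pin_temp()` and `MODEL_NR_CPUS` are hypothetical names, cpumasks are reduced to a 32-bit bitmask, the `online` field stands in for VCPU2ONLINE(v), and locking, node-affinity updates and vcpu migration are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8  /* hypothetical stand-in for nr_cpu_ids */

/* Hypothetical stand-in for the fields do_pin_temp() touches. */
struct model_vcpu {
    uint32_t hard_affinity;        /* bit n set: vcpu may run on pcpu n */
    uint32_t hard_affinity_saved;  /* affinity to restore on unpin */
    uint32_t online;               /* pcpus online in the vcpu's cpupool */
    int affinity_broken;           /* non-zero while temporarily pinned */
};

/* Same branch structure as do_pin_temp() in the patch, minus locking. */
static long model_pin_temp(struct model_vcpu *v, int cpu)
{
    long ret = -EINVAL;

    assert(v != NULL);

    if ( cpu < 0 )
    {
        /* Unpin: only valid if a temporary pin is active. */
        if ( v->affinity_broken )
        {
            v->hard_affinity = v->hard_affinity_saved;
            v->affinity_broken = 0;
            ret = 0;
        }
    }
    else if ( cpu < MODEL_NR_CPUS )
    {
        if ( v->affinity_broken )
            ret = -EBUSY;            /* pins do not nest */
        else if ( v->online & (UINT32_C(1) << cpu) )
        {
            v->hard_affinity_saved = v->hard_affinity;
            v->affinity_broken = 1;
            v->hard_affinity = UINT32_C(1) << cpu;
            ret = 0;
        }
    }

    return ret;
}
```

In this model an unpin without a prior pin, a cpu number at or above the limit, and a cpu outside the pool all yield -EINVAL, while a second pin attempt yields -EBUSY until the first pin is undone with a negative cpu number.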