xen: privcmd: schedule() after private hypercall when non CONFIG_PREEMPT

Message ID 1417040805-15857-1-git-send-email-mcgrof@do-not-panic.com (mailing list archive)
State New, archived

Commit Message

Luis R. Rodriguez Nov. 26, 2014, 10:26 p.m. UTC
From: "Luis R. Rodriguez" <mcgrof@suse.com>

Some folks have reported that some Xen hypercalls take a long time
to complete when issued from the userspace privcmd ioctl mechanism.
This can happen, for instance, with hypercalls that have many
sub-operations, such as hypercalls that use the multicall feature,
whereby Xen lets one hypercall batch out a series of other hypercalls
on the hypervisor. At times such hypercalls can even end up triggering
the hung task check for TASK_UNINTERRUPTIBLE (default 120 seconds).
This is a non-issue on preemptible kernels, as the kernel may
deschedule such long-running tasks. Xen also supports preempting
multicalls through what it calls continuations (see xen commit
42217cbc5b which introduced this [0]). On systems without
CONFIG_PREEMPT though -- a kernel with voluntary or no preemption --
a long-running hypercall will not be descheduled until the hypercall
is complete and the ioctl returns to user space.
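
For context, the userspace side of this mechanism looks roughly as
follows. struct privcmd_hypercall and IOCTL_PRIVCMD_HYPERCALL are from
include/uapi/xen/privcmd.h; the helper, its arguments, and the lack of
error handling are purely illustrative:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <xen/privcmd.h>  /* struct privcmd_hypercall, IOCTL_PRIVCMD_HYPERCALL */

	/* Issue one hypercall through an open fd on /dev/xen/privcmd.
	 * The calling task stays inside the ioctl -- and thus inside the
	 * kernel -- until the whole hypercall, sub-operations included,
	 * has completed. */
	static long do_xen_hypercall(int privcmd_fd, uint64_t op,
				     uint64_t arg1, uint64_t arg2)
	{
		struct privcmd_hypercall hc = {
			.op  = op,
			.arg = { arg1, arg2 },
		};

		return ioctl(privcmd_fd, IOCTL_PRIVCMD_HYPERCALL, &hc);
	}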

To help with this David had originally implemented support for the use
of preempt_schedule_irq() [1] on non-CONFIG_PREEMPT kernels. That
solution never went upstream though, and upon review, while trying to
refactor it, I've concluded that usage of preempt_schedule_irq() would
be a bit abusive of existing APIs -- for a few reasons:

0) we want to avoid spreading its use on non CONFIG_PREEMPT kernels

1) we want to consider solutions that might work for other
   hypervisors with this same problem, and identify whether the issue
   is even present on other hypervisors or whether it is a
   self-inflicted architectural issue caused by the use of multicalls

2) there is no documentation or profiling of the exact hypercalls
   that were causing these issues, nor do we have any context
   to help evaluate this any further

I at least checked with kvm folks and it seems hypercall preemption
is not needed there. We can survey other hypervisors...

If 'something like preemption' is needed then CONFIG_PREEMPT should
just be enabled and encouraged; it seems we want to encourage
CONFIG_PREEMPT on Xen, especially when multicalls are used. In the
meantime this tries to provide a solution to help Xen on
non-CONFIG_PREEMPT kernels.

One option tested and evaluated was to put private hypercalls in
process context; however, this would introduce complexities, such as
originating hypercalls from different contexts. Current Xen hypercall
callback handlers would need to be changed per architecture, for
instance, and we'd also incur the cost of switching states from
user / kernel (this cost is also present if preempt_schedule_irq()
is used). There may be other issues which could be introduced with
this strategy as well. The simplest *shared* alternative is instead
to just explicitly schedule() at the end of a private hypercall on
non-preempt kernels. This forces our private hypercall mechanism to
try to be fair, only on non-CONFIG_PREEMPT kernels, at the cost of
more context switches, but keeps the private hypercall context intact.

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
[1] http://ftp.suse.com/pub/people/mcgrof/xen-preempt-hypercalls/0001-x86-xen-allow-privcmd-hypercalls-to-be-preempted.patch

Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <JGross@suse.com>
Cc: Olaf Hering <ohering@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
---
 drivers/xen/privcmd.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Jürgen Groß Nov. 27, 2014, 6:36 a.m. UTC | #1
On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>
[...]
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -60,6 +60,9 @@ static long privcmd_ioctl_hypercall(void __user *udata)
>   			   hypercall.arg[0], hypercall.arg[1],
>   			   hypercall.arg[2], hypercall.arg[3],
>   			   hypercall.arg[4]);
> +#ifndef CONFIG_PREEMPT
> +	schedule();
> +#endif
>
>   	return ret;
>   }
>

Sorry, I don't think this will solve anything. You're calling schedule()
right after the long running hypercall just nanoseconds before returning
to the user.

I suppose you were misled by the "int 0x82" in [0]. This is the
hypercall from the kernel into the hypervisor, e.g. inside of
privcmd_call().
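
For reference, privcmd_call() is roughly the following (simplified from
arch/x86/include/asm/xen/hypercall.h; other architectures differ in the
details). The hypercall instruction sits behind the asm, so a
long-running hypercall spends all of its time inside this call:

	static inline long
	privcmd_call(unsigned call,
		     unsigned long a1, unsigned long a2,
		     unsigned long a3, unsigned long a4,
		     unsigned long a5)
	{
		__HYPERCALL_DECLS;
		__HYPERCALL_5ARG(a1, a2, a3, a4, a5);

		/* Indirect call into the hypercall page; this is where the
		 * guest actually traps into Xen (e.g. via int 0x82). */
		asm volatile("call *%[call]"
			     : __HYPERCALL_5PARAM
			     : [call] "a" (&hypercall_page[call])
			     : __HYPERCALL_CLOBBER5);

		return (long)__res;
	}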


Juergen
Luis Chamberlain Nov. 27, 2014, 6:36 p.m. UTC | #2
On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>
[...]
>> +#ifndef CONFIG_PREEMPT
>> +	schedule();
>> +#endif
>>
>>   	return ret;
>>   }
>>
>
> Sorry, I don't think this will solve anything. You're calling schedule()
> right after the long running hypercall just nanoseconds before returning
> to the user.

Yeah, well that is what [1] tried as well, only it tried using
preempt_schedule_irq() on the hypercall callback...

> I suppose you were misled by the "int 0x82" in [0]. This is the
> hypercall from the kernel into the hypervisor, e.g. inside of
> privcmd_call().

Nope, you have to consider what was done in [1]; I was trying to
do something similar but less complex that didn't involve mucking
with the callbacks while also not abusing APIs.

I'm afraid we don't have much leg room.

  Luis
Luis R. Rodriguez Nov. 27, 2014, 6:46 p.m. UTC | #3
On Thu, Nov 27, 2014 at 1:36 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> I'm afraid we don't have much leg room.

Let me be clear: I still think putting some hypercalls in process
context *might help*, but because of notes 1) and 2) I highlighted,
I think this is the best we can do. With more information we should
be able to weigh the pros / cons of alternatives with actual metrics;
without more information we're just shooting in the dark, and the
last thing I want is to see APIs abused or precedents set.

  Luis
Andrew Cooper Nov. 27, 2014, 6:50 p.m. UTC | #4
On 27/11/14 18:36, Luis R. Rodriguez wrote:
> On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
>> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
>>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>>
[...]
>> Sorry, I don't think this will solve anything. You're calling schedule()
>> right after the long running hypercall just nanoseconds before returning
>> to the user.
> Yeah, well that is what [1] tried as well, only it tried using
> preempt_schedule_irq() on the hypercall callback...
>
>> I suppose you were misled by the "int 0x82" in [0]. This is the
>> hypercall from the kernel into the hypervisor, e.g. inside of
>> privcmd_call().
> Nope, you have to consider what was done in [1]; I was trying to
> do something similar but less complex that didn't involve mucking
> with the callbacks while also not abusing APIs.
>
> I'm afraid we don't have much leg room.

XenServer uses

https://github.com/xenserver/linux-3.x.pg/blob/master/master/0001-x86-xen-allow-privcmd-hypercalls-to-be-preempted.patch

to deal with these issues.  That patch is based on 3.10.

I can't remember whether this has been submitted upstream before (and
there were outstanding issues), or whether it fell at an inconvenient
time with our development cycles.

David: do you recall?

~Andrew

Jürgen Groß Nov. 28, 2014, 4:49 a.m. UTC | #5
On 11/27/2014 07:50 PM, Andrew Cooper wrote:
> On 27/11/14 18:36, Luis R. Rodriguez wrote:
>> On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
[...]
>
> XenServer uses
>
> https://github.com/xenserver/linux-3.x.pg/blob/master/master/0001-x86-xen-allow-privcmd-hypercalls-to-be-preempted.patch
>
> to deal with these issues.  That patch is based on 3.10.

Clever. :-)

>
>> I can't remember whether this has been submitted upstream before (and
> there were outstanding issues), or whether it fell at an inconvenient
> time with our development cycles.

I found

http://lists.xen.org/archives/html/xen-devel/2014-02/msg02540.html

and nothing else.


Juergen
Luis Chamberlain Nov. 28, 2014, 9:51 p.m. UTC | #6
On Thu, Nov 27, 2014 at 06:50:31PM +0000, Andrew Cooper wrote:
> On 27/11/14 18:36, Luis R. Rodriguez wrote:
> > On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
> >> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
> >>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
> >>>
[...]
> XenServer uses
> 
> https://github.com/xenserver/linux-3.x.pg/blob/master/master/0001-x86-xen-allow-privcmd-hypercalls-to-be-preempted.patch
> 
> to deal with these issues.  That patch is based on 3.10.
> 
> I can't remember whether this has been submitted upstream before (and
> there were outstanding issues), or whether it fell at an inconvenient
> time with our development cycles.
> 
> David: do you recall?

This was precisely the patch I reviewed in [1].

  Luis
David Vrabel Dec. 1, 2014, 11:01 a.m. UTC | #7
On 28/11/14 04:49, Juergen Gross wrote:
> On 11/27/2014 07:50 PM, Andrew Cooper wrote:
>> 
>> XenServer uses
>>
>> https://github.com/xenserver/linux-3.x.pg/blob/master/master/0001-x86-xen-allow-privcmd-hypercalls-to-be-preempted.patch
>>
>>
>> to deal with these issues.  That patch is based on 3.10.
> 
> Clever. :-)
> 
>>
>> I can't remember whether this has been submitted upstream before (and
>> there were outstanding issues), or whether it fell at an inconvenient
>> time with our development cycles.
> 
> I found
> 
> http://lists.xen.org/archives/html/xen-devel/2014-02/msg02540.html
> 
> and nothing else.

I dropped it because it copy-and-pasted a bunch of otherwise generic x86
assembler and looked unlikely to get an x86 maintainer ack.  If you
think otherwise, feel free to pick it up and run with it.

David
David Vrabel Dec. 1, 2014, 11:11 a.m. UTC | #8
On 27/11/14 18:36, Luis R. Rodriguez wrote:
> On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
>> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
>>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>>
[...]
>>> --- a/drivers/xen/privcmd.c
>>> +++ b/drivers/xen/privcmd.c
>>> @@ -60,6 +60,9 @@ static long privcmd_ioctl_hypercall(void __user *udata)
>>>   			   hypercall.arg[0], hypercall.arg[1],
>>>   			   hypercall.arg[2], hypercall.arg[3],
>>>   			   hypercall.arg[4]);
>>> +#ifndef CONFIG_PREEMPT
>>> +	schedule();
>>> +#endif

As Juergen points out, this does nothing.  You need to schedule while in
the middle of the hypercall.

Remember that Xen's hypercall preemption only preempts the hypercall to
run interrupts in the guest.
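
In rough terms, the shape of [1] is the following sketch (simplified,
not the patch itself): flag the region around the hypercall, and let the
interrupt return path reschedule while the flag is set:

	/* Per-cpu flag marking a preemptible privcmd hypercall region. */
	DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);

	static long privcmd_ioctl_hypercall(void __user *udata)
	{
		struct privcmd_hypercall hypercall;
		long ret;

		if (copy_from_user(&hypercall, udata, sizeof(hypercall)))
			return -EFAULT;

		xen_preemptible_hcall_begin();	/* flag := true */
		ret = privcmd_call(hypercall.op,
				   hypercall.arg[0], hypercall.arg[1],
				   hypercall.arg[2], hypercall.arg[3],
				   hypercall.arg[4]);
		xen_preemptible_hcall_end();	/* flag := false */

		return ret;
	}

On the return path of an interrupt that arrived while the hypercall was
being continued, the flag is checked and, if it is set and a reschedule
is pending, preempt_schedule_irq() is invoked even on CONFIG_PREEMPT=n
kernels.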

>>>
>>>   	return ret;
>>>   }
>>>
>>
>> Sorry, I don't think this will solve anything. You're calling schedule()
>> right after the long running hypercall just nanoseconds before returning
>> to the user.
> 
> Yeah, well that is what [1] tried as well, only it tried using
> preempt_schedule_irq() on the hypercall callback...

No.  My patch added a schedule point in the middle of a hypercall on the
return from an interrupt (e.g., the timer interrupt).

David
Luis Chamberlain Dec. 1, 2014, 1:32 p.m. UTC | #9
On Mon, Dec 01, 2014 at 11:01:18AM +0000, David Vrabel wrote:
[...]
> I dropped it because it copy-and-pasted a bunch of otherwise generic x86
> assembler and looked unlikely to get an x86 maintainer ack.  If you
> think otherwise, feel free to pick it up and run with it.

I was trying to run with it, but my biggest gripe with this was
the use of preempt_schedule_irq(); we can review that on the
other thread.

  Luis
Jürgen Groß Dec. 1, 2014, 2:42 p.m. UTC | #10
On 12/01/2014 02:32 PM, Luis R. Rodriguez wrote:
> On Mon, Dec 01, 2014 at 11:01:18AM +0000, David Vrabel wrote:
[...]
> I was trying to run with it, but my biggest gripe with this was
> the use of preempt_schedule_irq(); we can review that on the
> other thread.

I think this can be handled completely inside xen_evtchn_do_upcall().

xen_preemptible_hcall_begin() (possibly with another name) could take
a function pointer as a parameter, to be used as the continuation
point after an asynchronous interrupt in the critical section.
The parameter for that function would be the exception frame of the
original interrupt, making it possible to continue at the interrupted
position after e.g. calling schedule().
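
As a purely hypothetical sketch of that idea -- none of these names or
signatures exist, they are made up for illustration:

	/* Continuation invoked on return from an asynchronous interrupt
	 * that arrived inside the critical (hypercall) section. */
	typedef void (*xen_hcall_cont_t)(struct pt_regs *regs);

	static DEFINE_PER_CPU(xen_hcall_cont_t, xen_hcall_cont);

	void xen_preemptible_hcall_begin(xen_hcall_cont_t cont)
	{
		this_cpu_write(xen_hcall_cont, cont);
	}

	/* Called at the tail of xen_evtchn_do_upcall(), once all pending
	 * events have been handled: */
	static void xen_maybe_run_hcall_cont(struct pt_regs *regs)
	{
		xen_hcall_cont_t cont = this_cpu_read(xen_hcall_cont);

		if (cont && need_resched())
			cont(regs);	/* may call schedule() and later resume
					 * at the interrupted position */
	}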


Juergen
Luis Chamberlain Dec. 1, 2014, 3:05 p.m. UTC | #11
On Mon, Dec 01, 2014 at 11:11:43AM +0000, David Vrabel wrote:
> On 27/11/14 18:36, Luis R. Rodriguez wrote:
[...]
> As Juergen points out, this does nothing.  You need to schedule while in
> the middle of the hypercall.
> 
> Remember that Xen's hypercall preemption only preempts the hypercall to
> run interrupts in the guest.

How is it ensured that when the kernel preempts on this code path on
a CONFIG_PREEMPT=n kernel, only interrupts in the guest are run?

[...]
> > Yeah, well that is what [1] tried as well, only it tried using
> > preempt_schedule_irq() on the hypercall callback...
> 
> No.  My patch added a schedule point in the middle of a hypercall on the
> return from an interrupt (e.g., the timer interrupt).

OK, that provides much better context, and given that, I do see the above hunk
as pointless. I was completely misrepresenting what the callback was for. Now --
just to address my issues with the use of preempt_schedule_irq(): if the above
is addressed, that should address most of my concerns; if we can figure out a
way for this not to be arch-specific, that'd be neat, and if we could avoid
having to ifdef around stuff, even better.

  Luis
David Vrabel Dec. 1, 2014, 3:18 p.m. UTC | #12
On 01/12/14 15:05, Luis R. Rodriguez wrote:
> On Mon, Dec 01, 2014 at 11:11:43AM +0000, David Vrabel wrote:
[...]
>> As Juergen points out, this does nothing.  You need to schedule while in
>> the middle of the hypercall.
>>
>> Remember that Xen's hypercall preemption only preempts the hypercall to
>> run interrupts in the guest.
> 
> How is it ensured that when the kernel preempts on this code path on
> a CONFIG_PREEMPT=n kernel, only interrupts in the guest are run?

Sorry, I really didn't describe this very well.

If a hypercall needs a continuation, Xen returns to the guest with the
IP set to the hypercall instruction, and on the way back to the guest
Xen may schedule a different VCPU or it will do any upcalls (as per normal).

The guest is free to return from the upcall to the original task
(continuing the hypercall) or to a different one.
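
For reference, the hypervisor side of a continuation looks roughly like
the following, simplified from Xen's multicall handling (cf. xen commit
42217cbc5b, referenced as [0] earlier in the thread; the real code uses
guest handles and copies each entry):

	long do_multicall(multicall_entry_t *call_list, unsigned int nr_calls)
	{
	    unsigned int i;

	    for ( i = 0; i < nr_calls; i++ )
	    {
	        if ( hypercall_preempt_check() )
	            /* Don't finish now: leave the guest IP at the hypercall
	             * instruction and arrange to re-enter with only the
	             * remaining sub-operations. */
	            return hypercall_create_continuation(
	                __HYPERVISOR_multicall, "hi",
	                &call_list[i], nr_calls - i);

	        do_multicall_call(&call_list[i]);
	    }

	    return 0;
	}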

David
Luis Chamberlain Dec. 1, 2014, 3:44 p.m. UTC | #13
On Mon, Dec 1, 2014 at 10:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 01/12/14 15:05, Luis R. Rodriguez wrote:
>> On Mon, Dec 01, 2014 at 11:11:43AM +0000, David Vrabel wrote:
[...]
> Sorry, I really didn't describe this very well.
>
> If a hypercall needs a continuation, Xen returns to the guest with the
> IP set to the hypercall instruction, and on the way back to the guest
> Xen may schedule a different VCPU or it will do any upcalls (as per normal).
>
> The guest is free to return from the upcall to the original task
> (continuing the hypercall) or to a different one.

OK, so that addresses what Xen will do when using continuation and
hypercall preemption; my concern here was that using
preempt_schedule_irq() on CONFIG_PREEMPT=n kernels in the middle of a
hypercall, on the return from an interrupt (e.g., the timer interrupt),
would still let the kernel preempt to tasks other than those related
to Xen.

My gripe was that if this was being done it'd be a bit abusive of the
API even if it's safe.

  Luis
Luis Chamberlain Dec. 1, 2014, 3:50 p.m. UTC | #14
On Mon, Dec 01, 2014 at 03:42:33PM +0100, Juergen Gross wrote:
> On 12/01/2014 02:32 PM, Luis R. Rodriguez wrote:
>> On Mon, Dec 01, 2014 at 11:01:18AM +0000, David Vrabel wrote:
[...]
>>> I dropped it because it copy-and-pasted a bunch of otherwise generic x86
>>> assembler and looked unlikely to get an x86 maintainer ack.  If you
>>> think otherwise, feel free to pick it up and run with it.
>>
>> I was trying to run with it, but my biggest gripe with this was
>> the use of preempt_schedule_irq(); we can review that on the
>> other thread.

So much for the other thread :)

> I think this can be handled completely inside xen_evtchn_do_upcall().
>
> xen_preemptible_hcall_begin() (possibly with another name) could take
> a function pointer as a parameter, to be used as the continuation
> point after an asynchronous interrupt in the critical section.

Oh so we'd only preempt to one specific task?

> The parameter for that function would be the exception frame of the
> original interrupt, making it possible to continue at the interrupted
> position after e.g. calling schedule().

Interesting... if reasonable, I wonder if this should be generalized.
Are there other CONFIG_PREEMPT=n use cases for such exemptions?

  Luis
David Vrabel Dec. 1, 2014, 3:54 p.m. UTC | #15
On 01/12/14 15:44, Luis R. Rodriguez wrote:
> On Mon, Dec 1, 2014 at 10:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
[...]
> 
> OK, so that addresses what Xen will do when using continuation and
> hypercall preemption; my concern here was that using
> preempt_schedule_irq() on CONFIG_PREEMPT=n kernels in the middle of a
> hypercall, on the return from an interrupt (e.g., the timer interrupt),
> would still let the kernel preempt to tasks other than those related
> to Xen.

Um.  Why would that be a problem?  We do want to switch to any task the
Linux scheduler thinks is best.

David
Luis Chamberlain Dec. 1, 2014, 4:19 p.m. UTC | #16
On Mon, Dec 01, 2014 at 03:54:24PM +0000, David Vrabel wrote:
> On 01/12/14 15:44, Luis R. Rodriguez wrote:
[...]
> > OK, so that addresses what Xen will do when using continuation and
> > hypercall preemption; my concern here was that using
> > preempt_schedule_irq() on CONFIG_PREEMPT=n kernels in the middle of a
> > hypercall, on the return from an interrupt (e.g., the timer interrupt),
> > would still let the kernel preempt to tasks other than those related
> > to Xen.
> 
> Um.  Why would that be a problem?  We do want to switch to any task the
> Linux scheduler thinks is best.

It's safe, but it technically is doing kernel preemption, unless we want
to adjust the definition of CONFIG_PREEMPT=n to exclude hypercalls. This
was my original concern with the use of preempt_schedule_irq() to do this.
I am afraid of setting precedents without clarity or wider review and
acceptance.

  Luis
Jürgen Groß Dec. 1, 2014, 5:07 p.m. UTC | #17
On 12/01/2014 05:19 PM, Luis R. Rodriguez wrote:
> On Mon, Dec 01, 2014 at 03:54:24PM +0000, David Vrabel wrote:
[...]
>>
>> Um.  Why would that be a problem?  We do want to switch to any task the
>> Linux scheduler thinks is best.
>
> It's safe, but it is technically kernel preemption, unless we want to
> adjust the definition of CONFIG_PREEMPT=n to exclude hypercalls. This
> was my original concern with using preempt_schedule_irq() for this.
> I am afraid of setting a precedent without being explicit about it and
> without wider review and acceptance.

I wonder whether it would be more acceptable to add (or completely
switch to) another preemption model: PREEMPT_SWITCHABLE. This would be
similar to CONFIG_PREEMPT, but the "normal" value of __preempt_count
would be settable via kernel parameter (default 2):

0: preempt
1: preempt_voluntary
2: preempt_none

The kernel would run with preemption enabled. cond_resched() would
reschedule if __preempt_count <= 1. And in case of long-running kernel
activities (like the hypercall case or other stuff requiring schedule()
calls to avoid hangups) we would just set __preempt_count to 0 during
these periods and restore the old value afterwards.
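
To make this concrete, here is a rough sketch of how privcmd could use
such a mechanism. preempt_mode_set() and preempt_mode_restore() are
invented names purely for illustration; no such helpers exist today:

/*
 * Illustrative sketch only: preempt_mode_set()/preempt_mode_restore()
 * are hypothetical helpers for the proposed PREEMPT_SWITCHABLE model.
 */
static long privcmd_ioctl_hypercall(void __user *udata)
{
	struct privcmd_hypercall hypercall;
	long ret;
	int old;

	if (copy_from_user(&hypercall, udata, sizeof(hypercall)))
		return -EFAULT;

	/* Run fully preemptible (mode 0) while the hypercall runs... */
	old = preempt_mode_set(0);

	ret = privcmd_call(hypercall.op,
			   hypercall.arg[0], hypercall.arg[1],
			   hypercall.arg[2], hypercall.arg[3],
			   hypercall.arg[4]);

	/* ...then restore the boot-selected model (default 2: none). */
	preempt_mode_restore(old);

	return ret;
}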

This would be a rather intrusive but clean change IMO.

Any thoughts?


Juergen
Luis Chamberlain Dec. 1, 2014, 5:52 p.m. UTC | #18
On Mon, Dec 01, 2014 at 06:07:48PM +0100, Juergen Gross wrote:
> On 12/01/2014 05:19 PM, Luis R. Rodriguez wrote:
>> On Mon, Dec 01, 2014 at 03:54:24PM +0000, David Vrabel wrote:
>>> On 01/12/14 15:44, Luis R. Rodriguez wrote:
>>>> On Mon, Dec 1, 2014 at 10:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>>>>> On 01/12/14 15:05, Luis R. Rodriguez wrote:
>>>>>> On Mon, Dec 01, 2014 at 11:11:43AM +0000, David Vrabel wrote:
>>>>>>> On 27/11/14 18:36, Luis R. Rodriguez wrote:
>>>>>>>> On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
>>>>>>>>> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
>>>>>>>>>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>>>>>>>>>
>>>>>>>>>> Some folks had reported that some xen hypercalls take a long time
>>>>>>>>>> to complete when issued from the userspace private ioctl mechanism,
>>>>>>>>>> this can happen for instance with some hypercalls that have many
>>>>>>>>>> sub-operations, this can happen for instance on hypercalls that use
>>>>>>> [...]
>>>>>>>>>> --- a/drivers/xen/privcmd.c
>>>>>>>>>> +++ b/drivers/xen/privcmd.c
>>>>>>>>>> @@ -60,6 +60,9 @@ static long privcmd_ioctl_hypercall(void __user *udata)
>>>>>>>>>>                               hypercall.arg[0], hypercall.arg[1],
>>>>>>>>>>                               hypercall.arg[2], hypercall.arg[3],
>>>>>>>>>>                               hypercall.arg[4]);
>>>>>>>>>> +#ifndef CONFIG_PREEMPT
>>>>>>>>>> + schedule();
>>>>>>>>>> +#endif
>>>>>>>
>>>>>>> As Juergen points out, this does nothing.  You need to schedule while in
>>>>>>> the middle of the hypercall.
>>>>>>>
>>>>>>> Remember that Xen's hypercall preemption only preempts the hypercall to
>>>>>>> run interrupts in the guest.
>>>>>>
>>>>>> How is it ensured that when the kernel preempts on this code path on
>>>>>> CONFIG_PREEMPT=n kernel that only interrupts in the guest are run?
>>>>>
>>>>> Sorry, I really didn't describe this very well.
>>>>>
>>>>> If a hypercall needs a continuation, Xen returns to the guest with the
>>>>> IP set to the hypercall instruction, and on the way back to the guest
>>>>> Xen may schedule a different VCPU or it will do any upcalls (as per normal).
>>>>>
>>>>> The guest is free to return from the upcall to the original task
>>>>> (continuing the hypercall) or to a different one.
>>>>
>>>> OK so that addresses what Xen will do when using continuation and
>>>> hypercall preemption, my concern here was that using
>>>> preempt_schedule_irq() on CONFIG_PREEMPT=n kernels in the middle of a
>>>> hypercall on the return from an interrupt (e.g., the timer interrupt)
>>>> would still let the kernel preempt to tasks other than those related
>>>> to Xen.
>>>
>>> Um.  Why would that be a problem?  We do want to switch to any task the
>>> Linux scheduler thinks is best.
>>
>> It's safe, but it is technically kernel preemption, unless we want to
>> adjust the definition of CONFIG_PREEMPT=n to exclude hypercalls. This
>> was my original concern with using preempt_schedule_irq() for this.
>> I am afraid of setting a precedent without being explicit about it and
>> without wider review and acceptance.
>
> I wonder whether it would be more acceptable to add (or completely
> switch to) another preemption model: PREEMPT_SWITCHABLE. This would be
> similar to CONFIG_PREEMPT, but the "normal" value of __preempt_count
> would be settable via kernel parameter (default 2):
>
> 0: preempt
> 1: preempt_voluntary
> 2: preempt_none
>
> The kernel would run with preemption enabled. cond_resched() would
> reschedule if __preempt_count <= 1. And in case of long-running kernel
> activities (like the hypercall case or other stuff requiring schedule()
> calls to avoid hangups) we would just set __preempt_count to 0 during
> these periods and restore the old value afterwards.
>
> This would be a rather intrusive but clean change IMO.
>
> Any thoughts?

I like the idea of dynamically changing the preemption model at run
time, and personally I find this reasonable; however, I am not certain
whether it would introduce a series of issues that are hard to address.
Thoughts from others who linger deep in the cold, lonely scheduler caves?

  Luis
David Vrabel Dec. 1, 2014, 6:16 p.m. UTC | #19
On 01/12/14 16:19, Luis R. Rodriguez wrote:
> On Mon, Dec 01, 2014 at 03:54:24PM +0000, David Vrabel wrote:
>> On 01/12/14 15:44, Luis R. Rodriguez wrote:
>>> On Mon, Dec 1, 2014 at 10:18 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> On 01/12/14 15:05, Luis R. Rodriguez wrote:
>>>>> On Mon, Dec 01, 2014 at 11:11:43AM +0000, David Vrabel wrote:
>>>>>> On 27/11/14 18:36, Luis R. Rodriguez wrote:
>>>>>>> On Thu, Nov 27, 2014 at 07:36:31AM +0100, Juergen Gross wrote:
>>>>>>>> On 11/26/2014 11:26 PM, Luis R. Rodriguez wrote:
>>>>>>>>> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>>>>>>>>>
>>>>>>>>> Some folks had reported that some xen hypercalls take a long time
>>>>>>>>> to complete when issued from the userspace private ioctl mechanism,
>>>>>>>>> this can happen for instance with some hypercalls that have many
>>>>>>>>> sub-operations, this can happen for instance on hypercalls that use
>>>>>> [...]
>>>>>>>>> --- a/drivers/xen/privcmd.c
>>>>>>>>> +++ b/drivers/xen/privcmd.c
>>>>>>>>> @@ -60,6 +60,9 @@ static long privcmd_ioctl_hypercall(void __user *udata)
>>>>>>>>>                              hypercall.arg[0], hypercall.arg[1],
>>>>>>>>>                              hypercall.arg[2], hypercall.arg[3],
>>>>>>>>>                              hypercall.arg[4]);
>>>>>>>>> +#ifndef CONFIG_PREEMPT
>>>>>>>>> + schedule();
>>>>>>>>> +#endif
>>>>>>
>>>>>> As Juergen points out, this does nothing.  You need to schedule while in
>>>>>> the middle of the hypercall.
>>>>>>
>>>>>> Remember that Xen's hypercall preemption only preempts the hypercall to
>>>>>> run interrupts in the guest.
>>>>>
>>>>> How is it ensured that when the kernel preempts on this code path on
>>>>> CONFIG_PREEMPT=n kernel that only interrupts in the guest are run?
>>>>
>>>> Sorry, I really didn't describe this very well.
>>>>
>>>> If a hypercall needs a continuation, Xen returns to the guest with the
>>>> IP set to the hypercall instruction, and on the way back to the guest
>>>> Xen may schedule a different VCPU or it will do any upcalls (as per normal).
>>>>
>>>> The guest is free to return from the upcall to the original task
>>>> (continuing the hypercall) or to a different one.
>>>
>>> OK so that addresses what Xen will do when using continuation and
>>> hypercall preemption, my concern here was that using
>>> preempt_schedule_irq() on CONFIG_PREEMPT=n kernels in the middle of a
>>> hypercall on the return from an interrupt (e.g., the timer interrupt)
>>> would still let the kernel preempt to tasks other than those related
>>> to Xen.
>>
>> Um.  Why would that be a problem?  We do want to switch to any task the
>> Linux scheduler thinks is best.
> 
> It's safe, but it is technically kernel preemption, unless we want to
> adjust the definition of CONFIG_PREEMPT=n to exclude hypercalls. This
> was my original concern with using preempt_schedule_irq() for this.
> I am afraid of setting a precedent without being explicit about it and
> without wider review and acceptance.

It's voluntary preemption at a well-defined point.  It's no different to
a cond_resched() call.
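
For comparison, this is the shape of such a well-defined reschedule
point in ordinary kernel code. cond_resched() is the real API; the
surrounding loop, process_one() and struct item are made up for
illustration:

/* Sketch: an explicit voluntary preemption point in a long loop. */
static void process_many(struct item *items, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		process_one(&items[i]);	/* hypothetical per-item work */
		cond_resched();		/* safe point where we may schedule() */
	}
}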

Note that we're not trying to fix this for the non-voluntary-preempt
kernels.

David

Patch

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 569a13b..e29edba 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -60,6 +60,9 @@  static long privcmd_ioctl_hypercall(void __user *udata)
 			   hypercall.arg[0], hypercall.arg[1],
 			   hypercall.arg[2], hypercall.arg[3],
 			   hypercall.arg[4]);
+#ifndef CONFIG_PREEMPT
+	schedule();
+#endif
 
 	return ret;
 }