diff mbox

xen: add steal_clock support on x86

Message ID 1463573758-11441-1-git-send-email-jgross@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jürgen Groß May 18, 2016, 12:15 p.m. UTC
With CONFIG_PARAVIRT_TIME_ACCOUNTING the kernel is capable to account
for time a thread wasn't able to run due to hypervisor scheduling.
Add support in Xen arch independent time handling for this feature by
moving it out of the arm arch into drivers/xen.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/arm/xen/enlighten.c | 17 ++---------------
 arch/x86/xen/time.c      |  2 ++
 drivers/xen/time.c       | 19 +++++++++++++++++++
 include/xen/xen-ops.h    |  1 +
 4 files changed, 24 insertions(+), 15 deletions(-)

Comments

Boris Ostrovsky May 18, 2016, 2:46 p.m. UTC | #1
On 05/18/2016 08:15 AM, Juergen Gross wrote:
>  }
>  
> +void __init xen_time_setup_guest(void)
> +{
> +	pv_time_ops.steal_clock = xen_steal_clock;
> +
> +	static_key_slow_inc(&paravirt_steal_enabled);
> +	/*
> +	 * We can't set paravirt_steal_rq_enabled as this would require the
> +	 * capability to read another cpu's runstate info.
> +	 */
> +}

Won't we be accounting for stolen cycles twice now --- once from
steal_account_process_tick()->steal_clock() and second time from
do_stolen_accounting()?

-boris
Jürgen Groß May 18, 2016, 2:53 p.m. UTC | #2
On 18/05/16 16:46, Boris Ostrovsky wrote:
> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>  }
>>  
>> +void __init xen_time_setup_guest(void)
>> +{
>> +	pv_time_ops.steal_clock = xen_steal_clock;
>> +
>> +	static_key_slow_inc(&paravirt_steal_enabled);
>> +	/*
>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>> +	 * capability to read another cpu's runstate info.
>> +	 */
>> +}
> 
> Won't we be accounting for stolen cycles twice now --- once from
> steal_account_process_tick()->steal_clock() and second time from
> do_stolen_accounting()?

Uuh, yes.

I guess I should rip do_stolen_accounting() out, too? It is a
Xen-specific hack, so I guess nobody will cry. Maybe it would be a
good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?


Juergen
Boris Ostrovsky May 18, 2016, 3:25 p.m. UTC | #3
On 05/18/2016 10:53 AM, Juergen Gross wrote:
> On 18/05/16 16:46, Boris Ostrovsky wrote:
>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>  }
>>>  
>>> +void __init xen_time_setup_guest(void)
>>> +{
>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>> +
>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>> +	/*
>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>> +	 * capability to read another cpu's runstate info.
>>> +	 */
>>> +}
>> Won't we be accounting for stolen cycles twice now --- once from
>> steal_account_process_tick()->steal_clock() and second time from
>> do_stolen_accounting()?
> Uuh, yes.
>
> I guess I should rip do_stolen_accounting() out, too? 

I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
that's indeed the case then we should ifndef do_stolen_accounting(). Or
maybe check for paravirt_steal_enabled.

-boris

> It is a
> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>
>
> Juergen
>
Jürgen Groß May 18, 2016, 3:42 p.m. UTC | #4
On 18/05/16 17:25, Boris Ostrovsky wrote:
> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>  }
>>>>  
>>>> +void __init xen_time_setup_guest(void)
>>>> +{
>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>> +
>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>> +	/*
>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>> +	 * capability to read another cpu's runstate info.
>>>> +	 */
>>>> +}
>>> Won't we be accounting for stolen cycles twice now --- once from
>>> steal_account_process_tick()->steal_clock() and second time from
>>> do_stolen_accounting()?
>> Uuh, yes.
>>
>> I guess I should rip do_stolen_accounting() out, too? 
> 
> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If

This is easy to accomplish. :-)

> that's indeed the case then we should ifndef do_stolen_accounting(). Or
> maybe check for paravirt_steal_enabled.

Is this really a sensible thing to do? There is a mechanism used by KVM
to do the stolen accounting. I think we should use it instead of having
a second implementation doing the same thing in case the generic one
isn't enabled.

Juergen
David Vrabel May 18, 2016, 3:45 p.m. UTC | #5
On 18/05/16 16:42, Juergen Gross wrote:
> On 18/05/16 17:25, Boris Ostrovsky wrote:
>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>  }
>>>>>  
>>>>> +void __init xen_time_setup_guest(void)
>>>>> +{
>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>> +
>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>> +	/*
>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>> +	 * capability to read another cpu's runstate info.
>>>>> +	 */
>>>>> +}
>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>> steal_account_process_tick()->steal_clock() and second time from
>>>> do_stolen_accounting()?
>>> Uuh, yes.
>>>
>>> I guess I should rip do_stolen_accounting() out, too? 
>>
>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
> 
> This is easy to accomplish. :-)
> 
>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>> maybe check for paravirt_steal_enabled.
> 
> Is this really a sensible thing to do? There is a mechanism used by KVM
> to do the stolen accounting. I think we should use it instead of having
> a second implementation doing the same thing in case the generic one
> isn't enabled.

I agree.

Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
don't think it's essential (or is it?).

David
Jürgen Groß May 18, 2016, 3:51 p.m. UTC | #6
On 18/05/16 17:45, David Vrabel wrote:
> On 18/05/16 16:42, Juergen Gross wrote:
>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>  }
>>>>>>  
>>>>>> +void __init xen_time_setup_guest(void)
>>>>>> +{
>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>> +
>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>> +	/*
>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>> +	 */
>>>>>> +}
>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>> do_stolen_accounting()?
>>>> Uuh, yes.
>>>>
>>>> I guess I should rip do_stolen_accounting() out, too? 
>>>
>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>>
>> This is easy to accomplish. :-)
>>
>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>> maybe check for paravirt_steal_enabled.
>>
>> Is this really a sensible thing to do? There is a mechanism used by KVM
>> to do the stolen accounting. I think we should use it instead of having
>> a second implementation doing the same thing in case the generic one
>> isn't enabled.
> 
> I agree.
> 
> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
> don't think it's essential (or is it?).

Not doing so will change behavior in case I rip out
do_stolen_accounting(). What about "default y if XEN" ?


Juergen
David Vrabel May 18, 2016, 3:52 p.m. UTC | #7
On 18/05/16 16:51, Juergen Gross wrote:
> On 18/05/16 17:45, David Vrabel wrote:
>> On 18/05/16 16:42, Juergen Gross wrote:
>>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>>  }
>>>>>>>  
>>>>>>> +void __init xen_time_setup_guest(void)
>>>>>>> +{
>>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>>> +
>>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>>> +	/*
>>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>>> +	 */
>>>>>>> +}
>>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>>> do_stolen_accounting()?
>>>>> Uuh, yes.
>>>>>
>>>>> I guess I should rip do_stolen_accounting() out, too? 
>>>>
>>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>>>
>>> This is easy to accomplish. :-)
>>>
>>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>>> maybe check for paravirt_steal_enabled.
>>>
>>> Is this really a sensible thing to do? There is a mechanism used by KVM
>>> to do the stolen accounting. I think we should use it instead of having
>>> a second implementation doing the same thing in case the generic one
>>> isn't enabled.
>>
>> I agree.
>>
>> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
>> don't think it's essential (or is it?).
> 
> Not doing so will change behavior in case I rip out
> do_stolen_accounting(). What about "default y if XEN" ?

Ok.

David
Boris Ostrovsky May 18, 2016, 3:53 p.m. UTC | #8
On 05/18/2016 11:45 AM, David Vrabel wrote:
> On 18/05/16 16:42, Juergen Gross wrote:
>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>  }
>>>>>>  
>>>>>> +void __init xen_time_setup_guest(void)
>>>>>> +{
>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>> +
>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>> +	/*
>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>> +	 */
>>>>>> +}
>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>> do_stolen_accounting()?
>>>> Uuh, yes.
>>>>
>>>> I guess I should rip do_stolen_accounting() out, too? 
>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>> This is easy to accomplish. :-)


I looked at KVM code (PARAVIRT_TIME_ACCOUNTING is not selected there
neither) and in their case that's presumably because stealing accounting
is a CPUID bit, i.e. it might not be supported. In Xen case we always
have this interface.


>>
>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>> maybe check for paravirt_steal_enabled.
>> Is this really a sensible thing to do? There is a mechanism used by KVM
>> to do the stolen accounting. I think we should use it instead of having
>> a second implementation doing the same thing in case the generic one
>> isn't enabled.
> I agree.
>
> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
> don't think it's essential (or is it?).

Looks like it's useful only if paravirt_steal_rq_enabled, which we don't
support yet.

-boris
Jürgen Groß May 18, 2016, 4 p.m. UTC | #9
On 18/05/16 17:53, Boris Ostrovsky wrote:
> On 05/18/2016 11:45 AM, David Vrabel wrote:
>> On 18/05/16 16:42, Juergen Gross wrote:
>>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>>  }
>>>>>>>  
>>>>>>> +void __init xen_time_setup_guest(void)
>>>>>>> +{
>>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>>> +
>>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>>> +	/*
>>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>>> +	 */
>>>>>>> +}
>>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>>> do_stolen_accounting()?
>>>>> Uuh, yes.
>>>>>
>>>>> I guess I should rip do_stolen_accounting() out, too? 
>>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>>> This is easy to accomplish. :-)
> 
> 
> I looked at KVM code (PARAVIRT_TIME_ACCOUNTING is not selected there
> neither) and in their case that's presumably because stealing accounting
> is a CPUID bit, i.e. it might not be supported. In Xen case we always
> have this interface.

So they added it later and the default is to keep the old behavior.

>>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>>> maybe check for paravirt_steal_enabled.
>>> Is this really a sensible thing to do? There is a mechanism used by KVM
>>> to do the stolen accounting. I think we should use it instead of having
>>> a second implementation doing the same thing in case the generic one
>>> isn't enabled.
>> I agree.
>>
>> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
>> don't think it's essential (or is it?).
> 
> Looks like it's useful only if paravirt_steal_rq_enabled, which we don't
> support yet.

I think the patch is still useful. It is reducing code size and
it is removing arch-specific Xen-hack(s). With the patch Xen's
solution for arm and x86 is common and the same as for KVM. Adding
paravirt_steal_rq_enabled later will be much easier as only one
function needs substantial modification.


Juergen
Dario Faggioli May 18, 2016, 4:10 p.m. UTC | #10
On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
> On 18/05/16 16:46, Boris Ostrovsky wrote:
> > 
> > Won't we be accounting for stolen cycles twice now --- once from
> > steal_account_process_tick()->steal_clock() and second time from
> > do_stolen_accounting()?
> Uuh, yes.
> 
> I guess I should rip do_stolen_accounting() out, too? It is a
> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
> 
So, config options aside, if I understand this correctly, it looks like
we were actually already doing steal time accounting, although in a
non-standard way.

And yet, people seem to have issues relating to lack of (proper?) steal
time accounting (Cc-ing Tony).

I guess this means that, either:
 - the issue being reported is actually not caused by the lack of 
   steal time accounting,
 - our current (Xen specific) steal time accounting solution is flawed,
 - the issue is caused by the lack of the bit of steal time accounting
   that we do not support yet,
 - other ideas? Tony?

Dario
Boris Ostrovsky May 18, 2016, 4:13 p.m. UTC | #11
On 05/18/2016 12:00 PM, Juergen Gross wrote:
> On 18/05/16 17:53, Boris Ostrovsky wrote:
>> On 05/18/2016 11:45 AM, David Vrabel wrote:
>>> On 18/05/16 16:42, Juergen Gross wrote:
>>>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +void __init xen_time_setup_guest(void)
>>>>>>>> +{
>>>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>>>> +
>>>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>>>> +	/*
>>>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>>>> +	 */
>>>>>>>> +}
>>>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>>>> do_stolen_accounting()?
>>>>>> Uuh, yes.
>>>>>>
>>>>>> I guess I should rip do_stolen_accounting() out, too? 
>>>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>>>> This is easy to accomplish. :-)
>>
>> I looked at KVM code (PARAVIRT_TIME_ACCOUNTING is not selected there
>> neither) and in their case that's presumably because stealing accounting
>> is a CPUID bit, i.e. it might not be supported. In Xen case we always
>> have this interface.
> So they added it later and the default is to keep the old behavior.
>
>>>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>>>> maybe check for paravirt_steal_enabled.
>>>> Is this really a sensible thing to do? There is a mechanism used by KVM
>>>> to do the stolen accounting. I think we should use it instead of having
>>>> a second implementation doing the same thing in case the generic one
>>>> isn't enabled.
>>> I agree.
>>>
>>> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
>>> don't think it's essential (or is it?).
>> Looks like it's useful only if paravirt_steal_rq_enabled, which we don't
>> support yet.
> I think the patch is still useful. It is reducing code size and
> it is removing arch-specific Xen-hack(s). With the patch Xen's
> solution for arm and x86 is common and the same as for KVM. Adding
> paravirt_steal_rq_enabled later will be much easier as only one
> function needs substantial modification.

I am not arguing against having a patch that will remove
do_stolen_accounting(). I was responding to David's statement about
whether we need to select CONFIG_PARAVIRT_TIME_ACCOUNTING, and I am not
sure this is necessary since steal_account_process_tick() (that will
take case of things that do_stolen_accounting() currently does) doesn't
need it.

(And if it is indeed needed --- can we have Xen's Kconfig select it
instead of "default y if XEN" ?)

-boris
Boris Ostrovsky May 18, 2016, 5:20 p.m. UTC | #12
On 05/18/2016 12:10 PM, Dario Faggioli wrote:
> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>  
>>> Won't we be accounting for stolen cycles twice now --- once from
>>> steal_account_process_tick()->steal_clock() and second time from
>>> do_stolen_accounting()?
>> Uuh, yes.
>>
>> I guess I should rip do_stolen_accounting() out, too? It is a
>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>
> So, config options aside, if I understand this correctly, it looks like
> we were actually already doing steal time accounting, although in a
> non-standard way.
>
> And yet, people seem to have issues relating to lack of (proper?) steal
> time accounting (Cc-ing Tony).
>
> I guess this means that, either:
>  - the issue being reported is actually not caused by the lack of 
>    steal time accounting,
>  - our current (Xen specific) steal time accounting solution is flawed,
>  - the issue is caused by the lack of the bit of steal time accounting
>    that we do not support yet,

I believe it's this one.

Tony narrowed the problem down to update_curr() where vruntime is
calculated, based on runqueue's clock_task value. That value is computed
in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.

-boris

>  - other ideas? Tony?
>
> Dario
Tony S May 18, 2016, 5:59 p.m. UTC | #13
On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 05/18/2016 12:10 PM, Dario Faggioli wrote:
>> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>
>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>> steal_account_process_tick()->steal_clock() and second time from
>>>> do_stolen_accounting()?
>>> Uuh, yes.
>>>
>>> I guess I should rip do_stolen_accounting() out, too? It is a
>>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>>
>> So, config options aside, if I understand this correctly, it looks like
>> we were actually already doing steal time accounting, although in a
>> non-standard way.
>>
>> And yet, people seem to have issues relating to lack of (proper?) steal
>> time accounting (Cc-ing Tony).
>>
>> I guess this means that, either:
>>  - the issue being reported is actually not caused by the lack of
>>    steal time accounting,
>>  - our current (Xen specific) steal time accounting solution is flawed,
>>  - the issue is caused by the lack of the bit of steal time accounting
>>    that we do not support yet,
>
> I believe it's this one.
>
> Tony narrowed the problem down to update_curr() where vruntime is
> calculated, based on runqueue's clock_task value. That value is computed
> in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
>

Hi Boris,

You are right.

The real problem is steal_clock in pv_time_ops is implemented in KVM
but not in Xen.

arch/x86/include/asm/paravirt_types.h
struct pv_time_ops {
        unsigned long long (*sched_clock)(void);
        unsigned long long (*steal_clock)(int cpu);
        unsigned long (*get_tsc_khz)(void);
};


(1) KVM implemented both sched_clock and steal_clock.

arch/x86/kernel/kvmclock.c
pv_time_ops.sched_clock = kvm_clock_read;

arch/x86/kernel/kvm.c
pv_time_ops.steal_clock = kvm_steal_clock;


(2) However, Xen just implemented sched_clock while the steal_clock is
still native_steal_clock(). The function native_steal_clock() just
simply return 0.

arch/x86/xen/time.c
.sched_clock = xen_clocksource_read;

arch/x86/kernel/paravirt.c
static u64 native_steal_clock(int cpu)
{
         return 0;
}


Therefore, even though update_rq_clock_task() calculates the value and
paravirt_steal_rq_enabled option is enabled, the steal value just
returns 0. This will cause the problem which I mentioned.

update_rq_clock_task
--> paravirt_steal_clock
--> pv_time_ops.steal_clock
--> native_steal_clock (if in Xen)
--> 0

The fundamental solution is to implement a steal_clock in Xen(learn
from KVM implementation) instead of using the native one.

Tony

> -boris
>
>>  - other ideas? Tony?
>>
>> Dario
>
>
Tony S May 18, 2016, 6:03 p.m. UTC | #14
On Wed, May 18, 2016 at 11:59 AM, Tony S <suokunstar@gmail.com> wrote:
> On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 05/18/2016 12:10 PM, Dario Faggioli wrote:
>>> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>
>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>> do_stolen_accounting()?
>>>> Uuh, yes.
>>>>
>>>> I guess I should rip do_stolen_accounting() out, too? It is a
>>>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>>>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>>>
>>> So, config options aside, if I understand this correctly, it looks like
>>> we were actually already doing steal time accounting, although in a
>>> non-standard way.
>>>
>>> And yet, people seem to have issues relating to lack of (proper?) steal
>>> time accounting (Cc-ing Tony).
>>>
>>> I guess this means that, either:
>>>  - the issue being reported is actually not caused by the lack of
>>>    steal time accounting,
>>>  - our current (Xen specific) steal time accounting solution is flawed,
>>>  - the issue is caused by the lack of the bit of steal time accounting
>>>    that we do not support yet,
>>
>> I believe it's this one.
>>
>> Tony narrowed the problem down to update_curr() where vruntime is
>> calculated, based on runqueue's clock_task value. That value is computed
>> in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
>>
>
> Hi Boris,
>
> You are right.
>
> The real problem is steal_clock in pv_time_ops is implemented in KVM
> but not in Xen.
>
> arch/x86/include/asm/paravirt_types.h
> struct pv_time_ops {
>         unsigned long long (*sched_clock)(void);
>         unsigned long long (*steal_clock)(int cpu);
>         unsigned long (*get_tsc_khz)(void);
> };
>
>
> (1) KVM implemented both sched_clock and steal_clock.
>
> arch/x86/kernel/kvmclock.c
> pv_time_ops.sched_clock = kvm_clock_read;
>
> arch/x86/kernel/kvm.c
> pv_time_ops.steal_clock = kvm_steal_clock;
>
>
> (2) However, Xen just implemented sched_clock while the steal_clock is
> still native_steal_clock(). The function native_steal_clock() just
> simply return 0.
>
> arch/x86/xen/time.c
> .sched_clock = xen_clocksource_read;
>
> arch/x86/kernel/paravirt.c
> static u64 native_steal_clock(int cpu)
> {
>          return 0;
> }
>
>
> Therefore, even though update_rq_clock_task() calculates the value and
> paravirt_steal_rq_enabled option is enabled, the steal value just
> returns 0. This will cause the problem which I mentioned.
>
> update_rq_clock_task
> --> paravirt_steal_clock
> --> pv_time_ops.steal_clock
> --> native_steal_clock (if in Xen)
> --> 0
>
> The fundamental solution is to implement a steal_clock in Xen(learn
> from KVM implementation) instead of using the native one.
>
> Tony
>

Also, I tried the latest long term version of Linux 4.4, this issue
still exists there. Hoping the next version can add this patch.

Tony


>> -boris
>>
>>>  - other ideas? Tony?
>>>
>>> Dario
>>
>>
Dario Faggioli May 19, 2016, 3:36 a.m. UTC | #15
On Wed, 2016-05-18 at 12:03 -0600, Tony S wrote:
> On Wed, May 18, 2016 at 11:59 AM, Tony S <suokunstar@gmail.com>
> wrote:
> > On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
> > <boris.ostrovsky@oracle.com> wrote:
> > > Tony narrowed the problem down to update_curr() where vruntime is
> > > calculated, based on runqueue's clock_task value. That value is
> > > computed
> > > in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
> > > 
> > Hi Boris,
> > 
> > You are right.
> > 
> > The real problem is steal_clock in pv_time_ops is implemented in
> > KVM
> > but not in Xen.
> > 
> > [...]
> >
> > Therefore, even though update_rq_clock_task() calculates the value
> > and
> > paravirt_steal_rq_enabled option is enabled, the steal value just
> > returns 0. This will cause the problem which I mentioned.
> > 
> > update_rq_clock_task
> > --> paravirt_steal_clock
> > --> pv_time_ops.steal_clock
> > --> native_steal_clock (if in Xen)
> > --> 0
> > 
Ok, thanks again for confirming this.

> Also, I tried the latest long term version of Linux 4.4, this issue
> still exists there. Hoping the next version can add this patch.
> 
Yes, as Juergen said here:

 http://lists.xen.org/archives/html/xen-devel/2016-05/msg01712.html

he has in his plans to implement the full mechanism. It will just take
a bit longer, due to the amount of work necessary for this second part.

Regards,
Dario
Jürgen Groß May 19, 2016, 5:33 a.m. UTC | #16
On 18/05/16 18:13, Boris Ostrovsky wrote:
> On 05/18/2016 12:00 PM, Juergen Gross wrote:
>> On 18/05/16 17:53, Boris Ostrovsky wrote:
>>> On 05/18/2016 11:45 AM, David Vrabel wrote:
>>>> On 18/05/16 16:42, Juergen Gross wrote:
>>>>> On 18/05/16 17:25, Boris Ostrovsky wrote:
>>>>>> On 05/18/2016 10:53 AM, Juergen Gross wrote:
>>>>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>>>> On 05/18/2016 08:15 AM, Juergen Gross wrote:
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +void __init xen_time_setup_guest(void)
>>>>>>>>> +{
>>>>>>>>> +	pv_time_ops.steal_clock = xen_steal_clock;
>>>>>>>>> +
>>>>>>>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>>>>>>>> +	/*
>>>>>>>>> +	 * We can't set paravirt_steal_rq_enabled as this would require the
>>>>>>>>> +	 * capability to read another cpu's runstate info.
>>>>>>>>> +	 */
>>>>>>>>> +}
>>>>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>>>>> do_stolen_accounting()?
>>>>>>> Uuh, yes.
>>>>>>>
>>>>>>> I guess I should rip do_stolen_accounting() out, too? 
>>>>>> I don't think PARAVIRT_TIME_ACCOUNTING is always selected for Xen. If
>>>>> This is easy to accomplish. :-)
>>>
>>> I looked at KVM code (PARAVIRT_TIME_ACCOUNTING is not selected there
>>> neither) and in their case that's presumably because stealing accounting
>>> is a CPUID bit, i.e. it might not be supported. In Xen case we always
>>> have this interface.
>> So they added it later and the default is to keep the old behavior.
>>
>>>>>> that's indeed the case then we should ifndef do_stolen_accounting(). Or
>>>>>> maybe check for paravirt_steal_enabled.
>>>>> Is this really a sensible thing to do? There is a mechanism used by KVM
>>>>> to do the stolen accounting. I think we should use it instead of having
>>>>> a second implementation doing the same thing in case the generic one
>>>>> isn't enabled.
>>>> I agree.
>>>>
>>>> Although I don't think selecting PARAVIRT_TIME_ACC' is necessary -- I
>>>> don't think it's essential (or is it?).
>>> Looks like it's useful only if paravirt_steal_rq_enabled, which we don't
>>> support yet.
>> I think the patch is still useful. It is reducing code size and
>> it is removing arch-specific Xen-hack(s). With the patch Xen's
>> solution for arm and x86 is common and the same as for KVM. Adding
>> paravirt_steal_rq_enabled later will be much easier as only one
>> function needs substantial modification.
> 
> I am not arguing against having a patch that will remove
> do_stolen_accounting(). I was responding to David's statement about
> whether we need to select CONFIG_PARAVIRT_TIME_ACCOUNTING, and I am not
> sure this is necessary since steal_account_process_tick() (that will
> take case of things that do_stolen_accounting() currently does) doesn't
> need it.

Aah, okay. That's a good reason to not add the Kconfig stuff.

> (And if it is indeed needed --- can we have Xen's Kconfig select it
> instead of "default y if XEN" ?)

I've verified that CONFIG_PARAVIRT_TIME_ACCOUNTING is _not_ needed.
I've removed it from .config and used my patch with
do_stolen_accounting() removed. In an overcommitted guest (4 vcpus on 2
physical cpus) running a parallel make top showed near 50% stolen time.


Juergen
diff mbox

Patch

diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 75cd734..9163b94 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -84,19 +84,6 @@  int xen_unmap_domain_gfn_range(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL_GPL(xen_unmap_domain_gfn_range);
 
-static unsigned long long xen_stolen_accounting(int cpu)
-{
-	struct vcpu_runstate_info state;
-
-	BUG_ON(cpu != smp_processor_id());
-
-	xen_get_runstate_snapshot(&state);
-
-	WARN_ON(state.state != RUNSTATE_running);
-
-	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
-}
-
 static void xen_read_wallclock(struct timespec64 *ts)
 {
 	u32 version;
@@ -355,8 +342,8 @@  static int __init xen_guest_init(void)
 
 	register_cpu_notifier(&xen_cpu_notifier);
 
-	pv_time_ops.steal_clock = xen_stolen_accounting;
-	static_key_slow_inc(&paravirt_steal_enabled);
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a0a4e55..f478169 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -431,6 +431,8 @@  static void __init xen_time_init(void)
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
 
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 7107842..6648a78 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -75,6 +75,15 @@  bool xen_vcpu_stolen(int vcpu)
 	return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
+static u64 xen_steal_clock(int cpu)
+{
+	struct vcpu_runstate_info state;
+
+	BUG_ON(cpu != smp_processor_id());
+	xen_get_runstate_snapshot(&state);
+	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
+}
+
 void xen_setup_runstate_info(int cpu)
 {
 	struct vcpu_register_runstate_memory_area area;
@@ -86,3 +95,13 @@  void xen_setup_runstate_info(int cpu)
 		BUG();
 }
 
+void __init xen_time_setup_guest(void)
+{
+	pv_time_ops.steal_clock = xen_steal_clock;
+
+	static_key_slow_inc(&paravirt_steal_enabled);
+	/*
+	 * We can't set paravirt_steal_rq_enabled as this would require the
+	 * capability to read another cpu's runstate info.
+	 */
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 86abe07..5ce51c2 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -21,6 +21,7 @@  void xen_resume_notifier_unregister(struct notifier_block *nb);
 
 bool xen_vcpu_stolen(int vcpu);
 void xen_setup_runstate_info(int cpu);
+void __init xen_time_setup_guest(void);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 
 int xen_setup_shutdown_event(void);