Message ID | 20140728185803.GA24663@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jul 28, 2014 at 08:58:03PM +0200, Oleg Nesterov wrote:
> Off-topic, but...
>
> On 07/28, Oleg Nesterov wrote:
> >
> > But we should always call user_exit() unconditionally?
>
> Frederic, don't we need the patch below? In fact clear_() can be moved
> under "if ()" too. And probably copy_process() should clear this flag...
>
> Or. __context_tracking_task_switch() can simply do
>
>	if (context_tracking_cpu_is_enabled())
>		set_tsk_thread_flag(next, TIF_NOHZ);
>	else
>		clear_tsk_thread_flag(next, TIF_NOHZ);
>
> and then we can forget about copy_process(). Or I am totally confused?
>
> I am also wondering if we can extend user_return_notifier to handle
> enter/exit and kill TIF_NOHZ.
>
> Oleg.
>
> --- x/kernel/context_tracking.c
> +++ x/kernel/context_tracking.c
> @@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
>  					struct task_struct *next)
>  {
>  	clear_tsk_thread_flag(prev, TIF_NOHZ);
> -	set_tsk_thread_flag(next, TIF_NOHZ);
> +	if (context_tracking_cpu_is_enabled())
> +		set_tsk_thread_flag(next, TIF_NOHZ);
>  }
>
>  #ifdef CONFIG_CONTEXT_TRACKING_FORCE

Unfortunately, as long as tasks can migrate in and out of a context tracked CPU,
we need to track all CPUs.

This is because there is always a small shift between hard and soft kernelspace
boundaries. Hard boundaries are the real strict boundaries: "int", "iret" or
faulting instructions, for example. Soft boundaries are the places where we put
our context tracking probes. They are just function calls, and some distance
between them and the hard boundaries is inevitable.

So here is a scenario where this is a problem: a task runs on CPU 0, passes the
context tracking call before returning from a syscall to userspace, and gets an
interrupt. The interrupt preempts the task and it moves to CPU 1. So it returns
from preempt_schedule_irq(), after which it is going to resume to userspace.

In this scenario, if context tracking is only enabled on CPU 1, we have no way
to know that the task is resuming to userspace, because we passed through the
context tracking probe already and it was ignored on CPU 0.

This might be hackable by ensuring that irqs are disabled between the context
tracking calls and the actual returns to userspace. It's a nightmare to audit
on all archs though, it makes the context tracking callers less flexible, and
it only solves the issue for irqs. Exceptions have a similar problem, and we
can't mask them.
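To make the window concrete, here is an illustrative sketch of a syscall-exit
slow path. The names syscall_exit_slowpath() and return_to_user() are made up
for illustration; only the ordering of the probe and the hard boundary matters:

	/*
	 * Illustrative sketch only, not actual arch code: the "soft"
	 * probe fires some instructions before the "hard" boundary.
	 */
	void syscall_exit_slowpath(void)
	{
		user_enter();		/* soft boundary: context tracking probe */

		/*
		 * Window: an interrupt landing here can preempt the task,
		 * which may migrate and then resume to userspace on another
		 * CPU that never saw this probe fire.
		 */

		return_to_user();	/* hard boundary: sysret/iret */
	}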
On 07/28, Frederic Weisbecker wrote:
>
> On Mon, Jul 28, 2014 at 08:58:03PM +0200, Oleg Nesterov wrote:
> >
> > Frederic, don't we need the patch below? In fact clear_() can be moved
> > under "if ()" too. And probably copy_process() should clear this flag...
> >
> > Or. __context_tracking_task_switch() can simply do
> >
> >	if (context_tracking_cpu_is_enabled())
> >		set_tsk_thread_flag(next, TIF_NOHZ);
> >	else
> >		clear_tsk_thread_flag(next, TIF_NOHZ);
> >
> > and then we can forget about copy_process(). Or I am totally confused?
> >
> > I am also wondering if we can extend user_return_notifier to handle
> > enter/exit and kill TIF_NOHZ.
> >
> > Oleg.
> >
> > --- x/kernel/context_tracking.c
> > +++ x/kernel/context_tracking.c
> > @@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
> >  					struct task_struct *next)
> >  {
> >  	clear_tsk_thread_flag(prev, TIF_NOHZ);
> > -	set_tsk_thread_flag(next, TIF_NOHZ);
> > +	if (context_tracking_cpu_is_enabled())
> > +		set_tsk_thread_flag(next, TIF_NOHZ);
> >  }
> >
> >  #ifdef CONFIG_CONTEXT_TRACKING_FORCE
>
> Unfortunately, as long as tasks can migrate in and out of a context tracked CPU,
> we need to track all CPUs.

Thanks Frederic for your explanations. Yes, I was confused. But cough, now I am
even more confused.

I didn't even try to read this code, perhaps I'll try later, but let me ask
another question while you are here ;)

The comment above __context_tracking_task_switch() says:

 * The context tracking uses the syscall slow path to implement its user-kernel
 * boundaries probes on syscalls. This way it doesn't impact the syscall fast
 * path on CPUs that don't do context tracking.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

How? Every running task should have TIF_NOHZ set if context_tracking_is_enabled() ?

 * But we need to clear the flag on the previous task because it may later
 * migrate to some CPU that doesn't do the context tracking. As such the TIF
 * flag may not be desired there.

For what? How this can help? This flag will be set again when we switch to this
task again?

Looks like, we can kill context_tracking_task_switch() and simply change the
"__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
Then this flag will be propagated by copy_process().

Or I am totally confused? (quite possible).

> So here is a scenario where this is a problem: a task runs on CPU 0, passes the
> context tracking call before returning from a syscall to userspace, and gets an
> interrupt. The interrupt preempts the task and it moves to CPU 1. So it returns
> from preempt_schedule_irq(), after which it is going to resume to userspace.
>
> In this scenario, if context tracking is only enabled on CPU 1, we have no way
> to know that the task is resuming to userspace, because we passed through the
> context tracking probe already and it was ignored on CPU 0.

Thanks. But I still can't understand... So if we only track CPU 1, then in this
case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
on CPU 1.

Oleg.
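A note on why copy_process() would propagate the flag: the child's thread_info,
TIF_* bits included, is copied wholesale from the parent at fork. Roughly, from
include/linux/sched.h of that era (shown for reference; details vary by arch):

	static inline void setup_thread_stack(struct task_struct *p,
					      struct task_struct *org)
	{
		/* Copies the whole thread_info, so TIF_NOHZ comes along */
		*task_thread_info(p) = *task_thread_info(org);
		task_thread_info(p)->task = p;
	}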
On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> Thanks Frederic for your explanations. Yes, I was confused. But cough, now I am
> even more confused.
>
> I didn't even try to read this code, perhaps I'll try later, but let me ask
> another question while you are here ;)
>
> The comment above __context_tracking_task_switch() says:
>
>  * The context tracking uses the syscall slow path to implement its user-kernel
>  * boundaries probes on syscalls. This way it doesn't impact the syscall fast
>  * path on CPUs that don't do context tracking.
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Indeed, in fact the comment is confusing in the way it explains things. It
suggests that some CPUs may be doing context tracking while some others can
choose not to. It's rather all or nothing.

Actually TIF_NOHZ optimizes systems that have CONFIG_CONTEXT_TRACKING=y and that
don't need context tracking. In this case TIF_NOHZ is cleared and thus the
syscall fastpath has no overhead.

So I should rephrase it that way:

 * The context tracking uses the syscall slow path to implement its user-kernel
 * boundaries probes on syscalls. This way it doesn't impact the syscall fast
 * path when context tracking is globally disabled.

> How? Every running task should have TIF_NOHZ set if context_tracking_is_enabled() ?
>
>  * But we need to clear the flag on the previous task because it may later
>  * migrate to some CPU that doesn't do the context tracking. As such the TIF
>  * flag may not be desired there.
>
> For what? How this can help? This flag will be set again when we switch to this
> task again?

That is indeed a stale comment from an aborted early design.

> Looks like, we can kill context_tracking_task_switch() and simply change the
> "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> Then this flag will be propagated by copy_process().

Right, that would be much better. Good catch! context tracking is enabled from
tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.

I still think we need a for_each_process_thread() set as well though because some
kernel threads may well have been created at this stage already.

> Or I am totally confused? (quite possible).
>
> > So here is a scenario where this is a problem: a task runs on CPU 0, passes the
> > context tracking call before returning from a syscall to userspace, and gets an
> > interrupt. The interrupt preempts the task and it moves to CPU 1. So it returns
> > from preempt_schedule_irq(), after which it is going to resume to userspace.
> >
> > In this scenario, if context tracking is only enabled on CPU 1, we have no way
> > to know that the task is resuming to userspace, because we passed through the
> > context tracking probe already and it was ignored on CPU 0.
>
> Thanks. But I still can't understand... So if we only track CPU 1, then in this
> case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> on CPU 1.

I'm not sure I understand your question. Context tracking is either enabled
everywhere or nowhere.

I need to say though that there is a per CPU context tracking state named context_tracking.active.
It's confusing because it suggests that context tracking is active per CPU. Actually it's tracked
everywhere when globally enabled, but active determines if we call the RCU and vtime callbacks or
not.

So only nohz full CPUs have context_tracking.active set because only these need to call the RCU
and vtime callbacks. Other CPUs still do the context tracking but they won't call rcu and vtime
functions.

> Oleg.
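The active/state distinction Frederic describes maps onto two fields. A
condensed sketch, approximating the 3.16-era kernel/context_tracking.c (irq
disabling, the in_interrupt() check and tracing are omitted here):

	struct context_tracking {
		bool active;		/* run RCU/vtime callbacks? (nohz full CPUs only) */
		enum ctx_state {
			IN_KERNEL = 0,
			IN_USER,
		} state;		/* tracked on every CPU once globally enabled */
	};

	void context_tracking_user_enter(void)
	{
		if (__this_cpu_read(context_tracking.state) != IN_USER) {
			if (__this_cpu_read(context_tracking.active)) {
				vtime_user_enter(current);
				rcu_user_enter();
			}
			/* Note: state is updated even when !active */
			__this_cpu_write(context_tracking.state, IN_USER);
		}
	}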
On 07/30, Frederic Weisbecker wrote:
>
> On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> >
> > Looks like, we can kill context_tracking_task_switch() and simply change the
> > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > Then this flag will be propagated by copy_process().
>
> Right, that would be much better. Good catch! context tracking is enabled from
> tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.

actually init 1 task, but this doesn't matter.

> I still think we need a for_each_process_thread() set as well though because some
> kernel threads may well have been created at this stage already.

Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().

> > Or I am totally confused? (quite possible).
> >
> > > So here is a scenario where this is a problem: a task runs on CPU 0, passes the
> > > context tracking call before returning from a syscall to userspace, and gets an
> > > interrupt. The interrupt preempts the task and it moves to CPU 1. So it returns
> > > from preempt_schedule_irq(), after which it is going to resume to userspace.
> > >
> > > In this scenario, if context tracking is only enabled on CPU 1, we have no way
> > > to know that the task is resuming to userspace, because we passed through the
> > > context tracking probe already and it was ignored on CPU 0.
> >
> > Thanks. But I still can't understand... So if we only track CPU 1, then in this
> > case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> > on CPU 1.
>
> I'm not sure I understand your question.

Probably because it was stupid. Seriously, I still have no idea what this code
actually does.

> Context tracking is either enabled everywhere or
> nowhere.
>
> I need to say though that there is a per CPU context tracking state named context_tracking.active.
> It's confusing because it suggests that context tracking is active per CPU. Actually it's tracked
> everywhere when globally enabled, but active determines if we call the RCU and vtime callbacks or
> not.
>
> So only nohz full CPUs have context_tracking.active set because only these need to call the RCU
> and vtime callbacks. Other CPUs still do the context tracking but they won't call rcu and vtime
> functions.

I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
really make a difference, afaics.

Lets assume that context tracking is only enabled on CPU 1. To simplify,
assume that we have a single usermode task T which sleeps in kernel mode.

So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.

T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
context_tracking[0].state = IN_USER but otherwise does nothing else, this
CPU is not tracked and .active is false.

Right after local_irq_restore() this task can migrate to CPU_1 and finish
its ret-to-usermode path. But since it had already passed user_enter() we
do not change context_tracking[1].state and do not play with rcu/vtime.
(unless this task hits SCHEDULE_USER in asm).

The same for user_exit() of course.

Oleg.
On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> On 07/30, Frederic Weisbecker wrote:
> >
> > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > >
> > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > Then this flag will be propagated by copy_process().
> >
> > Right, that would be much better. Good catch! context tracking is enabled from
> > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
>
> actually init 1 task, but this doesn't matter.

Are you sure? It does matter because that would invalidate everything I understood
about init/main.c :)

I was convinced that the very first kernel init task is PID 0 then
it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
idle task of the boot CPU.

> > I still think we need a for_each_process_thread() set as well though because some
> > kernel threads may well have been created at this stage already.
>
> Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().

Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
or random kernel threads?

> I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> really make a difference, afaics.
>
> Lets assume that context tracking is only enabled on CPU 1. To simplify,
> assume that we have a single usermode task T which sleeps in kernel mode.
>
> So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
>
> T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> context_tracking[0].state = IN_USER but otherwise does nothing else, this
> CPU is not tracked and .active is false.
>
> Right after local_irq_restore() this task can migrate to CPU_1 and finish
> its ret-to-usermode path. But since it had already passed user_enter() we
> do not change context_tracking[1].state and do not play with rcu/vtime.
> (unless this task hits SCHEDULE_USER in asm).
>
> The same for user_exit() of course.

So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
such situation where CPU 1 has wrong context tracking. But global TIF_NOHZ
should enforce context tracking everywhere. And also it's less context switch
overhead.

> Oleg.
On 07/31, Frederic Weisbecker wrote:
>
> On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> > On 07/30, Frederic Weisbecker wrote:
> > >
> > > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > > >
> > > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > > Then this flag will be propagated by copy_process().
> > >
> > > Right, that would be much better. Good catch! context tracking is enabled from
> > > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> >
> > actually init 1 task, but this doesn't matter.
>
> Are you sure? It does matter because that would invalidate everything I understood
> about init/main.c :)

Sorry for confusion ;)

> I was convinced that the very first kernel init task is PID 0 then
> it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> idle task of the boot CPU.

Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
by "swapper". And we do not care about idle threads at all.

> > > I still think we need a for_each_process_thread() set as well though because some
> > > kernel threads may well have been created at this stage already.
> >
> > Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().
>
> Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
> or random kernel threads?

Sure, but we do not care. A kernel thread can never return to user space, it
must never call user_enter/exit().

> > I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> > really make a difference, afaics.
> >
> > Lets assume that context tracking is only enabled on CPU 1. To simplify,
> > assume that we have a single usermode task T which sleeps in kernel mode.
> >
> > So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> >
> > T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> > context_tracking[0].state = IN_USER but otherwise does nothing else, this
> > CPU is not tracked and .active is false.
> >
> > Right after local_irq_restore() this task can migrate to CPU_1 and finish
> > its ret-to-usermode path. But since it had already passed user_enter() we
> > do not change context_tracking[1].state and do not play with rcu/vtime.
> > (unless this task hits SCHEDULE_USER in asm).
> >
> > The same for user_exit() of course.
>
> So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
> such situation where CPU 1 has wrong context tracking.

OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
CPU_0, and CPU_1 can miss it anyway.

> But global TIF_NOHZ should enforce context tracking everywhere.

And this is what I can't understand. Lets return to my initial question, why
we can't change __context_tracking_task_switch()

	void __context_tracking_task_switch(struct task_struct *prev,
					    struct task_struct *next)
	{
		if (context_tracking_cpu_is_enabled())
			set_tsk_thread_flag(next, TIF_NOHZ);
		else
			clear_tsk_thread_flag(next, TIF_NOHZ);
	}

?

How can the global TIF_NOHZ help?

OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
comes to user_enter(). But this case is very unlikely, certainly this can't
explain why do we penalize the untracked CPU's ?

> And also it's
> less context switch overhead.

Why??? I think I have a blind spot here. Help!

And of course I can't understand exception_enter/exit(). Not to mention that
(afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
if we forget that the caller can migrate in between. Just because, once again,
a tracked CPU can miss user_exit().

So, why not

	static inline void exception_enter(void)
	{
		user_exit();
	}

	static inline void exception_exit(struct pt_regs *regs)
	{
		if (user_mode(regs))
			user_enter();
	}

?

Oleg.
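For reference, the save/restore pair Oleg is questioning looked roughly like
this at the time (condensed from include/linux/context_tracking.h, per the
commit Frederic cites below; a sketch, not the verbatim source):

	static inline enum ctx_state exception_enter(void)
	{
		enum ctx_state prev_ctx;

		if (!context_tracking_is_enabled())
			return 0;

		/* Remember what the tracker believed before the exception */
		prev_ctx = this_cpu_read(context_tracking.state);
		context_tracking_user_exit();

		return prev_ctx;
	}

	static inline void exception_exit(enum ctx_state prev_ctx)
	{
		if (context_tracking_is_enabled()) {
			/* Restore that belief, not what user_mode(regs) says */
			if (prev_ctx == IN_USER)
				context_tracking_user_enter();
		}
	}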
On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> > > On 07/30, Frederic Weisbecker wrote:
> > > >
> > > > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > > > Then this flag will be propagated by copy_process().
> > > >
> > > > Right, that would be much better. Good catch! context tracking is enabled from
> > > > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> > >
> > > actually init 1 task, but this doesn't matter.
> >
> > Are you sure? It does matter because that would invalidate everything I understood
> > about init/main.c :)
>
> Sorry for confusion ;)
>
> > I was convinced that the very first kernel init task is PID 0 then
> > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > idle task of the boot CPU.
>
> Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> by "swapper".

Are you sure? It's called from start_kernel() which is init/0.

But that's not the point. You're right that kernel threads don't care about it.
In fact we only need init/1 to get the flag and also call_usermodehelper(), good
point!

Maybe we should just enable it everywhere. Trying to spare the flag on specific
tasks will bring headaches for not much performance win. Kernel threads don't do
syscalls and seldom do exceptions, so they shouldn't hit context tracking that
much.

> > > > I still think we need a for_each_process_thread() set as well though because some
> > > > kernel threads may well have been created at this stage already.
> > >
> > > Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().
> >
> > Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
> > or random kernel threads?
>
> Sure, but we do not care. A kernel thread can never return to user space, it
> must never call user_enter/exit().

Yep, thanks for noticing that!

> > > I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> > > really make a difference, afaics.
> > >
> > > Lets assume that context tracking is only enabled on CPU 1. To simplify,
> > > assume that we have a single usermode task T which sleeps in kernel mode.
> > >
> > > So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> > >
> > > T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> > > context_tracking[0].state = IN_USER but otherwise does nothing else, this
> > > CPU is not tracked and .active is false.
> > >
> > > Right after local_irq_restore() this task can migrate to CPU_1 and finish
> > > its ret-to-usermode path. But since it had already passed user_enter() we
> > > do not change context_tracking[1].state and do not play with rcu/vtime.
> > > (unless this task hits SCHEDULE_USER in asm).
> > >
> > > The same for user_exit() of course.
> >
> > So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
> > such situation where CPU 1 has wrong context tracking.
>
> OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> CPU_0, and CPU_1 can miss it anyway.
>
> > But global TIF_NOHZ should enforce context tracking everywhere.
>
> And this is what I can't understand. Lets return to my initial question, why
> we can't change __context_tracking_task_switch()
>
>	void __context_tracking_task_switch(struct task_struct *prev,
>					    struct task_struct *next)
>	{
>		if (context_tracking_cpu_is_enabled())
>			set_tsk_thread_flag(next, TIF_NOHZ);
>		else
>			clear_tsk_thread_flag(next, TIF_NOHZ);
>	}
>
> ?

Well, we can change it to global TIF_NOHZ.

> How can the global TIF_NOHZ help?

It avoids that flag swap on task_switch.

> OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> comes to user_enter(). But this case is very unlikely, certainly this can't
> explain why do we penalize the untracked CPU's ?

It's rather that CPU 0 calls user_enter() and then migrates to CPU 1 and resumes
to userspace. It's unlikely but possible. I actually observed that very easily
on early testing.

And it's a big problem because then the CPU runs in userspace, possibly for a long while
in the HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
userspace.

It looks like a harmless race but it can have big consequences.

> > And also it's
> > less context switch overhead.
>
> Why???

Because calling context_tracking_task_switch() on every context switch is costly.

> I think I have a blind spot here. Help!
>
> And of course I can't understand exception_enter/exit(). Not to mention that
> (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> if we forget that the caller can migrate in between. Just because, once again,
> a tracked CPU can miss user_exit().

You lost me on this. How can a tracked CPU miss user_exit()?

> So, why not
>
>	static inline void exception_enter(void)
>	{
>		user_exit();
>	}
>
>	static inline void exception_exit(struct pt_regs *regs)
>	{
>		if (user_mode(regs))
>			user_enter();
>	}

That's how I implemented it first. But then I changed it to the way it is now:
6c1e0256fad84a843d915414e4b5973b7443d48d
("context_tracking: Restore correct previous context state on exception exit")

This is again due to the shift between hard and soft userspace boundaries.
user_mode(regs) checks hard boundaries only.

Lets get back to our beloved example:

         CPU 0                                  CPU 1
    ---------------------------------------------------------
    returning from syscall {
        user_enter();
        exception {
            exception_enter()
            PREEMPT!
                        ----------------------->
                                                // resume exception
                                                exception_exit();
                                                return to userspace

Here if we use user_mode(regs) from exception_exit(), we are screwed because the
task is in the dead zone between user_enter() and the actual hard return to
userspace. user_mode() thinks we are in the kernel, but from the context tracking
POV we are in userspace.

So we again risk running in userspace for an undetermined time, and RCU will
think we are in the kernel and disturb CPU 1 with IPIs to report quiescent
states. Also all the time spent in userspace will be accounted as kernelspace.

Yeah, the context tracking code gave me a lot of headaches :)
On 07/31, Frederic Weisbecker wrote:
>
> On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > On 07/31, Frederic Weisbecker wrote:
> > >
> > > I was convinced that the very first kernel init task is PID 0 then
> > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > idle task of the boot CPU.
> >
> > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > by "swapper".
>
> Are you sure? It's called from start_kernel() which is init/0.

But do_initcalls() is called by kernel_init(), this is the init process which is
going to exec /sbin/init later.

But this doesn't really matter,

> Maybe we should just enable it everywhere.

Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
of start_kernel(). The question is, I still can't understand why do we want to
have the global TIF_NOHZ.

> > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > CPU_0, and CPU_1 can miss it anyway.
> >
> > > But global TIF_NOHZ should enforce context tracking everywhere.
> >
> > And this is what I can't understand. Lets return to my initial question, why
> > we can't change __context_tracking_task_switch()
> >
> >	void __context_tracking_task_switch(struct task_struct *prev,
> >					    struct task_struct *next)
> >	{
> >		if (context_tracking_cpu_is_enabled())
> >			set_tsk_thread_flag(next, TIF_NOHZ);
> >		else
> >			clear_tsk_thread_flag(next, TIF_NOHZ);
> >	}
> >
> > ?
>
> Well, we can change it to global TIF_NOHZ.
>
> > How can the global TIF_NOHZ help?
>
> It avoids that flag swap on task_switch.

Ah, you probably meant that we can kill context_tracking_task_switch() as
we discussed.

But I meant another thing, TIF_NOHZ is already global because it is always
set after context_tracking_cpu_set().

Performance-wise, this set/clear code above can be better because it avoids
the slow paths on the untracked CPU's. But let's ignore this, the question is
why the change above is not correct? Or why it can make the things worse?

> > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > comes to user_enter(). But this case is very unlikely, certainly this can't
> > explain why do we penalize the untracked CPU's ?
>
> It's rather that CPU 0 calls user_enter() and then migrates to CPU 1 and resumes
> to userspace.

And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
the necessary user_enter().

> It's unlikely but possible. I actually observed that very easily on early testing.

Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
is wrong?

> And it's a big problem because then the CPU runs in userspace, possibly for a long while
> in the HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> userspace.
>
> It looks like a harmless race but it can have big consequences.

I see. Again, does the global TIF_NOHZ really help?

> > > And also it's
> > > less context switch overhead.
> >
> > Why???
>
> Because calling context_tracking_task_switch() on every context switch is costly.

See above. This depends, but forcing the slow paths on all CPU's can be more
costly.

> > And of course I can't understand exception_enter/exit(). Not to mention that
> > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > if we forget that the caller can migrate in between. Just because, once again,
> > a tracked CPU can miss user_exit().
>
> You lost me on this. How can a tracked CPU miss user_exit()?

I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
space leaving this cpu in IN_KERNEL state?

> > So, why not
> >
> >	static inline void exception_enter(void)
> >	{
> >		user_exit();
> >	}
> >
> >	static inline void exception_exit(struct pt_regs *regs)
> >	{
> >		if (user_mode(regs))
> >			user_enter();
> >	}
>
> That's how I implemented it first. But then I changed it to the way it is now:
> 6c1e0256fad84a843d915414e4b5973b7443d48d
> ("context_tracking: Restore correct previous context state on exception exit")
>
> This is again due to the shift between hard and soft userspace boundaries.
> user_mode(regs) checks hard boundaries only.
>
> Lets get back to our beloved example:
>
>          CPU 0                                  CPU 1
>     ---------------------------------------------------------
>     returning from syscall {
>         user_enter();
>         exception {
>             exception_enter()
>             PREEMPT!
>                         ----------------------->
>                                                 // resume exception
>                                                 exception_exit();
>                                                 return to userspace

OK, thanks, so in this case we miss user_enter().

But again, we can miss it anyway if preemption comes before "exception" ?
if CPU 1 was in IN_KERNEL state.

Oleg.
On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
>
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
>
> But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

> > Maybe we should just enable it everywhere.
>
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.

Because then the flag is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.

> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > >	void __context_tracking_task_switch(struct task_struct *prev,
> > >					    struct task_struct *next)
> > >	{
> > >		if (context_tracking_cpu_is_enabled())
> > >			set_tsk_thread_flag(next, TIF_NOHZ);
> > >		else
> > >			clear_tsk_thread_flag(next, TIF_NOHZ);
> > >	}
> > >
> > > ?
> >
> > Well, we can change it to global TIF_NOHZ.
>
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
>
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.

Yeah.

> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
>
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.

> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?

Which change above?

> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrates to CPU 1 and resumes
> > to userspace.
>
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in
userspace from the context tracking POV.

> > It's unlikely but possible. I actually observed that very easily on early testing.
>
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?

Which change?

> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in the HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
>
> I see. Again, does the global TIF_NOHZ really help?

Yes, to remove the context switch overhead. But it doesn't change anything on
those races.

> > Because calling context_tracking_task_switch() on every context switch is costly.
>
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.

Yeah but I don't have a safe solution that avoids that.

> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
>
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us
misunderstand what each other says. It's like we aren't talking about the same
thing :)

> > That's how I implemented it first. But then I changed it to the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> >          CPU 0                                  CPU 1
> >     ---------------------------------------------------------
> >     returning from syscall {
> >         user_enter();
> >         exception {
> >             exception_enter()
> >             PREEMPT!
> >                         ----------------------->
> >                                                 // resume exception
> >                                                 exception_exit();
> >                                                 return to userspace
>
> OK, thanks, so in this case we miss user_enter().
>
> But again, we can miss it anyway if preemption comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.
2014-07-31 20:47 GMT+02:00 Frederic Weisbecker <fweisbec@gmail.com>:
> On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
>> On 07/31, Frederic Weisbecker wrote:
>
> No, because preempt_schedule_irq() does the ctx_state save and restore with
> exception_enter/exception_exit.

Similar thing happens with schedule_user().

preempt_schedule_irq() handles kernel preemption and schedule_user() handles
the user preemption. In both cases we save and restore the context tracking
state.

This might be the piece you were missing :)
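Both paths wrap the reschedule in that save/restore pair. Roughly, from
kernel/sched/core.c of that era (condensed; comments and details trimmed):

	asmlinkage __visible void __sched schedule_user(void)
	{
		enum ctx_state prev_state = exception_enter();

		schedule();

		exception_exit(prev_state);
	}

	asmlinkage __visible void __sched preempt_schedule_irq(void)
	{
		enum ctx_state prev_state;

		/* Catch callers which need to be fixed */
		BUG_ON(preempt_count() || !irqs_disabled());

		prev_state = exception_enter();

		do {
			__preempt_count_add(PREEMPT_ACTIVE);
			local_irq_enable();
			__schedule();
			local_irq_disable();
			__preempt_count_sub(PREEMPT_ACTIVE);

			barrier();
		} while (need_resched());

		exception_exit(prev_state);
	}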
On 07/31, Frederic Weisbecker wrote:
>
> 2014-07-31 20:47 GMT+02:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> >> On 07/31, Frederic Weisbecker wrote:
> > No, because preempt_schedule_irq() does the ctx_state save and restore with
> > exception_enter/exception_exit.
>
> Similar thing happens with schedule_user().
>
> preempt_schedule_irq() handles kernel preemption and schedule_user() handles
> the user preemption. In both cases we save and restore the context tracking
> state.
>
> This might be the piece you were missing :)

YYYYYEEEEESSSS, thanks!! And in fact I was going to suggest adding this logic
into the preempt schedule paths to improve the situation if we can't make
TIF_NOHZ per-cpu.

But Frederic, perhaps I'll return here tomorrow with another question, it is
too late for me now ;)

Thanks!

Oleg.
On 07/31, Frederic Weisbecker wrote:
>
> On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> >
> > > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > > by "swapper".
> > >
> > > Are you sure? It's called from start_kernel() which is init/0.
> >
> > But do_initcalls() is called by kernel_init(), this is the init process which is
> > going to exec /sbin/init later.
> >
> > But this doesn't really matter,
>
> Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
> before initcalls.

Ah, indeed, and context_tracking_init() too. Even better, so we only need

--- x/kernel/context_tracking.c
+++ x/kernel/context_tracking.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(context_tracking_enabl
 DEFINE_PER_CPU(struct context_tracking, context_tracking);
 EXPORT_SYMBOL_GPL(context_tracking);

-void context_tracking_cpu_set(int cpu)
+void __init context_tracking_cpu_set(int cpu)
 {
+	/* Called by "swapper" thread, all threads will inherit this flag */
+	set_thread_flag(TIF_NOHZ);
 	if (!per_cpu(context_tracking.active, cpu)) {
 		per_cpu(context_tracking.active, cpu) = true;
 		static_key_slow_inc(&context_tracking_enabled);

and now we can kill context_tracking_task_switch() ?

> > Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> > of start_kernel(). The question is, I still can't understand why do we want to
> > have the global TIF_NOHZ.
>
> Because then the flag is inherited in forks. It's better than inheriting it on
> context switch due to context switch being called much more often than fork.

This is clear, that is why I suggested this. Just we didn't understand each other,
when I said "global TIF_NOHZ" I meant the current situation when every (running)
task has this bit set anyway. Sorry for confusion.

> No, because preempt_schedule_irq() does the ctx_state save and restore with
> exception_enter/exception_exit.

Thanks again. Can't understand how I managed to miss that exception_enter/exit
in preempt_schedule_*.

Damn. And after I spent more time, I don't have any idea how to make this
tracking cheaper.

Oleg.
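For context, the boot-time flow under this patch: start_kernel() runs in
"swapper" (PID 0), so a flag set there is inherited by kernel_init (PID 1) and
by every subsequent fork via copy_process(). A sketch approximating the
3.16-era kernel/time/tick-sched.c (surrounding setup elided):

	void __init tick_nohz_init(void)
	{
		int cpu;

		/* ... nohz_full mask allocation and housekeeping setup ... */

		for_each_cpu(cpu, tick_nohz_full_mask)
			context_tracking_cpu_set(cpu);	/* now also sets TIF_NOHZ once */

		/* ... */
	}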
On Sat, Aug 02, 2014 at 07:30:24PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> > >
> > > > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > > > by "swapper".
> > > >
> > > > Are you sure? It's called from start_kernel() which is init/0.
> > >
> > > But do_initcalls() is called by kernel_init(), this is the init process which is
> > > going to exec /sbin/init later.
> > >
> > > But this doesn't really matter,
> >
> > Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
> > before initcalls.
>
> Ah, indeed, and context_tracking_init() too. Even better, so we only need
>
> --- x/kernel/context_tracking.c
> +++ x/kernel/context_tracking.c
> @@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(context_tracking_enabl
>  DEFINE_PER_CPU(struct context_tracking, context_tracking);
>  EXPORT_SYMBOL_GPL(context_tracking);
>
> -void context_tracking_cpu_set(int cpu)
> +void __init context_tracking_cpu_set(int cpu)
>  {
> +	/* Called by "swapper" thread, all threads will inherit this flag */
> +	set_thread_flag(TIF_NOHZ);
>  	if (!per_cpu(context_tracking.active, cpu)) {
>  		per_cpu(context_tracking.active, cpu) = true;
>  		static_key_slow_inc(&context_tracking_enabled);
>
> and now we can kill context_tracking_task_switch() ?
>
> > > Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> > > of start_kernel(). The question is, I still can't understand why do we want to
> > > have the global TIF_NOHZ.
> >
> > Because then the flag is inherited in forks. It's better than inheriting it on
> > context switch due to context switch being called much more often than fork.
>
> This is clear, that is why I suggested this. Just we didn't understand each other,
> when I said "global TIF_NOHZ" I meant the current situation when every (running)
> task has this bit set anyway. Sorry for confusion.
>
> > No, because preempt_schedule_irq() does the ctx_state save and restore with
> > exception_enter/exception_exit.
>
> Thanks again. Can't understand how I managed to miss that exception_enter/exit
> in preempt_schedule_*.
>
> Damn. And after I spent more time, I don't have any idea how to make this
> tracking cheaper.

Mike Galbraith's profiles showed that timekeeping was one of the most expensive
operations. Would it make sense to have the option of statistical jiffy-based
accounting?

The idea would be to sample the jiffies counter at each context switch, and
charge the time to whoever happens to be running when the jiffies counter
increments.

							Thanx, Paul
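A purely hypothetical sketch of the statistical accounting Paul suggests; the
function, the per-CPU timestamp and the direct utime charge are all invented
for illustration (real code would go through the cputime accounting helpers
and split user/system time):

	static DEFINE_PER_CPU(unsigned long, last_switch_jiffies);

	/* Hypothetical: called from the context switch path, prev = outgoing task */
	static void account_statistical(struct task_struct *prev)
	{
		unsigned long now = jiffies;
		unsigned long *last = this_cpu_ptr(&last_switch_jiffies);

		if (now != *last) {
			/*
			 * Charge every jiffy that elapsed on this CPU to prev;
			 * the user/system split is glossed over here.
			 */
			prev->utime += jiffies_to_cputime(now - *last);
			*last = now;
		}
	}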
--- x/kernel/context_tracking.c
+++ x/kernel/context_tracking.c
@@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
 					struct task_struct *next)
 {
 	clear_tsk_thread_flag(prev, TIF_NOHZ);
-	set_tsk_thread_flag(next, TIF_NOHZ);
+	if (context_tracking_cpu_is_enabled())
+		set_tsk_thread_flag(next, TIF_NOHZ);
 }

 #ifdef CONFIG_CONTEXT_TRACKING_FORCE