Message ID: 1347307972.7332.78.camel@oc2024037011.ibm.com (mailing list archive)
State: New, archived
On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
> +	/*
> +	 * if the target task is not running, then only yield if the
> +	 * current task is in guest mode
> +	 */
> +	if (!(p_rq->curr->flags & PF_VCPU))
> +		goto out_irq;

This would make yield_to() only ever work on KVM. Not that I mind this
too much; it's a horrid thing, and making it less useful for (ab)use is
a good thing. Still, this probably wants a mention somewhere :-)
On 09/10/2012 04:19 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
>> +	/*
>> +	 * if the target task is not running, then only yield if the
>> +	 * current task is in guest mode
>> +	 */
>> +	if (!(p_rq->curr->flags & PF_VCPU))
>> +		goto out_irq;
>
> This would make yield_to() only ever work on KVM. Not that I mind this
> too much; it's a horrid thing, and making it less useful for (ab)use is
> a good thing. Still, this probably wants a mention somewhere :-)

Also, it would not preempt a non-kvm task, even if we need to do that to
boost a VCPU. I think the lines above should be dropped.
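(For context: PF_VCPU is the task flag that KVM toggles around guest
entry and exit, so only KVM vCPU threads carry it, and only while they
are executing guest code. A minimal paraphrase of the
kvm_guest_enter()/kvm_guest_exit() helpers of this era follows; it is a
sketch, not the exact upstream code:

	static inline void kvm_guest_enter(void)
	{
		/* vtime accounting and RCU housekeeping elided */
		current->flags |= PF_VCPU;	/* task is now executing guest code */
	}

	static inline void kvm_guest_exit(void)
	{
		/* vtime accounting elided */
		current->flags &= ~PF_VCPU;	/* back to ordinary host execution */
	}

Because only vCPU threads in guest mode set PF_VCPU, a non-KVM caller of
yield_to() whose target runqueue is currently running an ordinary task
would bail out at the proposed check, which is what the two comments
above are about.)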
On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
>>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
>>>> +{
>>>> +	if (!curr->sched_class->yield_to_task)
>>>> +		return false;
>>>> +
>>>> +	if (curr->sched_class != p->sched_class)
>>>> +		return false;
>>>
>>>
>>> Peter,
>>>
>>> Should we also add a check if the runq has a skip buddy (as pointed out
>>> by Raghu) and return if the skip buddy is already set.
>>
>> Oh right, I missed that suggestion.. the performance improvement went
>> from 81% to 139% using this, right?
>>
>> It might make more sense to keep that separate, outside of this
>> function, since its not a strict prerequisite.
>>
>>>>
>>>> +	if (task_running(p_rq, p) || p->state)
>>>> +		return false;
>>>> +
>>>> +	return true;
>>>> +}
>>
>>
>>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
>>> bool preempt)
>>>>  	rq = this_rq();
>>>>
>>>>  again:
>>>> +	/* optimistic test to avoid taking locks */
>>>> +	if (!__yield_to_candidate(curr, p))
>>>> +		goto out_irq;
>>>> +
>>
>> So add something like:
>>
>>	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
>>	if (p_rq->cfs_rq->skip)
>>		goto out_irq;
>>>
>>>
>>>>  	p_rq = task_rq(p);
>>>>  	double_rq_lock(rq, p_rq);
>>>
>>>
>> But I do have a question on this optimization though,.. Why do we check
>> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>>
>> That is, I'd like to see this thing explained a little better.
>>
>> Does it go something like: p_rq is the runqueue of the task we'd like to
>> yield to, rq is our own, they might be the same. If we have a ->skip,
>> there's nothing we can do about it, OTOH p_rq having a ->skip and
>> failing the yield_to() simply means us picking the next VCPU thread,
>> which might be running on an entirely different cpu (rq) and could
>> succeed?
>
> Here's two new versions, both include a __yield_to_candidate(): "v3"
> uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> skip check.  Raghu, I am not sure if this is exactly what you want
> implemented in v4.
>

Andrew, yes, that is what I had. I think there was a misunderstanding.
My intention was: if a directed yield has already happened in a runqueue
(say rqA), do not bother to directed-yield to it again. But,
unfortunately, as PeterZ pointed out, that would have resulted in setting
the next buddy of a different runqueue than rqA.

So we can drop this "skip" idea. Pondering more over what to do... can we
use the next buddy itself? Thinking..
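(For readers following the buddy discussion: the "skip" and "next"
buddies are per-cfs_rq hints consumed by pick_next_entity(), and
yield_to() reaches them through sched_class::yield_to_task. A paraphrased
sketch of the fair-class hook as it looks around this time, simplified
and not the exact kernel/sched/fair.c code:

	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p,
				       bool preempt)
	{
		struct sched_entity *se = &p->se;

		/* target must be queued for the hints to mean anything */
		if (!se->on_rq)
			return false;

		/* ask pick_next_entity() to prefer p ... */
		set_next_buddy(se);

		/* ... and to skip the yielding task; this sets the skip buddy */
		yield_task_fair(rq);

		return true;
	}

Note the asymmetry: yield_task_fair(rq) marks the yielding task's own
cfs_rq with the skip hint, while set_next_buddy() marks the target's
cfs_rq, which is the subtlety behind the p_rq->cfs_rq->skip vs.
rq->cfs_rq->skip question above.)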
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> [...]
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
>
> Andrew, yes, that is what I had. I think there was a misunderstanding.
> My intention was: if a directed yield has already happened in a runqueue
> (say rqA), do not bother to directed-yield to it again. But,
> unfortunately, as PeterZ pointed out, that would have resulted in setting
> the next buddy of a different runqueue than rqA.
>
> So we can drop this "skip" idea. Pondering more over what to do... can we
> use the next buddy itself? Thinking..

FYI, I regretfully forgot to include your recent changes to
kvm_vcpu_on_spin in my tests (found in the kvm.git/next branch), so I am
going to get some results for that before I experiment any more on
3.6-rc.

-Andrew
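(The kvm_vcpu_on_spin() changes referred to here are the directed-yield
candidate filtering in kvm.git, where each vcpu tracks whether it is
itself spinning and whether it was eligible last time, so the PLE handler
skips poor yield targets. A paraphrased sketch from memory of that
series; the exact names and logic may differ:

	bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
	{
		bool eligible;

		/*
		 * A vcpu that is not itself spinning is always a good target;
		 * a spinning vcpu is only picked every other attempt, so the
		 * boost rotates among spinners instead of ping-ponging.
		 */
		eligible = !vcpu->spin_loop.in_spin_loop ||
			   vcpu->spin_loop.dy_eligible;

		if (vcpu->spin_loop.in_spin_loop)
			kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);

		return eligible;
	}

kvm_vcpu_on_spin() calls this for each candidate vcpu before attempting
kvm_vcpu_yield_to() on it.)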
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> [...]
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
>
> Andrew, yes, that is what I had. I think there was a misunderstanding.
> My intention was: if a directed yield has already happened in a runqueue
> (say rqA), do not bother to directed-yield to it again. But,
> unfortunately, as PeterZ pointed out, that would have resulted in setting
> the next buddy of a different runqueue than rqA.
>
> So we can drop this "skip" idea. Pondering more over what to do... can we
> use the next buddy itself? Thinking..

As I mentioned earlier today, I did not have your changes from the
kvm.git tree when I tested my changes. Here are your changes and my
changes compared (throughput in MB/sec):

  kvm_vcpu_on_spin changes:  4636 +/- 15.74%
  yield_to changes:          4515 +/- 12.73%

I would be inclined to stick with your changes, which are kept in kvm
code. I did try both combined, and did not get good results:

  both changes:              4074 +/- 19.12%

So, having both is probably not a good idea. However, I feel like there's
more work to be done. With no over-commit (10 VMs), total throughput is
23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a
reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x
over-commit have a total throughput roughly 10% less than 8-way VMs with
no over-commit (20 vs. 10 8-way VMs on an 80 cpu-thread host). We still
have what appear to be scalability problems, but now it's not so much in
the runqueue locks for yield_to() as in get_pid_task():

perf on host:

  32.10%   320131  qemu-system-x86  [kernel.kallsyms]  [k] get_pid_task
  11.60%   115686  qemu-system-x86  [kernel.kallsyms]  [k] _raw_spin_lock
  10.28%   102522  qemu-system-x86  [kernel.kallsyms]  [k] yield_to
   9.17%    91507  qemu-system-x86  [kvm]              [k] kvm_vcpu_on_spin
   7.74%    77257  qemu-system-x86  [kvm]              [k] kvm_vcpu_yield_to
   3.56%    35476  qemu-system-x86  [kernel.kallsyms]  [k] __srcu_read_lock
   3.00%    29951  qemu-system-x86  [kvm]              [k] __vcpu_run
   2.93%    29268  qemu-system-x86  [kvm_intel]        [k] vmx_vcpu_run
   2.88%    28783  qemu-system-x86  [kvm]              [k] vcpu_enter_guest
   2.59%    25827  qemu-system-x86  [kernel.kallsyms]  [k] __schedule
   1.40%    13976  qemu-system-x86  [kernel.kallsyms]  [k] _raw_spin_lock_irq
   1.28%    12823  qemu-system-x86  [kernel.kallsyms]  [k] resched_task
   1.14%    11376  qemu-system-x86  [kvm_intel]        [k] vmcs_writel
   0.85%     8502  qemu-system-x86  [kernel.kallsyms]  [k] pick_next_task_fair
   0.53%     5315  qemu-system-x86  [kernel.kallsyms]  [k] native_write_msr_safe
   0.46%     4553  qemu-system-x86  [kernel.kallsyms]  [k] native_load_tr_desc

get_pid_task() uses some RCU functions; I'm wondering how scalable this
is.... I tend to think of RCU as -not- having issues like this... Is
there an RCU stat/tracing tool which would help identify potential
problems?

-Andrew
On 09/11/2012 09:27 PM, Andrew Theurer wrote:
>
> So, having both is probably not a good idea. However, I feel like there's
> more work to be done. With no over-commit (10 VMs), total throughput is
> 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a
> reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x
> over-commit have a total throughput roughly 10% less than 8-way VMs with
> no over-commit (20 vs. 10 8-way VMs on an 80 cpu-thread host). We still
> have what appear to be scalability problems, but now it's not so much in
> the runqueue locks for yield_to() as in get_pid_task():
>
> perf on host:
>
>   32.10%   320131  qemu-system-x86  [kernel.kallsyms]  [k] get_pid_task
>   11.60%   115686  qemu-system-x86  [kernel.kallsyms]  [k] _raw_spin_lock
>   10.28%   102522  qemu-system-x86  [kernel.kallsyms]  [k] yield_to
>    9.17%    91507  qemu-system-x86  [kvm]              [k] kvm_vcpu_on_spin
>    7.74%    77257  qemu-system-x86  [kvm]              [k] kvm_vcpu_yield_to
> [...]
>
> get_pid_task() uses some RCU functions; I'm wondering how scalable this
> is.... I tend to think of RCU as -not- having issues like this... Is
> there an RCU stat/tracing tool which would help identify potential
> problems?

It's not RCU, it's the atomics + cache line bouncing. We're basically
guaranteed to bounce here.

Here we're finally paying for the ioctl()-based interface. A
syscall-based interface would have a 1:1 correspondence between vcpus and
tasks, so these games would be unnecessary.
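(For reference, the profiled path resolves the target vcpu's struct pid
to a task and takes a task reference before calling yield_to(); the
reference-count bump inside get_pid_task() is the atomic that every
spinning vcpu on every cpu hits, hence the bouncing. A paraphrased sketch
of virt/kvm/kvm_main.c and kernel/pid.c from this era, simplified and not
the exact code:

	struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
	{
		struct task_struct *result;

		rcu_read_lock();
		result = pid_task(pid, type);		/* RCU lookup, cheap */
		if (result)
			get_task_struct(result);	/* atomic inc of the task usage count: the bounce */
		rcu_read_unlock();
		return result;
	}

	int kvm_vcpu_yield_to(struct kvm_vcpu *target)
	{
		struct task_struct *task = NULL;
		struct pid *pid;
		int ret = 0;

		rcu_read_lock();
		pid = rcu_dereference(target->pid);	/* vcpu->pid is RCU-managed */
		if (pid)
			task = get_pid_task(pid, PIDTYPE_PID);
		rcu_read_unlock();
		if (!task)
			return ret;

		/* a task already running guest code does not need boosting */
		if (task->flags & PF_VCPU)
			goto out;

		ret = yield_to(task, 1);
	out:
		put_task_struct(task);	/* drop the reference from get_pid_task() */
		return ret;
	}

So the RCU read side is not what hurts; the shared reference count is.)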
v3:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/*
+	 * if the target task is not running, then only yield if the
+	 * current task is in guest mode
+	 */
+	if (!(p_rq->curr->flags & PF_VCPU))
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4901,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)

****************

v4:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if a yield is in progress, skip */
+	if (p_rq->cfs.skip)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4898,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)