
[v7,04/11] KVM: MMU: zap pages in batch

Message ID 1369252560-11611-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong May 22, 2013, 7:55 p.m. UTC
Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
caused by reacquiring the lock

After the patch, kvm_zap_obsolete_pages can always make forward progress,
so update the comments accordingly

[ It improves kernel building by 0.6% ~ 1% ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   35 +++++++++++------------------------
 1 files changed, 11 insertions(+), 24 deletions(-)

Comments

Marcelo Tosatti May 24, 2013, 8:34 p.m. UTC | #1
On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
> caused by reacquiring the lock
> 
> After the patch, kvm_zap_obsolete_pages can always make forward progress,
> so update the comments accordingly
> 
> [ It improves kernel building by 0.6% ~ 1% ]

Can you please describe the overhead in more detail? Under what scenario
is kernel building improved?

Xiao Guangrong May 27, 2013, 2:20 a.m. UTC | #2
On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
>> caused by reacquiring the lock
>>
>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
>> so update the comments accordingly
>>
>> [ It improves kernel building by 0.6% ~ 1% ]
> 
> Can you please describe the overhead in more detail? Under what scenario
> is kernel building improved?

Yes.

The scenario is: we do a kernel build and, meanwhile, repeatedly read the
PCI ROM once per second.

[
   echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
   cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
]

Marcelo Tosatti May 28, 2013, 12:18 a.m. UTC | #3
On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> > On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
> >> caused by reacquiring the lock
> >>
> >> After the patch, kvm_zap_obsolete_pages can always make forward progress,
> >> so update the comments accordingly
> >>
> >> [ It improves kernel building by 0.6% ~ 1% ]
> > 
> > Can you please describe the overhead in more detail? Under what scenario
> > is kernel building improved?
> 
> Yes.
> 
> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
> PCI ROM once per second.
> 
> [
>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> ]

I can't see why this reflects a real-world scenario (or a real-world
scenario with the same characteristics regarding kvm_mmu_zap_all vs.
faults).

The point is, it would be good to understand why this change improves
performance. What are the cases where we break out of kvm_mmu_zap_all
due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?
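
[ Context: the pre-patch loop, visible in the diff at the bottom of this
page, drops mmu-lock as soon as it is contended or a reschedule is pending,
no matter how few pages have been zapped:

	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
		/* commit what was zapped so far, then let others in */
		kvm_mmu_commit_zap_page(kvm, &invalid_list);
		cond_resched_lock(&kvm->mmu_lock);
		goto restart;
	}

The patch takes this break-out path only after at least BATCH_ZAP_PAGES
pages have been zapped since the last break. ]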



Xiao Guangrong May 28, 2013, 3:02 p.m. UTC | #4
On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
>>>> caused by reacquiring the lock
>>>>
>>>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
>>>> so update the comments accordingly
>>>>
>>>> [ It improves kernel building by 0.6% ~ 1% ]
>>>
>>> Can you please describe the overhead in more detail? Under what scenario
>>> is kernel building improved?
>>
>> Yes.
>>
>> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
>> PCI ROM once per second.
>>
>> [
>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>> ]
> 
> I can't see why this reflects a real-world scenario (or a real-world
> scenario with the same characteristics regarding kvm_mmu_zap_all vs.
> faults).
> 
> The point is, it would be good to understand why this change improves
> performance. What are the cases where we break out of kvm_mmu_zap_all
> due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?

When the guest reads the ROM, QEMU remaps the memory region onto the
device's firmware; that is why kvm_mmu_zap_all can be called in this
scenario.

The reasons why it hurts performance are:
1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
    held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not
    efficient, all the other vcpus have to wait a long time to do I/O.

2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
    requests from the other vcpus.

Is that enough?



Marcelo Tosatti May 29, 2013, 11:11 a.m. UTC | #5
On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> > On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
> >>>> caused by reacquiring the lock
> >>>>
> >>>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
> >>>> so update the comments accordingly
> >>>>
> >>>> [ It improves kernel building by 0.6% ~ 1% ]
> >>>
> >>> Can you please describe the overhead in more detail? Under what scenario
> >>> is kernel building improved?
> >>
> >> Yes.
> >>
> >> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
> >> PCI ROM once per second.
> >>
> >> [
> >>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >> ]
> > 
> > I can't see why this reflects a real-world scenario (or a real-world
> > scenario with the same characteristics regarding kvm_mmu_zap_all vs.
> > faults).
> > 
> > The point is, it would be good to understand why this change improves
> > performance. What are the cases where we break out of kvm_mmu_zap_all
> > due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?
> 
> When the guest reads the ROM, QEMU remaps the memory region onto the
> device's firmware; that is why kvm_mmu_zap_all can be called in this
> scenario.
> 
> The reasons why it hurts performance are:
> 1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
>     held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not
>     efficient, all the other vcpus have to wait a long time to do I/O.
> 
> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
>     requests from the other vcpus.
> 
> Is that enough?

That is no problem. The problem is why you chose "10" as the minimum number
of pages to zap before considering a reschedule. I would expect the need to
reschedule to be rare enough that one kvm_mmu_zap_all instance (between
schedule in and schedule out) would be able to release no less than a
thousand pages.

So I'd like to understand better what the driver for this change is (this
was the original question).

Xiao Guangrong May 29, 2013, 1:09 p.m. UTC | #6
On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
>>>>>> caused by reacquiring the lock
>>>>>>
>>>>>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
>>>>>> so update the comments accordingly
>>>>>>
>>>>>> [ It improves kernel building by 0.6% ~ 1% ]
>>>>>
>>>>> Can you please describe the overhead in more detail? Under what scenario
>>>>> is kernel building improved?
>>>>
>>>> Yes.
>>>>
>>>> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
>>>> PCI ROM once per second.
>>>>
>>>> [
>>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>> ]
>>>
>>> I can't see why this reflects a real-world scenario (or a real-world
>>> scenario with the same characteristics regarding kvm_mmu_zap_all vs.
>>> faults).
>>>
>>> The point is, it would be good to understand why this change improves
>>> performance. What are the cases where we break out of kvm_mmu_zap_all
>>> due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?
>>
>> When the guest reads the ROM, QEMU remaps the memory region onto the
>> device's firmware; that is why kvm_mmu_zap_all can be called in this
>> scenario.
>>
>> The reasons why it hurts performance are:
>> 1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
>>     held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not
>>     efficient, all the other vcpus have to wait a long time to do I/O.
>>
>> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
>>     requests from the other vcpus.
>>
>> Is that enough?
> 
> That is no problem. The problem is why you chose "10" as the minimum number
> of pages to zap before considering a reschedule. I would expect the need to

Well, my description above explained why batch-zapping is needed: we do not
want the vcpu to spend a long time zapping all the pages, because that hurts
the other running vcpus.

As for why the batch size is "10"... I cannot really answer that; I just
guessed that '10' keeps a vcpu from spending too long in zap_all_pages
without starving the mmu-lock. "10" is a speculative value and I am not sure
it is the best one, but at least I think it works.
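
[ Since the value is admittedly speculative, one could imagine making the
batch size tunable rather than hard-coded; a hypothetical sketch, not part
of this series:

	/* hypothetical tunable batch size for kvm_zap_obsolete_pages() */
	static unsigned int batch_zap_pages = 10;
	module_param(batch_zap_pages, uint, 0644);

The patch below simply hard-codes BATCH_ZAP_PAGES as 10. ]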

> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> schedule in and schedule out) would be able to release no less than a
> thousand pages.

Unfortunately, no.

This is the information from my reply to Gleb, in the mail where he raised
the question of why a "collapse tlb flush" is needed:

======
It seems not.
Since we have reloaded the mmu before zapping the obsolete pages, the
mmu-lock is easily contended. I did this simple trace:

+       int num = 0;
 restart:
        list_for_each_entry_safe_reverse(sp, node,
              &kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
                if (batch >= BATCH_ZAP_PAGES &&
                      cond_resched_lock(&kvm->mmu_lock)) {
                        batch = 0;
+                       num++;
                        goto restart;
                }

@@ -4277,6 +4278,7 @@ restart:
         * may use the pages.
         */
        kvm_mmu_commit_zap_page(kvm, &invalid_list);
+       printk("lock-break: %d.\n", num);
 }

I read the PCI ROM while doing a kernel build in the guest, which has 1G of
memory and 4 vcpus with EPT enabled; this is a normal workload and a normal
configuration.

# dmesg
[ 2338.759099] lock-break: 8.
[ 2339.732442] lock-break: 5.
[ 2340.904446] lock-break: 3.
[ 2342.513514] lock-break: 3.
[ 2343.452229] lock-break: 3.
[ 2344.981599] lock-break: 4.

Basically, we need to break many times.
======

You can see that we had to break at least 3 times to zap all the pages,
even though we zapped 10 pages per batch. Obviously, without batch-zapping
it would need to break even more often.




Marcelo Tosatti May 29, 2013, 1:21 p.m. UTC | #7
On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> > On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> >> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> >>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>>>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
> >>>>>> caused by reacquiring the lock
> >>>>>>
> >>>>>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
> >>>>>> so update the comments accordingly
> >>>>>>
> >>>>>> [ It improves kernel building by 0.6% ~ 1% ]
> >>>>>
> >>>>> Can you please describe the overhead in more detail? Under what scenario
> >>>>> is kernel building improved?
> >>>>
> >>>> Yes.
> >>>>
> >>>> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
> >>>> PCI ROM once per second.
> >>>>
> >>>> [
> >>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >>>> ]
> >>>
> >>> I can't see why this reflects a real-world scenario (or a real-world
> >>> scenario with the same characteristics regarding kvm_mmu_zap_all vs.
> >>> faults).
> >>>
> >>> The point is, it would be good to understand why this change improves
> >>> performance. What are the cases where we break out of kvm_mmu_zap_all
> >>> due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?
> >>
> >> When the guest reads the ROM, QEMU remaps the memory region onto the
> >> device's firmware; that is why kvm_mmu_zap_all can be called in this
> >> scenario.
> >>
> >> The reasons why it hurts performance are:
> >> 1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
> >>     held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not
> >>     efficient, all the other vcpus have to wait a long time to do I/O.
> >>
> >> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
> >>     requests from the other vcpus.
> >>
> >> Is that enough?
> > 
> > That is no problem. The problem is why you chose "10" as the minimum number
> > of pages to zap before considering a reschedule. I would expect the need to
> 
> Well, my description above explained why batch-zapping is needed: we do not
> want the vcpu to spend a long time zapping all the pages, because that hurts
> the other running vcpus.
> 
> As for why the batch size is "10"... I cannot really answer that; I just
> guessed that '10' keeps a vcpu from spending too long in zap_all_pages
> without starving the mmu-lock. "10" is a speculative value and I am not sure
> it is the best one, but at least I think it works.
> 
> > reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> > schedule in and schedule out) would be able to release no less than a
> > thousand pages.
> 
> Unfortunately, no.
> 
> This is the information from my reply to Gleb, in the mail where he raised
> the question of why a "collapse tlb flush" is needed:
> 
> ======
> It seems not.
> Since we have reloaded the mmu before zapping the obsolete pages, the
> mmu-lock is easily contended. I did this simple trace:
> 
> +       int num = 0;
>  restart:
>         list_for_each_entry_safe_reverse(sp, node,
>               &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>                 if (batch >= BATCH_ZAP_PAGES &&
>                       cond_resched_lock(&kvm->mmu_lock)) {
>                         batch = 0;
> +                       num++;
>                         goto restart;
>                 }
> 
> @@ -4277,6 +4278,7 @@ restart:
>          * may use the pages.
>          */
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       printk("lock-break: %d.\n", num);
>  }
> 
> I read the PCI ROM while doing a kernel build in the guest, which has 1G of
> memory and 4 vcpus with EPT enabled; this is a normal workload and a normal
> configuration.
> 
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
> 
> Basically, we need to break many times.
> ======
> 
> You can see that we had to break at least 3 times to zap all the pages,
> even though we zapped 10 pages per batch. Obviously, without batch-zapping
> it would need to break even more often.

Yes, but this is not a real scenario, and as far as I know it does not even
describe a real scenario.

Are you sure this minimum-batching-before-considering-reschedule is needed
even after the obsolete-pages optimization?

I fail to see why.

Marcelo Tosatti May 29, 2013, 1:32 p.m. UTC | #8
On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> This is the information from my reply to Gleb, in the mail where he raised
> the question of why a "collapse tlb flush" is needed:
> 
> ======
> It seems not.
> Since we have reloaded the mmu before zapping the obsolete pages, the
> mmu-lock is easily contended. I did this simple trace:
> 
> +       int num = 0;
>  restart:
>         list_for_each_entry_safe_reverse(sp, node,
>               &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>                 if (batch >= BATCH_ZAP_PAGES &&
>                       cond_resched_lock(&kvm->mmu_lock)) {
>                         batch = 0;
> +                       num++;
>                         goto restart;
>                 }
> 
> @@ -4277,6 +4278,7 @@ restart:
>          * may use the pages.
>          */
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       printk("lock-break: %d.\n", num);
>  }
> 
> I read the PCI ROM while doing a kernel build in the guest, which has 1G of
> memory and 4 vcpus with EPT enabled; this is a normal workload and a normal
> configuration.
> 
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
> 
> Basically, we need to break many times.

You should measure kvm_mmu_zap_all latency.

> ======
> 
> You can see that we had to break at least 3 times to zap all the pages,
> even though we zapped 10 pages per batch. Obviously, without batch-zapping
> it would need to break even more often.

Again, breaking should be no problem; what matters is latency. Please
measure kvm_mmu_zap_all latency after all the optimizations to justify
this minimum batching.

Xiao Guangrong May 29, 2013, 2 p.m. UTC | #9
On 05/29/2013 09:21 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
>>> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>>>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>>>> Zap at least 10 pages before releasing mmu-lock, to reduce the overhead
>>>>>>>> caused by reacquiring the lock
>>>>>>>>
>>>>>>>> After the patch, kvm_zap_obsolete_pages can always make forward progress,
>>>>>>>> so update the comments accordingly
>>>>>>>>
>>>>>>>> [ It improves kernel building by 0.6% ~ 1% ]
>>>>>>>
>>>>>>> Can you please describe the overhead in more detail? Under what scenario
>>>>>>> is kernel building improved?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
>>>>>> PCI ROM once per second.
>>>>>>
>>>>>> [
>>>>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>>>> ]
>>>>>
>>>>> I can't see why this reflects a real-world scenario (or a real-world
>>>>> scenario with the same characteristics regarding kvm_mmu_zap_all vs.
>>>>> faults).
>>>>>
>>>>> The point is, it would be good to understand why this change improves
>>>>> performance. What are the cases where we break out of kvm_mmu_zap_all
>>>>> due to (need_resched || spin_needbreak) with fewer than 10 pages zapped?
>>>>
>>>> When the guest reads the ROM, QEMU remaps the memory region onto the
>>>> device's firmware; that is why kvm_mmu_zap_all can be called in this
>>>> scenario.
>>>>
>>>> The reasons why it hurts performance are:
>>>> 1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
>>>>     held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not
>>>>     efficient, all the other vcpus have to wait a long time to do I/O.
>>>>
>>>> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
>>>>     requests from the other vcpus.
>>>>
>>>> Is that enough?
>>>
>>> That is no problem. The problem is why you chose "10" as the minimum number
>>> of pages to zap before considering a reschedule. I would expect the need to
>>
>> Well, my description above explained why batch-zapping is needed: we do not
>> want the vcpu to spend a long time zapping all the pages, because that hurts
>> the other running vcpus.
>>
>> As for why the batch size is "10"... I cannot really answer that; I just
>> guessed that '10' keeps a vcpu from spending too long in zap_all_pages
>> without starving the mmu-lock. "10" is a speculative value and I am not sure
>> it is the best one, but at least I think it works.
>>
>>> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
>>> schedule in and schedule out) would be able to release no less than a
>>> thousand pages.
>>
>> Unfortunately, no.
>>
>> This is the information from my reply to Gleb, in the mail where he raised
>> the question of why a "collapse tlb flush" is needed:
>>
>> ======
>> It seems not.
>> Since we have reloaded the mmu before zapping the obsolete pages, the
>> mmu-lock is easily contended. I did this simple trace:
>>
>> +       int num = 0;
>>  restart:
>>         list_for_each_entry_safe_reverse(sp, node,
>>               &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>>                 if (batch >= BATCH_ZAP_PAGES &&
>>                       cond_resched_lock(&kvm->mmu_lock)) {
>>                         batch = 0;
>> +                       num++;
>>                         goto restart;
>>                 }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>>          * may use the pages.
>>          */
>>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> +       printk("lock-break: %d.\n", num);
>>  }
>>
>> I read the PCI ROM while doing a kernel build in the guest, which has 1G of
>> memory and 4 vcpus with EPT enabled; this is a normal workload and a normal
>> configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
>> ======
>>
>> You can see that we had to break at least 3 times to zap all the pages,
>> even though we zapped 10 pages per batch. Obviously, without batch-zapping
>> it would need to break even more often.
> 
> Yes, but this is not a real scenario, and as far as I know it does not even
> describe a real scenario.

Aha.

Okay, maybe "read rom" is not the common case, but a vcpu can trigger it, or
a guest driver may do so in the future. What happens if a vcpu triggers it?
The worst case is that one vcpu keeps breaking out of mmu-lock because
another vcpu is doing intense memory access, while the rest of the vcpus are
waiting for I/O or for its IPI. That is an easy way to hit a soft lockup.

Even worse, if host memory is really scarce and the host keeps trying to
reclaim QEMU's memory, the lock stays hot all the time and we cannot zap
even one page before being rescheduled.

> 
> Are you sure this minimum-batching-before-considering-reschedule is needed
> even after the obsolete-pages optimization?

Yes, this trace was taken after all the patches in this series were applied.

Xiao Guangrong May 29, 2013, 2:02 p.m. UTC | #10
On 05/29/2013 09:32 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> This is the information from my reply to Gleb, in the mail where he raised
>> the question of why a "collapse tlb flush" is needed:
>>
>> ======
>> It seems not.
>> Since we have reloaded the mmu before zapping the obsolete pages, the
>> mmu-lock is easily contended. I did this simple trace:
>>
>> +       int num = 0;
>>  restart:
>>         list_for_each_entry_safe_reverse(sp, node,
>>               &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>>                 if (batch >= BATCH_ZAP_PAGES &&
>>                       cond_resched_lock(&kvm->mmu_lock)) {
>>                         batch = 0;
>> +                       num++;
>>                         goto restart;
>>                 }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>>          * may use the pages.
>>          */
>>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> +       printk("lock-break: %d.\n", num);
>>  }
>>
>> I read the PCI ROM while doing a kernel build in the guest, which has 1G of
>> memory and 4 vcpus with EPT enabled; this is a normal workload and a normal
>> configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
> 
> You should measure kvm_mmu_zap_all latency.
> 
>> ======
>>
>> You can see that we had to break at least 3 times to zap all the pages,
>> even though we zapped 10 pages per batch. Obviously, without batch-zapping
>> it would need to break even more often.
> 
> Again, breaking should be no problem; what matters is latency. Please
> measure kvm_mmu_zap_all latency after all the optimizations to justify
> this minimum batching.

Okay, okay. I will benchmark the latency.
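
[ A minimal sketch of such a measurement, hypothetical and not from this
thread, would be to wrap the call with ktime:

	/* time one kvm_mmu_zap_all() invocation; the function takes and
	 * releases kvm->mmu_lock internally */
	ktime_t start = ktime_get();

	kvm_mmu_zap_all(kvm);
	pr_info("kvm_mmu_zap_all latency: %lld ns\n",
		ktime_to_ns(ktime_sub(ktime_get(), start)));

A tracepoint or ftrace's function_graph tracer would give the same numbers
without patching the source. ]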



Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f302540..688e755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4203,14 +4203,18 @@  restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+#define BATCH_ZAP_PAGES	10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
 	LIST_HEAD(invalid_list);
+	int batch = 0;
 
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
+		int ret;
+
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4219,28 +4223,6 @@  restart:
 			break;
 
 		/*
-		 * Do not repeatedly zap a root page to avoid unnecessary
-		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
-		 * progress:
-		 *    vcpu 0                        vcpu 1
-		 *                         call vcpu_enter_guest():
-		 *                            1): handle KVM_REQ_MMU_RELOAD
-		 *                                and require mmu-lock to
-		 *                                load mmu
-		 * repeat:
-		 *    1): zap root page and
-		 *        send KVM_REQ_MMU_RELOAD
-		 *
-		 *    2): if (cond_resched_lock(mmu-lock))
-		 *
-		 *                            2): hold mmu-lock and load mmu
-		 *
-		 *                            3): see KVM_REQ_MMU_RELOAD bit
-		 *                                on vcpu->requests is set
-		 *                                then return 1 to call
-		 *                                vcpu_enter_guest() again.
-		 *            goto repeat;
-		 *
 		 * Since we are reversely walking the list and the invalid
 		 * list will be moved to the head, skipping the invalid pages
 		 * can help us avoid walking the list infinitely.
@@ -4248,13 +4230,18 @@  restart:
 		if (sp->role.invalid)
 			continue;
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (batch >= BATCH_ZAP_PAGES &&
+		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+			batch = 0;
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
 			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		batch += ret;
+
+		if (ret)
 			goto restart;
 	}