[v7,04/11] KVM: MMU: zap pages in batch

Message ID	1369252560-11611-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@vger.kernel.org> Gateway: Authorized Use Only! Violators will be prosecuted for <kvm@vger.kernel.org> from <xiaoguangrong@linux.vnet.ibm.com>; Thu, 23 May 2013 01:22:19 +0530 Gateway: Authorized Use Only! Violators will be prosecuted; Thu, 23 May 2013 01:22:17 +0530 From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> To: gleb@redhat.com Cc: avi.kivity@gmail.com, mtosatti@redhat.com, pbonzini@redhat.com, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Subject: [PATCH v7 04/11] KVM: MMU: zap pages in batch Date: Thu, 23 May 2013 03:55:53 +0800 Message-Id: <1369252560-11611-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com> In-Reply-To: <1369252560-11611-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com> References: <1369252560-11611-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Xiao Guangrong May 22, 2013, 7:55 p.m. UTC

Zap at lease 10 pages before releasing mmu-lock to reduce the overload
caused by requiring lock

After the patch, kvm_zap_obsolete_pages can forward progress anyway,
so update the comments

[ It improves kernel building 0.6% ~ 1% ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   35 +++++++++++------------------------
 1 files changed, 11 insertions(+), 24 deletions(-)

Marcelo Tosatti May 24, 2013, 8:34 p.m. UTC | #1

On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> caused by requiring lock
> 
> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> so update the comments
> 
> [ It improves kernel building 0.6% ~ 1% ]

Can you please describe the overload in more detail? Under what scenario
is kernel building improved?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong May 27, 2013, 2:20 a.m. UTC | #2

On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>> caused by requiring lock
>>
>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>> so update the comments
>>
>> [ It improves kernel building 0.6% ~ 1% ]
> 
> Can you please describe the overload in more detail? Under what scenario
> is kernel building improved?

Yes.

The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
every one second.

[
   echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
   cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
]

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti May 28, 2013, 12:18 a.m. UTC | #3

On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> > On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >> caused by requiring lock
> >>
> >> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >> so update the comments
> >>
> >> [ It improves kernel building 0.6% ~ 1% ]
> > 
> > Can you please describe the overload in more detail? Under what scenario
> > is kernel building improved?
> 
> Yes.
> 
> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> every one second.
> 
> [
>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> ]

Can't see why it reflects real world scenario (or a real world
scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?

Point is, it would be good to understand why this change 
is improving performance? What are these cases where breaking out of
kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
< 10 ?



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong May 28, 2013, 3:02 p.m. UTC | #4

On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>> caused by requiring lock
>>>>
>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>> so update the comments
>>>>
>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>
>>> Can you please describe the overload in more detail? Under what scenario
>>> is kernel building improved?
>>
>> Yes.
>>
>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>> every one second.
>>
>> [
>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>> ]
> 
> Can't see why it reflects real world scenario (or a real world
> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> 
> Point is, it would be good to understand why this change 
> is improving performance? What are these cases where breaking out of
> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> < 10 ?

When guest read ROM, qemu will set the memory to map the device's firmware,
that is why kvm_mmu_zap_all can be called in the scenario.

The reasons why it heart the performance are:
1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
    when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
    other vcpus need wait a long time to do I/O.

2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
    request from other vcpus.

Is it enough?



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti May 29, 2013, 11:11 a.m. UTC | #5

On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> > On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >>>> caused by requiring lock
> >>>>
> >>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >>>> so update the comments
> >>>>
> >>>> [ It improves kernel building 0.6% ~ 1% ]
> >>>
> >>> Can you please describe the overload in more detail? Under what scenario
> >>> is kernel building improved?
> >>
> >> Yes.
> >>
> >> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> >> every one second.
> >>
> >> [
> >>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >> ]
> > 
> > Can't see why it reflects real world scenario (or a real world
> > scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> > 
> > Point is, it would be good to understand why this change 
> > is improving performance? What are these cases where breaking out of
> > kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> > < 10 ?
> 
> When guest read ROM, qemu will set the memory to map the device's firmware,
> that is why kvm_mmu_zap_all can be called in the scenario.
> 
> The reasons why it heart the performance are:
> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>     when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>     other vcpus need wait a long time to do I/O.
> 
> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>     request from other vcpus.
> 
> Is it enough?

That is no problem. The problem is why you chose "10" as the minimum number of
pages to zap before considering reschedule. I would expect the need to
reschedule to be rare enough that one kvm_mmu_zap_all instance (between
schedule in and schedule out) to be able to release no less than a
thousand pages.

So i'd like to understand better what is the drive for this change (this
was the original question).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong May 29, 2013, 1:09 p.m. UTC | #6

On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>>>> caused by requiring lock
>>>>>>
>>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>>>> so update the comments
>>>>>>
>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>
>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>> is kernel building improved?
>>>>
>>>> Yes.
>>>>
>>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>>>> every one second.
>>>>
>>>> [
>>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>> ]
>>>
>>> Can't see why it reflects real world scenario (or a real world
>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>
>>> Point is, it would be good to understand why this change 
>>> is improving performance? What are these cases where breaking out of
>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>> < 10 ?
>>
>> When guest read ROM, qemu will set the memory to map the device's firmware,
>> that is why kvm_mmu_zap_all can be called in the scenario.
>>
>> The reasons why it heart the performance are:
>> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>>     when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>>     other vcpus need wait a long time to do I/O.
>>
>> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>>     request from other vcpus.
>>
>> Is it enough?
> 
> That is no problem. The problem is why you chose "10" as the minimum number of
> pages to zap before considering reschedule. I would expect the need to

Well, my description above explained why batch-zapping is needed - we do
not want the vcpu spend lots of time to zap all pages because it hurts other
vcpus running.

But, why the batch page number is "10"... I can not answer this, i just guessed
that '10' can make vcpu do not spend long time on zap_all_pages and do
not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
it is the best value but at lease, i think it can work.

> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> schedule in and schedule out) to be able to release no less than a
> thousand pages.

Unfortunately, no.

This information is I replied Gleb in his mail where he raced a question that
why "collapse tlb flush is needed":

======
It seems no.
Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
is easily contended. I did the simple track:

+       int num = 0;
 restart:
        list_for_each_entry_safe_reverse(sp, node,
              &kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
                if (batch >= BATCH_ZAP_PAGES &&
                      cond_resched_lock(&kvm->mmu_lock)) {
                        batch = 0;
+                       num++;
                        goto restart;
                }

@@ -4277,6 +4278,7 @@ restart:
         * may use the pages.
         */
        kvm_mmu_commit_zap_page(kvm, &invalid_list);
+       printk("lock-break: %d.\n", num);
 }

I do read pci rom when doing kernel building in the guest which
has 1G memory and 4vcpus with ept enabled, this is the normal
workload and normal configuration.

# dmesg
[ 2338.759099] lock-break: 8.
[ 2339.732442] lock-break: 5.
[ 2340.904446] lock-break: 3.
[ 2342.513514] lock-break: 3.
[ 2343.452229] lock-break: 3.
[ 2344.981599] lock-break: 4.

Basically, we need to break many times.
======

You can see we should break 3 times to zap all pages even if we have zapoed
10 pages in batch. It is obviously that it need break more times without
batch-zapping.




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti May 29, 2013, 1:21 p.m. UTC | #7

On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> > On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> >> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> >>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >>>>>> caused by requiring lock
> >>>>>>
> >>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >>>>>> so update the comments
> >>>>>>
> >>>>>> [ It improves kernel building 0.6% ~ 1% ]
> >>>>>
> >>>>> Can you please describe the overload in more detail? Under what scenario
> >>>>> is kernel building improved?
> >>>>
> >>>> Yes.
> >>>>
> >>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> >>>> every one second.
> >>>>
> >>>> [
> >>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >>>> ]
> >>>
> >>> Can't see why it reflects real world scenario (or a real world
> >>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> >>>
> >>> Point is, it would be good to understand why this change 
> >>> is improving performance? What are these cases where breaking out of
> >>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> >>> < 10 ?
> >>
> >> When guest read ROM, qemu will set the memory to map the device's firmware,
> >> that is why kvm_mmu_zap_all can be called in the scenario.
> >>
> >> The reasons why it heart the performance are:
> >> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
> >>     when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
> >>     other vcpus need wait a long time to do I/O.
> >>
> >> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
> >>     request from other vcpus.
> >>
> >> Is it enough?
> > 
> > That is no problem. The problem is why you chose "10" as the minimum number of
> > pages to zap before considering reschedule. I would expect the need to
> 
> Well, my description above explained why batch-zapping is needed - we do
> not want the vcpu spend lots of time to zap all pages because it hurts other
> vcpus running.
> 
> But, why the batch page number is "10"... I can not answer this, i just guessed
> that '10' can make vcpu do not spend long time on zap_all_pages and do
> not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
> it is the best value but at lease, i think it can work.
> 
> > reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> > schedule in and schedule out) to be able to release no less than a
> > thousand pages.
> 
> Unfortunately, no.
> 
> This information is I replied Gleb in his mail where he raced a question that
> why "collapse tlb flush is needed":
> 
> ======
> It seems no.
> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did the simple track:
> 
> +       int num = 0;
>  restart:
>         list_for_each_entry_safe_reverse(sp, node,
>               &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>                 if (batch >= BATCH_ZAP_PAGES &&
>                       cond_resched_lock(&kvm->mmu_lock)) {
>                         batch = 0;
> +                       num++;
>                         goto restart;
>                 }
> 
> @@ -4277,6 +4278,7 @@ restart:
>          * may use the pages.
>          */
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       printk("lock-break: %d.\n", num);
>  }
> 
> I do read pci rom when doing kernel building in the guest which
> has 1G memory and 4vcpus with ept enabled, this is the normal
> workload and normal configuration.
> 
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
> 
> Basically, we need to break many times.
> ======
> 
> You can see we should break 3 times to zap all pages even if we have zapoed
> 10 pages in batch. It is obviously that it need break more times without
> batch-zapping.

Yes, but this is not a real scenario, or even describes a real scenario
as far as i know. 

Are you sure this minimum-batching-before-considering-reschedule even
after obsolete pages optimization?

I fail to see why.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti May 29, 2013, 1:32 p.m. UTC | #8

On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> This information is I replied Gleb in his mail where he raced a question that
> why "collapse tlb flush is needed":
> 
> ======
> It seems no.
> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did the simple track:
> 
> +       int num = 0;
>  restart:
>         list_for_each_entry_safe_reverse(sp, node,
>               &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>                 if (batch >= BATCH_ZAP_PAGES &&
>                       cond_resched_lock(&kvm->mmu_lock)) {
>                         batch = 0;
> +                       num++;
>                         goto restart;
>                 }
> 
> @@ -4277,6 +4278,7 @@ restart:
>          * may use the pages.
>          */
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       printk("lock-break: %d.\n", num);
>  }
> 
> I do read pci rom when doing kernel building in the guest which
> has 1G memory and 4vcpus with ept enabled, this is the normal
> workload and normal configuration.
> 
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
> 
> Basically, we need to break many times.

Should measure kvm_mmu_zap_all latency.

> ======
> 
> You can see we should break 3 times to zap all pages even if we have zapoed
> 10 pages in batch. It is obviously that it need break more times without
> batch-zapping.

Again, breaking should be no problem, what matters is latency. Please
measure kvm_mmu_zap_all latency after all optimizations to justify 
this minimum batching.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong May 29, 2013, 2 p.m. UTC | #9

On 05/29/2013 09:21 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
>>> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>>>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>>>>>> caused by requiring lock
>>>>>>>>
>>>>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>>>>>> so update the comments
>>>>>>>>
>>>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>>>
>>>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>>>> is kernel building improved?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>>>>>> every one second.
>>>>>>
>>>>>> [
>>>>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>>>> ]
>>>>>
>>>>> Can't see why it reflects real world scenario (or a real world
>>>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>>>
>>>>> Point is, it would be good to understand why this change 
>>>>> is improving performance? What are these cases where breaking out of
>>>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>>>> < 10 ?
>>>>
>>>> When guest read ROM, qemu will set the memory to map the device's firmware,
>>>> that is why kvm_mmu_zap_all can be called in the scenario.
>>>>
>>>> The reasons why it heart the performance are:
>>>> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>>>>     when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>>>>     other vcpus need wait a long time to do I/O.
>>>>
>>>> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>>>>     request from other vcpus.
>>>>
>>>> Is it enough?
>>>
>>> That is no problem. The problem is why you chose "10" as the minimum number of
>>> pages to zap before considering reschedule. I would expect the need to
>>
>> Well, my description above explained why batch-zapping is needed - we do
>> not want the vcpu spend lots of time to zap all pages because it hurts other
>> vcpus running.
>>
>> But, why the batch page number is "10"... I can not answer this, i just guessed
>> that '10' can make vcpu do not spend long time on zap_all_pages and do
>> not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
>> it is the best value but at lease, i think it can work.
>>
>>> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
>>> schedule in and schedule out) to be able to release no less than a
>>> thousand pages.
>>
>> Unfortunately, no.
>>
>> This information is I replied Gleb in his mail where he raced a question that
>> why "collapse tlb flush is needed":
>>
>> ======
>> It seems no.
>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>> is easily contended. I did the simple track:
>>
>> +       int num = 0;
>>  restart:
>>         list_for_each_entry_safe_reverse(sp, node,
>>               &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>>                 if (batch >= BATCH_ZAP_PAGES &&
>>                       cond_resched_lock(&kvm->mmu_lock)) {
>>                         batch = 0;
>> +                       num++;
>>                         goto restart;
>>                 }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>>          * may use the pages.
>>          */
>>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> +       printk("lock-break: %d.\n", num);
>>  }
>>
>> I do read pci rom when doing kernel building in the guest which
>> has 1G memory and 4vcpus with ept enabled, this is the normal
>> workload and normal configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
>> ======
>>
>> You can see we should break 3 times to zap all pages even if we have zapoed
>> 10 pages in batch. It is obviously that it need break more times without
>> batch-zapping.
> 
> Yes, but this is not a real scenario, or even describes a real scenario
> as far as i know. 

Aha.

Okay, maybe "read rom" is not the common case, but vcpu can trigger it or guest
driver can do in the further. What happen if vcpu trigger it? The worst
case is, one vcpu is always break mmu-lock due to one other vcpus doing intense
memory access and the reset vcpus is waiting IO or its IPI. It is easy soft-lockup.

More worse, if host memory is really less and host is always trying to reclaim qemu's
memory, that cause the lock always be hot and can not zap one page before rescheduled.

> 
> Are you sure this minimum-batching-before-considering-reschedule even
> after obsolete pages optimization?

Yes, this track is after all patches in this series are applied.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Xiao Guangrong May 29, 2013, 2:02 p.m. UTC | #10

On 05/29/2013 09:32 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> This information is I replied Gleb in his mail where he raced a question that
>> why "collapse tlb flush is needed":
>>
>> ======
>> It seems no.
>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>> is easily contended. I did the simple track:
>>
>> +       int num = 0;
>>  restart:
>>         list_for_each_entry_safe_reverse(sp, node,
>>               &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>>                 if (batch >= BATCH_ZAP_PAGES &&
>>                       cond_resched_lock(&kvm->mmu_lock)) {
>>                         batch = 0;
>> +                       num++;
>>                         goto restart;
>>                 }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>>          * may use the pages.
>>          */
>>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> +       printk("lock-break: %d.\n", num);
>>  }
>>
>> I do read pci rom when doing kernel building in the guest which
>> has 1G memory and 4vcpus with ept enabled, this is the normal
>> workload and normal configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
> 
> Should measure kvm_mmu_zap_all latency.
> 
>> ======
>>
>> You can see we should break 3 times to zap all pages even if we have zapoed
>> 10 pages in batch. It is obviously that it need break more times without
>> batch-zapping.
> 
> Again, breaking should be no problem, what matters is latency. Please
> measure kvm_mmu_zap_all latency after all optimizations to justify 
> this minimum batching.

Okay, okay. I will benchmark the latency.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v7,04/11] KVM: MMU: zap pages in batch

Commit Message

Comments

Patch