[RFC,0/4] Restore change_pte optimization to its former glory

Message ID 20190131183706.20980-1-jglisse@redhat.com (mailing list archive)

Jerome Glisse Jan. 31, 2019, 6:37 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

This patchset is on top of my patchset that adds context information to
mmu notifiers [1]; you can find a branch with everything at [2]. I have
not tested it, but I wanted to get the discussion started. I believe it
is correct, but I am not sure what kind of KVM test I can run to
exercise this.

The idea is that since KVM now invalidates the secondary MMUs from
within the invalidate_range callback, the change_pte() optimization is
lost. With this patchset, every time core mm uses set_pte_at_notify(),
and thus change_pte() gets called, we can ignore the invalidate_range
callback altogether and rely on the change_pte callback alone.

Note that this is only valid when either going from a read-write pte
to a read-only pte with the same pfn, or from a read-only pte to a
read-write pte with a different pfn. The other side of the story is
that the primary MMU pte is cleared with ptep_clear_flush_notify
before the call to change_pte.

Also, with the mmu notifier context information [1] you can further
optimize other cases, like mprotect or write protection when forking.
You can use the new context information to infer that the invalidation
is for a read-only update of the primary MMU and update the secondary
MMU accordingly, instead of clearing it and forcing a fault even for
read access. I do not know whether that is an optimization that would
bear any fruit for KVM. It does help for device drivers. You can also
optimize the soft dirty update.

Cheers,
Jérôme


[1] https://lore.kernel.org/linux-fsdevel/20190123222315.1122-1-jglisse@redhat.com/T/#m69e8f589240e18acbf196a1c8aa1d6fc97bd3565
[2] https://cgit.freedesktop.org/~glisse/linux/log/?h=kvm-restore-change_pte

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: kvm@vger.kernel.org

Jérôme Glisse (4):
  uprobes: use set_pte_at() not set_pte_at_notify()
  mm/mmu_notifier: use unsigned for event field in range struct
  mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where
    appropriate
  kvm/mmu_notifier: re-enable the change_pte() optimization.

 include/linux/mmu_notifier.h | 21 +++++++++++++++++++--
 kernel/events/uprobes.c      |  3 +--
 mm/ksm.c                     |  6 ++++--
 mm/memory.c                  |  3 ++-
 virt/kvm/kvm_main.c          | 16 ++++++++++++++++
 5 files changed, 42 insertions(+), 7 deletions(-)

Comments

Andrea Arcangeli Feb. 1, 2019, 11:57 p.m. UTC | #1
Hello everyone,

On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset is on top of my patchset to add context information to
> mmu notifier [1] you can find a branch with everything [2]. I have not
> tested it but i wanted to get the discussion started. I believe it is
> correct but i am not sure what kind of kvm test i can run to exercise
> this.
> 
> The idea is that since kvm will invalidate the secondary MMUs within
> invalidate_range callback then the change_pte() optimization is lost.
> With this patchset everytime core mm is using set_pte_at_notify() and
> thus change_pte() get calls then we can ignore the invalidate_range
> callback altogether and only rely on change_pte callback.
> 
> Note that this is only valid when either going from a read and write
> pte to a read only pte with same pfn, or from a read only pte to a
> read and write pte with different pfn. The other side of the story
> is that the primary mmu pte is clear with ptep_clear_flush_notify
> before the call to change_pte.

If it's cleared with ptep_clear_flush_notify, change_pte still won't
work. The above text needs updating to say
"ptep_clear_flush": set_pte_at_notify is all about having only
ptep_clear_flush before it; otherwise it's the same as having a range
invalidate preceding it.

With regard to the code, wp_page_copy() needs
s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.

change_pte relies on the ptep_clear_flush preceding the
set_pte_at_notify that will make sure if the secondary MMU mapping
randomly disappears between ptep_clear_flush and set_pte_at_notify,
gup_fast will wait and block on the PT lock until after
set_pte_at_notify is completed before trying to re-establish a
secondary MMU mapping.

So then we only have to worry about what happens because we left the
secondary MMU mapping potentially intact despite having flushed the
primary MMU mapping with ptep_clear_flush (as opposed to
ptep_clear_flush_notify, which would tear down the secondary MMU
mapping too).

In your wording above, at least the "with a different pfn" is
superfluous. I think it's ok if the protection changes from read-only
to read-write and the pfn remains the same, like when we take over a
page because it's not shared anymore (the fork child quit).

It's also ok to change pfn if the mapping is read-only and remains
read-only, this is what KSM does in replace_page.

The read-write to read-only case must not change pfn to avoid losing
coherency from the secondary MMU point of view. This isn't so much
about change_pte itself, but the fact that the page-copy generally
happens well before the pte mangling starts. This case never presents
itself in the code because KSM is first write protecting the page and
only later merging it, regardless of change_pte or not.

The important thing is that the secondary MMU must be updated first
(unlike the invalidates) to be sure the secondary MMU already points
to the new page when the pfn changes and the protection changes from
read-only to read-write (COW fault). The primary MMU cannot read/write
to the page anyway while we update the secondary MMU because we did
ptep_clear_flush() before calling set_pte_at_notify(). So this
ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
whenever the CPU can access the memory, the access is synchronous
with the secondary MMUs because they've all been updated already.

If (in set_pte_at_notify) we were to call change_pte() after
set_pte_at() what would happen is that the CPU could write to the page
through a TLB fill without page fault while the secondary MMUs still
read the old memory in the old readonly page.

Thanks,
Andrea
Andrea Arcangeli Feb. 2, 2019, 12:14 a.m. UTC | #2
On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
> 
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.

Oops, the above two statements were incorrect, because
ptep_clear_flush_notify doesn't interfere with change_pte: it only
invalidates secondary MMU mappings on those secondary MMUs that share
the same pagetables with the primary MMU, and those in turn won't ever
implement a change_pte method.
Jerome Glisse Feb. 11, 2019, 7:09 p.m. UTC | #3
On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This patchset is on top of my patchset to add context information to
> > mmu notifier [1] you can find a branch with everything [2]. I have not
> > tested it but i wanted to get the discussion started. I believe it is
> > correct but i am not sure what kind of kvm test i can run to exercise
> > this.
> > 
> > The idea is that since kvm will invalidate the secondary MMUs within
> > invalidate_range callback then the change_pte() optimization is lost.
> > With this patchset everytime core mm is using set_pte_at_notify() and
> > thus change_pte() get calls then we can ignore the invalidate_range
> > callback altogether and only rely on change_pte callback.
> > 
> > Note that this is only valid when either going from a read and write
> > pte to a read only pte with same pfn, or from a read only pte to a
> > read and write pte with different pfn. The other side of the story
> > is that the primary mmu pte is clear with ptep_clear_flush_notify
> > before the call to change_pte.
> 
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
> 
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.
> 
> change_pte relies on the ptep_clear_flush preceding the
> set_pte_at_notify that will make sure if the secondary MMU mapping
> randomly disappears between ptep_clear_flush and set_pte_at_notify,
> gup_fast will wait and block on the PT lock until after
> set_pte_at_notify is completed before trying to re-establish a
> secondary MMU mapping.
> 
> So then we've only to worry about what happens because we left the
> secondary MMU mapping potentially intact despite we flushed the
> primary MMU mapping with ptep_clear_flush (as opposed to
> ptep_clear_flush_notify which would teardown the secondary MMU mapping
> too).

So all the above is moot since as you pointed out in the other email
ptep_clear_flush_notify does not invalidate kvm secondary mmu hence.


> 
> In you wording above at least the "with a different pfn" is
> superflous. I think it's ok if the protection changes from read-only
> to read-write and the pfn remains the same. Like when we takeover a
> page because it's not shared anymore (fork child quit).
> 
> It's also ok to change pfn if the mapping is read-only and remains
> read-only, this is what KSM does in replace_page.

Yes, I thought this was obvious. I will reword and probably just list
every case that is fine.

> 
> The read-write to read-only case must not change pfn to avoid losing
> coherency from the secondary MMU point of view. This isn't so much
> about change_pte itself, but the fact that the page-copy generally
> happens well before the pte mangling starts. This case never presents
> itself in the code because KSM is first write protecting the page and
> only later merging it, regardless of change_pte or not.
> 
> The important thing is that the secondary MMU must be updated first
> (unlike the invalidates) to be sure the secondary MMU already points
> to the new page when the pfn changes and the protection changes from
> read-only to read-write (COW fault). The primary MMU cannot read/write
> to the page anyway while we update the secondary MMU because we did
> ptep_clear_flush() before calling set_pte_at_notify(). So this
> ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
> whenever the CPU can access the memory, the access is synchronous
> with the secondary MMUs because they've all been updated already.
> 
> If (in set_pte_at_notify) we were to call change_pte() after
> set_pte_at() what would happen is that the CPU could write to the page
> through a TLB fill without page fault while the secondary MMUs still
> read the old memory in the old readonly page.

Yeah. By the way, do you have any good workload for me to test this? I
was thinking of running a few identical VMs and having KSM work on
them. Is there some way to trigger KVM to fork? As the other case is
breaking COW after fork.

Cheers,
Jérôme
Andrea Arcangeli Feb. 11, 2019, 8:02 p.m. UTC | #4
On Mon, Feb 11, 2019 at 02:09:31PM -0500, Jerome Glisse wrote:
> Yeah, between do you have any good workload for me to test this ? I
> was thinking of running few same VM and having KSM work on them. Is
> there some way to trigger KVM to fork ? As the other case is breaking
> COW after fork.

KVM can fork on guest pci-hotplug events or network init to run host
scripts and re-init the signals before doing the exec, but it won't
move the needle because all guest memory registered in the MMU
notifier is set as MADV_DONTFORK... so fork() is a noop unless qemu is
also modified not to call MADV_DONTFORK.

Calling if (!fork()) exit(0) from a timer at regular intervals during
the qemu runtime, after turning off MADV_DONTFORK in qemu, would allow
exercising fork against the KVM MMU notifier methods.

The optimized change_pte code in copy-on-write code is the same
post-fork or post-KSM merge and fork() itself doesn't use change_pte
while KSM does, so with regard to change_pte it should already provide
a good test coverage to test with only KSM without fork(). It'll cover
the read-write -> readonly transition with same PFN
(write_protect_page), the read-only to read-only changing PFN
(replace_page) as well as the readonly -> read-write transition
changing PFN (wp_page_copy) all three optimized with change_pte. Fork
would not leverage change_pte for the first two cases.
Jerome Glisse Feb. 18, 2019, 4:04 p.m. UTC | #5
On Mon, Feb 11, 2019 at 03:02:00PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 11, 2019 at 02:09:31PM -0500, Jerome Glisse wrote:
> > Yeah, between do you have any good workload for me to test this ? I
> > was thinking of running few same VM and having KSM work on them. Is
> > there some way to trigger KVM to fork ? As the other case is breaking
> > COW after fork.
> 
> KVM can fork on guest pci-hotplug events or network init to run host
> scripts and re-init the signals before doing the exec, but it won't
> move the needle because all guest memory registered in the MMU
> notifier is set as MADV_DONTFORK... so fork() is a noop unless qemu is
> also modified not to call MADV_DONTFORK.
> 
> Calling if (!fork()) exit(0) from a timer at regular intervals during
> qemu runtime after turning off MADV_DONTFORK in qemu would allow to
> exercise fork against the KVM MMU Notifier methods.
> 
> The optimized change_pte code in copy-on-write code is the same
> post-fork or post-KSM merge and fork() itself doesn't use change_pte
> while KSM does, so with regard to change_pte it should already provide
> a good test coverage to test with only KSM without fork(). It'll cover
> the read-write -> readonly transition with same PFN
> (write_protect_page), the read-only to read-only changing PFN
> (replace_page) as well as the readonly -> read-write transition
> changing PFN (wp_page_copy) all three optimized with change_pte. Fork
> would not leverage change_pte for the first two cases.

So I ran 2 identical VMs side by side (copies of the same COW image)
and built the same kernel tree inside each (that is the only important
workload that exists ;)), but change_pte did not have any impact:

before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}

Above is the time taken by make inside each VM for an allyesconfig
build. npages is the number of shared pages reported on the host at
the end of the build.
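
The mean/stdev rows above can be reproduced from raw per-run numbers
with a few lines of Python; the sample values below are placeholders
for illustration, not the actual measurements from this thread:

```python
# Compute per-field mean and stdev across repeated benchmark runs.
from statistics import mean, stdev

# One dict per run: wall/user/sys seconds plus the host's KSM page count.
# These numbers are made up for the example.
runs = [
    {"real": 1358.3, "user": 16650.9, "sys": 839.2, "npages": 76855},
    {"real": 1357.8, "user": 16685.8, "sys": 839.6, "npages": 76210},
    {"real": 1358.5, "user": 16674.5, "sys": 837.8, "npages": 76225},
]

for stat in (mean, stdev):
    row = {k: stat([r[k] for r in runs]) for k in runs[0]}
    print(stat.__name__, {k: round(v, 3) for k, v in row.items()})
```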

There is no change before and after the patchset that restores
change_pte. I also tried removing the change_pte callback altogether
to see if that had any effect ("without" above), and it did not have
any effect either.

Should we still restore change_pte()? It does not hurt, but it does
not seem to help in any way either. Maybe you have a better benchmark
I could run?

Cheers,
Jérôme
Andrea Arcangeli Feb. 18, 2019, 5:45 p.m. UTC | #6
On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> So i run 2 exact same VMs side by side (copy of same COW image) and
> built the same kernel tree inside each (that is the only important
> workload that exist ;)) but the change_pte did not have any impact:
> 
> before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> 
> Above is time taken by make inside each VM for all yes config. npages
> is the number of page shared reported on the host at the end of the
> build.

Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?

It would also help to remove the checksum check from mm/ksm.c:

-	if (rmap_item->oldchecksum != checksum) {
-		rmap_item->oldchecksum = checksum;
-		return;
-	}

One way or another, /sys/kernel/mm/ksm/pages_shared and/or
pages_sharing need to change significantly to be sure we're exercising
the COW/merging code that uses change_pte. KSM is smart enough to
merge only not frequently changing pages, and with the default KSM
code this probably works too well for a kernel build.
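
For reference, the knobs discussed here live under /sys/kernel/mm/ksm.
An aggressive benchmarking setup along these lines might be (run as
root; the values are the ones mentioned in this thread, plus the
standard `run=1` enable switch):

```shell
# Aggressive KSM settings for benchmarking (root required).
echo 1     > /sys/kernel/mm/ksm/run               # start the ksmd scanner
echo 0     > /sys/kernel/mm/ksm/sleep_millisecs   # no sleep between scan batches
echo 10000 > /sys/kernel/mm/ksm/pages_to_scan     # pages scanned per batch

# Watch merge progress while the workload runs:
watch -n1 'grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing'
```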

> Should we still restore change_pte() ? It does not hurt, but it does
> not seems to help in anyway. Maybe you have a better benchmark i could
> run ?

We could also try a microbenchmark based on
ltp/testcases/kernel/mem/ksm/ksm02.c that already should trigger a
merge flood and a COW flood during its internal processing.

Thanks,
Andrea
Jerome Glisse Feb. 18, 2019, 6:20 p.m. UTC | #7
On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?

No, but I have increased pages_to_scan to 10000, and during the kernel
build I see the number of shared pages increase steadily, so it is
definitely merging things.

> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}

Will try with that.

> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.
> 
> > Should we still restore change_pte() ? It does not hurt, but it does
> > not seems to help in anyway. Maybe you have a better benchmark i could
> > run ?
> 
> We could also try a microbenchmark based on
> ltp/testcases/kernel/mem/ksm/ksm02.c that already should trigger a
> merge flood and a COW flood during its internal processing.

Will try that.

Cheers,
Jérôme
Peter Xu Feb. 19, 2019, 2:37 a.m. UTC | #8
On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}
> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.

Would it also make sense to track how many pages are really affected
by change_pte (say, in kvm_set_pte_rmapp, count the available SPTEs
that are correctly rebuilt)?  I'm thinking that even if many pages are
merged by KSM, it's still possible that these pages are not actively
shadowed by the KVM MMU, while change_pte should only affect actively
shadowed SPTEs.  In other words, IMHO we might not be able to observe
obvious performance differences if the pages we are accessing are not
the ones merged by KSM.  In our case (building the kernel), IIUC the
most likely shared pages are system image pages; however, when
building the kernel I wonder whether those pages will be frequently
accessed, and whether this could explain the similar performance
numbers.

Thanks,
Jerome Glisse Feb. 19, 2019, 2:43 a.m. UTC | #9
On Tue, Feb 19, 2019 at 10:37:01AM +0800, Peter Xu wrote:
> On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> > On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > > So i run 2 exact same VMs side by side (copy of same COW image) and
> > > built the same kernel tree inside each (that is the only important
> > > workload that exist ;)) but the change_pte did not have any impact:
> > > 
> > > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > > 
> > > Above is time taken by make inside each VM for all yes config. npages
> > > is the number of page shared reported on the host at the end of the
> > > build.
> > 
> > Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> > 
> > It would also help to remove the checksum check from mm/ksm.c:
> > 
> > -	if (rmap_item->oldchecksum != checksum) {
> > -		rmap_item->oldchecksum = checksum;
> > -		return;
> > -	}
> > 
> > One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> > pages_sharing need to change significantly to be sure we're exercising
> > the COW/merging code that uses change_pte. KSM is smart enough to
> > merge only not frequently changing pages, and with the default KSM
> > code this probably works too well for a kernel build.
> 
> Would it also make sense to track how many pages are really affected
> by change_pte (say, in kvm_set_pte_rmapp, count avaliable SPTEs that
> are correctly rebuilt)?  I'm thinking even if many pages are merged by
> KSM it's still possible that these pages are not actively shadowed by
> KVM MMU, meanwhile change_pte should only affect actively shadowed
> SPTEs.  In other words, IMHO we might not be able to observe obvious
> performance differeneces if the pages we are accessing are not merged
> by KSM.  In our case (building the kernel), IIUC the mostly possible
> shared pages are system image pages, however when building the kernel
> I'm thinking whether these pages will be frequently accesses, and
> whether this could lead to similar performance numbers.

I checked that: if no KVM is running, KSM never merges anything (after
bumping KSM pages_to_scan to 10000 and sleep to 0). It starts merging
once I start KVM. Then I wait a bit for KSM to stabilize (ie. to merge
the stock KVM pages). It is only once the KSM count is somewhat stable
that I run the test, and I check that the KSM count goes up
significantly while the test is running.

KSM will definitely go through the change_pte path for KVM, so I am
definitely testing the change_pte path.

I have been running the micro benchmark, and on that I do see a perf
improvement. I will report shortly once I am done gathering enough
data.

Cheers,
Jérôme
Jerome Glisse Feb. 19, 2019, 3:33 a.m. UTC | #10
On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}
> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.
> 
> > Should we still restore change_pte() ? It does not hurt, but it does
> > not seems to help in anyway. Maybe you have a better benchmark i could
> > run ?
> 
> We could also try a microbenchmark based on
> ltp/testcases/kernel/mem/ksm/ksm02.c that already should trigger a
> merge flood and a COW flood during its internal processing.

So using that, with the checksum test removed, there is an
improvement of roughly 7%~8% on time spent in the kernel:

before  mean  {real: 675.460632, user: 857.771423, sys: 215.929657, npages: 4773.066895}
before  stdev {real:  37.035435, user:   4.395942, sys:   3.976172, npages:  675.352783}
after   mean  {real: 672.515503, user: 855.817322, sys: 200.902710, npages: 4899.000000}
after   stdev {real:  37.340954, user:   4.051633, sys:   3.894153, npages:  742.413452}

I am guessing that for a kernel build this gets lost in the noise and
that the KSM changes do not have that much of an impact. So I will
repost the mmu notifier changes shortly in the hope of getting them
into 5.1, and I will post the KVM part separately shortly thereafter.

If there is any more testing you wish me to do, let me know.

Cheers,
Jérôme